triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Accumulation of tokens while beam_width > 1 #513

Open wxsms opened 3 months ago

wxsms commented 3 months ago

System Info

tensorrt_llm==0.11.0.dev2024061800

Who can help?

@ncomly-nvidia

Information

Tasks

Reproduction

Deploy a model with beam_width > 1 and the trtllm backend, then request the BLS model via the generate_stream endpoint with stream: true.
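A minimal sketch of such a request body, assuming the default BLS model name tensorrt_llm_bls and the standard Triton generate_stream route (the endpoint path and field names may differ in your deployment):

```python
import json

# Hypothetical reproduction payload: beam_width > 1 together with
# streaming is what triggers the reported error.
payload = {
    "text_input": "Hello, world",
    "max_tokens": 16,
    "stream": True,    # use the generate_stream endpoint
    "beam_width": 2,   # any value > 1 reproduces the issue
}
body = json.dumps(payload)
print(body)

# Send with, e.g.:
#   curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate_stream \
#        -d "$BODY"
```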

Expected behavior

It should be possible to set accumulate_tokens to True while beam_width > 1.

actual behavior

An error is thrown: Accumulation of tokens is only implemented for beam width = 1

additional notes

Maybe all we need to do is enhance the BLS script?