triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Qwen2-14B generate_stream returns some garbled characters #606

Open kazyun opened 6 days ago

kazyun commented 6 days ago

Description: streaming requests return garbled characters in the output.

Triton Information: Triton Server 24.08, running in a container with this image: nvcr.io/nvidia/tritonserver:24.08-trtllm-python-py3

To Reproduce: steps to reproduce the behavior.

This issue only occurs when using a streaming request to v2/models/tensorrt_llm_bls/generate_stream (the ensemble model shows the same behavior). The payload is { "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt), "max_tokens": max_tokens, "stream": True }; a minimal request sketch follows below.
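A minimal sketch of the streaming request, assuming Triton is reachable on localhost:8000; the QWEN_PROMPT_TEMPLATE shown here is an illustrative ChatML-style placeholder, not the template used in the issue.

```python
# Hedged sketch: stream from the generate_stream endpoint and print chunks.
# Assumes Triton listens on localhost:8000; QWEN_PROMPT_TEMPLATE below is a
# placeholder, not the one from the issue.
import json
import requests

QWEN_PROMPT_TEMPLATE = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n{input_text}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

url = "http://localhost:8000/v2/models/tensorrt_llm_bls/generate_stream"
payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text="第五项修炼"),
    "max_tokens": 256,
    "stream": True,
}

# The endpoint replies with Server-Sent Events; each payload line starts with "data: ".
with requests.post(url, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            chunk = json.loads(line[len("data: "):])
            print(chunk.get("text_output", ""), end="", flush=True)
print()
```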

The screenshot below compares the results of non-streaming and streaming requests. [Screenshot: Dingtalk_20240924143637]

Expected behavior: the same result as v2/models/tensorrt_llm_bls/generate.

oandreeva-nv commented 3 days ago

Hi @kazyun , thanks for reporting. Could you please provide the reproducer, if possible?

kazyun commented 1 day ago

> Hi @kazyun, thanks for reporting. Could you please provide the reproducer, if possible?

Just by enabling streaming requests, the responses sometimes contain individual garbled characters. In the screenshot above the input is prompt = "第五项修炼" ("The Fifth Discipline"); in that case the issue can be worked around by setting bad_word=["1."], but there are other cases where garbled characters in the response cannot be resolved this way.

Simply turning on accumulate_tokens resolves the garbled-character issue, so I personally believe it is related to how the individual streamed tokens are decoded.
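For illustration (not from the issue), the sketch below shows why decoding each streamed token on its own can garble multi-byte characters while accumulating tokens before decoding does not. It assumes the Hugging Face transformers library and a publicly available Qwen2 tokenizer; the model ID is a stand-in, since the exact Qwen2-14B checkpoint used in the issue is not stated.

```python
# Illustrative sketch, assuming transformers is installed and the
# Qwen/Qwen2-7B-Instruct tokenizer is used as a stand-in for Qwen2-14B.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
ids = tok.encode("第五项修炼", add_special_tokens=False)

# Decoding each streamed token id independently can split a multi-byte UTF-8
# character, producing replacement characters such as '�'.
print([tok.decode([i]) for i in ids])

# Accumulating the ids before decoding (conceptually what accumulate_tokens
# does for streaming in the BLS model) reconstructs the text cleanly.
print(tok.decode(ids))
```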