triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Qwen2-14B inference garbled #601

Open · kazyun opened 1 month ago

kazyun commented 1 month ago

System Info

When using Qwen2, running inference on the engine through the run.py script produces normal output. However, when serving the same engine through Triton, some characters in the output are garbled and the output is truncated compared with the script's results. What could be the cause of this issue?

Maybe the config.pbtxt causes the problem.
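If the config is the culprit, two streaming-related settings are worth ruling out first. This is a sketch based on the example configs shipped in all_models/inflight_batcher_llm, not a confirmed fix; verify the exact keys against your version of the backend:

```
# tensorrt_llm/config.pbtxt -- streaming requires the model to run in
# decoupled mode, otherwise streamed responses can misbehave.
model_transaction_policy {
  decoupled: True
}

# tensorrt_llm_bls/config.pbtxt -- detokenize over accumulated tokens so a
# multi-byte character (e.g. Chinese) split across stream chunks is not
# rendered as garbage in each partial chunk.
parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "true"
  }
}
```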

Who can help?

No response

Information

Tasks

Reproduction

  1. Start the Triton server (a typical launch is sketched below) and send a streaming inference request.
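A typical launch, assuming the scripts/launch_triton_server.py helper from this repo; the model-repo path is a placeholder:

```bash
python3 scripts/launch_triton_server.py \
    --world_size 1 \
    --model_repo /path/to/triton_model_repo
```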

Expected behavior

Get the same results as with the run.py script.

actual behavior

As described above: run.py inference with the same engine outputs normally, but through Triton some characters come out garbled and the output is truncated relative to the script's results.

additional notes

None.

kazyun commented 1 month ago

This issue only occurs when using a streaming request:

```python
payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
    "max_tokens": max_tokens,
    "stream": True,
}

response = requests.post(server_url, json=payload, stream=True)
```
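For reference, this is roughly how the stream is consumed on my side. A minimal sketch: the URL, model name, prompt template, and the `data:`/`text_output` framing are assumptions based on Triton's generate extension, not verified against this exact deployment.

```python
import json

import requests

# Placeholders for this sketch; Qwen2 uses the ChatML prompt format.
QWEN_PROMPT_TEMPLATE = (
    "<|im_start|>user\n{input_text}<|im_end|>\n<|im_start|>assistant\n"
)
prompt = "你好"
max_tokens = 128

# Hypothetical URL; Triton's generate extension exposes
# /v2/models/<model>/generate_stream for decoupled (streaming) models.
server_url = "http://localhost:8000/v2/models/ensemble/generate_stream"

payload = {
    "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
    "max_tokens": max_tokens,
    "stream": True,
}

pieces = []
with requests.post(server_url, json=payload, stream=True) as response:
    # The endpoint replies with Server-Sent Events: one "data: {...}" line
    # per generated chunk, each carrying an incremental "text_output".
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data:"):
            continue
        chunk = json.loads(line[len("data:"):].strip())
        pieces.append(chunk.get("text_output", ""))

print("".join(pieces))
```

Note that if multi-byte characters are being split across chunks during server-side detokenization, no client-side parsing will repair them; the fix has to happen before the text is emitted.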
will-jay commented 1 week ago

> This issue only occurs when using a streaming request:
>
> ```python
> payload = {
>     "text_input": QWEN_PROMPT_TEMPLATE.format(input_text=prompt),
>     "max_tokens": max_tokens,
>     "stream": True,
> }
>
> response = requests.post(server_url, json=payload, stream=True)
> ```

Hi, I have the same problem. Is there any solution?