triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

problem with streaming #640

Closed: Alireza3242 closed this issue 1 week ago

Alireza3242 commented 2 weeks ago

System Info

A100

Who can help?

@kaiyux

Reproduction

I run this command:

```bash
curl -X POST my_ip:8000/v2/models/ensemble/generate_stream -d '{"text_input": "hello", "max_tokens": 250, "temperature": 0.00001, "top_p": 0.95, "top_k": 1, "repetition_penalty": 1.2, "stream": true, "end_id": 128009, "random_seed": 1}'
```

But the stream is not received smoothly. For example, about 100 tokens arrive at once every 3 seconds instead of arriving token by token.
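For reference, here is a minimal client-side sketch for measuring the burstiness, assuming the same endpoint and payload as the curl call above, that the server streams Server-Sent Events (`data: <json>` lines), and that the `requests` package is installed. It prints the time gap before each streamed chunk: a smooth stream shows small, regular gaps, while the behavior reported here shows a roughly 3-second pause followed by many chunks at once.

```python
import json
import time

import requests

# Same payload as the curl reproduction above.
payload = {
    "text_input": "hello",
    "max_tokens": 250,
    "temperature": 0.00001,
    "top_p": 0.95,
    "top_k": 1,
    "repetition_penalty": 1.2,
    "stream": True,
    "end_id": 128009,
    "random_seed": 1,
}

with requests.post(
    "http://my_ip:8000/v2/models/ensemble/generate_stream",
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    last = time.time()
    for line in resp.iter_lines():
        # Skip SSE blank separators and anything that is not a data line.
        if not line or not line.startswith(b"data:"):
            continue
        now = time.time()
        chunk = json.loads(line[len(b"data:"):])
        # "text_output" is the ensemble's output name; adjust if yours differs.
        print(f"+{now - last:.3f}s {chunk.get('text_output', '')!r}")
        last = now
```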

Expected behavior

The stream is received smoothly, with tokens arriving one by one.

Actual behavior

The stream is not received smoothly; tokens arrive in large bursts.

Additional notes

If I remove dynamic_batching in https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.14.0/all_models/inflight_batcher_llm/postprocessing/config.pbtxt, the problem is solved, but generation is still slow.
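For context, the change described above amounts to deleting (or commenting out) the dynamic_batching stanza in the postprocessing model's config. The fragment below is illustrative only; field values are placeholders and the shipped v0.14.0 file may differ.

```
# all_models/inflight_batcher_llm/postprocessing/config.pbtxt
# (illustrative fragment, not a verbatim copy of the shipped file)

name: "postprocessing"
backend: "python"

# Commenting out or deleting this stanza is what "remove dynamic_batching"
# refers to; without it, each postprocessing response is returned as soon as
# it is ready instead of possibly being held to form a larger batch.
# dynamic_batching { }
```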

Alireza3242 commented 2 weeks ago

I solved my problem with #622.