triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

problem with streaming #640

Closed: Alireza3242 closed this issue 1 week ago

Alireza3242 commented 2 weeks ago

System Info

A100

Who can help?

@kaiyux

Reproduction

I run this command:

```bash
curl -X POST my_ip:8000/v2/models/ensemble/generate_stream -d '{"text_input": "hello", "max_tokens": 250, "temperature": 0.00001, "top_p": 0.95, "top_k": 1, "repetition_penalty": 1.2, "stream": true, "end_id": 128009, "random_seed": 1}'
```

But the stream is not received smoothly. For example, about 100 tokens arrive at once every 3 seconds instead of arriving token by token.
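For reference, here is a minimal client-side sketch for measuring the burstiness, assuming the same endpoint and payload as the curl call above, that the server streams Server-Sent Events (`data: <json>` lines), and that the `requests` package is installed. It prints the time gap before each streamed chunk: a smooth stream shows small, regular gaps, while the behavior reported here shows a roughly 3-second pause followed by many chunks at once.

```python
import json
import time

import requests

# Same payload as the curl reproduction above.
payload = {
    "text_input": "hello",
    "max_tokens": 250,
    "temperature": 0.00001,
    "top_p": 0.95,
    "top_k": 1,
    "repetition_penalty": 1.2,
    "stream": True,
    "end_id": 128009,
    "random_seed": 1,
}

with requests.post(
    "http://my_ip:8000/v2/models/ensemble/generate_stream",
    json=payload,
    stream=True,
) as resp:
    resp.raise_for_status()
    last = time.time()
    for line in resp.iter_lines():
        # Skip SSE blank separators and anything that is not a data line.
        if not line or not line.startswith(b"data:"):
            continue
        now = time.time()
        chunk = json.loads(line[len(b"data:"):])
        # "text_output" is the ensemble's output name; adjust if yours differs.
        print(f"+{now - last:.3f}s {chunk.get('text_output', '')!r}")
        last = now
```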

Expected behavior

The stream is received smoothly, with tokens arriving one by one.

Actual behavior

The stream is not received smoothly; tokens arrive in large bursts.

Additional notes

If I remove dynamic_batching in https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.14.0/all_models/inflight_batcher_llm/postprocessing/config.pbtxt, the problem is solved, but generation is still slow.
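For context, the change described above amounts to deleting (or commenting out) the dynamic_batching stanza in the postprocessing model's config. The fragment below is illustrative only; field values are placeholders and the shipped v0.14.0 file may differ.

```
# all_models/inflight_batcher_llm/postprocessing/config.pbtxt
# (illustrative fragment, not a verbatim copy of the shipped file)

name: "postprocessing"
backend: "python"

# Commenting out or deleting this stanza is what "remove dynamic_batching"
# refers to; without it, each postprocessing response is returned as soon as
# it is ready instead of possibly being held to form a larger batch.
# dynamic_batching { }
```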

Alireza3242 commented 2 weeks ago

I solved my problem with #622.