Closed — Alireza3242 closed this issue 1 week ago.
I solved my problem with #622.
System Info
A100
Who can help?
@kaiyux
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I run this command:
curl -X POST my_ip:8000/v2/models/ensemble/generate_stream -d '{"text_input": "hello", "max_tokens": 250, "temperature": 0.00001, "top_p": 0.95, "top_k": 1, "repetition_penalty": 1.2, "stream": true, "end_id": 128009, "random_seed": 1}'
But the stream is not received smoothly. For example, about 100 tokens arrive at once every 3 seconds, instead of arriving one at a time as they are generated.
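For anyone reproducing this, the generate_stream endpoint returns Server-Sent Events, one "data:" line per chunk. Below is a minimal sketch of parsing such a stream client-side; the sample payload and the "text_output" field name are assumptions based on the ensemble model's usual response shape, not taken from this issue:

```python
import json

def parse_sse_events(payload: str):
    """Collect the JSON body of every `data:` line in an SSE payload."""
    events = []
    for line in payload.splitlines():
        line = line.strip()
        if line.startswith("data:"):
            events.append(json.loads(line[len("data:"):].strip()))
    return events

# Illustrative sample of what generate_stream might emit (field name assumed):
sample = (
    'data: {"text_output": "Hello"}\n'
    '\n'
    'data: {"text_output": " world"}\n'
)
tokens = [e["text_output"] for e in parse_sse_events(sample)]
print("".join(tokens))  # -> Hello world
```

Note also that when testing with curl, passing `-N` / `--no-buffer` disables curl's own output buffering, which can otherwise make a perfectly smooth stream look bursty on the terminal.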
Expected behavior
The stream should be received smoothly, with tokens arriving one by one as they are generated.
Actual behavior
The stream is not received smoothly; tokens arrive in large bursts.
Additional notes
If I remove the dynamic_batching section from https://github.com/triton-inference-server/tensorrtllm_backend/blob/v0.14.0/all_models/inflight_batcher_llm/postprocessing/config.pbtxt, the problem is solved, but generation is still slow.
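For reference, the workaround amounts to deleting the dynamic_batching stanza from the postprocessing model's config.pbtxt, so Triton stops holding postprocessing requests to batch them. This is a hedged sketch of the fragment being removed; the exact fields in your copy of the file may differ:

```
# all_models/inflight_batcher_llm/postprocessing/config.pbtxt
# Deleting this stanza disables server-side dynamic batching for the
# postprocessing model (field shown is illustrative):
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```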