Open dyoshida-continua opened 1 month ago
Another interesting detail: the identical sequences I observe in the concurrent case are the same from run to run, even though I'm sampling the random seed from 1 to 1,000,000.
For example, with the input `<|begin_of_text|>Hello, my name is`, I saw a continuation of "Ahmed, and I am an experienced Software Engineer with proficiency..." in 3/5 responses, and then in 2/5 responses on the next run. I did not observe this prefix at all when making requests serially.
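A minimal sketch of how the "3/5 responses" figure can be tallied, assuming the responses have already been collected as plain strings. The helper name `count_with_prefix` and the sample responses are hypothetical; only the "Ahmed, and I am an experienced Software Engineer" prefix comes from the runs described above.

```python
def count_with_prefix(responses, prefix):
    """Return how many responses begin with `prefix`, ignoring leading whitespace."""
    return sum(1 for text in responses if text.lstrip().startswith(prefix))


# Prefix observed in the concurrent runs described above.
PREFIX = "Ahmed, and I am an experienced Software Engineer"

# Made-up responses for illustration only; real ones come from the server.
sample_responses = [
    " Ahmed, and I am an experienced Software Engineer with proficiency in X",
    " Sarah, and I work as a teacher",
    " Ahmed, and I am an experienced Software Engineer with proficiency in Y",
]

matches = count_with_prefix(sample_responses, PREFIX)
print(f"{matches}/{len(sample_responses)} responses share the prefix")
```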
@byshiue I incorrectly typed your name when opening this issue originally. Can you comment on whether there's a workaround for this? It's currently making batch inference effectively useless.
@dyoshida-continua I applied the solution described in this pull request: NVIDIA/TensorRT-LLM#1742, and it resolved the issue for me.
Thank you for the reply, @chiendb97. Since https://github.com/NVIDIA/TensorRT-LLM/pull/1742 fixes how the random seed is set, it might be related to your issue, @dyoshida-continua. Could you give it a try?
System Info
I've converted Llama 3 using TensorRT-LLM's convert_checkpoint script, and am serving it with the inflight_batcher_llm template. I'm trying to get diverse samples for a fixed input, but I've found that if I make several requests concurrently, several will have identical outputs.
I'm setting `top_p=1, top_k=1024, temperature=1.0, beam_width=1`, and generating a unique random seed for each request. The requests are being made over the gRPC API, and I'm using v0.9.0 of TensorRT-LLM and tensorrtllm_backend.
Who can help?
@byshiue
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)
Reproduction
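A rough sketch of how the per-request parameters might be built, assuming the sampling settings listed under System Info and a seed drawn from 1 to 1,000,000. The function name `build_requests` is hypothetical, and the dictionary keys (`text_input`, `random_seed`, and so on) are assumptions about the deployed model's input names; check them against your own model config. Sending the payloads over gRPC is deliberately left out.

```python
import random


def build_requests(prompt, n, seed_low=1, seed_high=1_000_000, rng=None):
    """Build n request payloads for the same prompt, each with a distinct seed.

    random.sample draws without replacement, so no two requests can
    accidentally share a seed.
    """
    rng = rng or random.Random()
    seeds = rng.sample(range(seed_low, seed_high + 1), n)
    return [
        {
            # Key names are assumptions; match them to your model's inputs.
            "text_input": prompt,
            "top_p": 1.0,
            "top_k": 1024,
            "temperature": 1.0,
            "beam_width": 1,
            "random_seed": seed,
        }
        for seed in seeds
    ]


requests = build_requests("<|begin_of_text|>Hello, my name is", n=5)
for req in requests:
    print(req["random_seed"])
```

Drawing the seeds without replacement rules out one trivial explanation for identical outputs, namely two requests happening to receive the same seed.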
Expected behavior
I expect each request with a different seed to yield a different response.
actual behavior
Several of the 5 responses are consistently identical.
additional notes
I changed the script I'm using for testing to wait for a response before sending another request, and this results in all 5 outputs being distinct, so it seems like the concurrency/inflight batching really is the problem.
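The serial versus concurrent comparison can be sketched roughly as follows. This is not the actual test script: `send` stands in for whatever function issues one request to the server and returns the generated text, and `run_serial`, `run_concurrent`, `duplicate_groups`, and `fake_send` are hypothetical names introduced for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_serial(send, requests):
    """Send requests one at a time, waiting for each response before the next."""
    return [send(req) for req in requests]


def run_concurrent(send, requests):
    """Send all requests at once so the server can batch them in flight."""
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        return list(pool.map(send, requests))


def duplicate_groups(outputs):
    """Return {text: count} for every output that occurs more than once."""
    return {text: n for text, n in Counter(outputs).items() if n > 1}


def fake_send(req):
    """Stand-in for a real client call; echoes the seed so the sketch runs offline."""
    return f"output for seed {req['random_seed']}"


requests = [{"random_seed": seed} for seed in (11, 22, 33, 44, 55)]
for name, runner in (("serial", run_serial), ("concurrent", run_concurrent)):
    outputs = runner(fake_send, requests)
    print(name, "distinct:", len(set(outputs)), "duplicates:", duplicate_groups(outputs))
```

With a real `send` in place of `fake_send`, the behavior described above would show up as an empty duplicates dictionary in the serial run and a non-empty one in the concurrent run.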