triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

`random_seed` seems to be ignored (or at least inconsistent) for inflight_batcher_llm #468

Open — dyoshida-continua opened this issue 1 month ago

dyoshida-continua commented 1 month ago

System Info

I've converted Llama 3 using TensorRT-LLM's convert_checkpoint script and am serving it with the inflight_batcher_llm template. I'm trying to get diverse samples for a fixed input, but I've found that if I make several requests concurrently, several of them return identical outputs.

I'm setting `top_p=1`, `top_k=1024`, `temperature=1.0`, `beam_width=1`, and generating a unique random seed for each request. The requests are being made over the gRPC API, and I'm using v0.9.0 of TensorRT-LLM and tensorrtllm_backend.

Who can help?

@byshiue

Reproduction

  1. Serve a model (essentially following this guide, with some settings changed: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/)
  2. Make 5 gRPC requests concurrently, each with a different `random_seed` (see the client sketch below)
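A minimal client sketch along these lines (not the exact script from this report). It assumes the stock inflight_batcher_llm `ensemble` model and its usual tensor names (`text_input`, `max_tokens`, `random_seed`, ...); adjust the model name, tensor names, and shapes for your deployment.

```python
# Sketch: send 5 concurrent gRPC requests with distinct random seeds.
# Model name ("ensemble") and tensor names follow the stock
# inflight_batcher_llm ensemble config and may differ in your deployment.
import threading
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

URL = "localhost:8001"
MODEL = "ensemble"           # assumption: stock ensemble model name
PROMPT = "Hello, my name is"
NUM_REQUESTS = 5

def build_inputs(seed: int):
    """Build the input tensors for one request (batch size 1)."""
    def tensor(name, arr):
        t = grpcclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
        t.set_data_from_numpy(arr)
        return t
    return [
        tensor("text_input", np.array([[PROMPT]], dtype=object)),
        tensor("max_tokens", np.array([[64]], dtype=np.int32)),
        tensor("temperature", np.array([[1.0]], dtype=np.float32)),
        tensor("top_k", np.array([[1024]], dtype=np.int32)),
        tensor("top_p", np.array([[1.0]], dtype=np.float32)),
        tensor("beam_width", np.array([[1]], dtype=np.int32)),
        tensor("random_seed", np.array([[seed]], dtype=np.uint64)),
    ]

results = {}
done = threading.Event()

def make_callback(seed):
    def callback(result, error):
        # Store either the error or the decoded text for this seed.
        results[seed] = error if error is not None else result.as_numpy("text_output")[0]
        if len(results) == NUM_REQUESTS:
            done.set()
    return callback

client = grpcclient.InferenceServerClient(url=URL)
seeds = np.random.randint(1, 1_000_000, size=NUM_REQUESTS)
for seed in seeds:
    # async_infer returns immediately, so all 5 requests are in flight together.
    client.async_infer(MODEL, build_inputs(int(seed)), callback=make_callback(int(seed)))
done.wait(timeout=120)
client.close()

for seed, out in results.items():
    print(seed, out)
```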

Expected behavior

I expect each request with a different seed to yield a different response.

Actual behavior

Several of the 5 responses are consistently identical.

Additional notes

I changed the script I'm using for testing to wait for a response before sending the next request, and with that change all 5 outputs are distinct, so the concurrency/in-flight batching really does seem to be the problem.
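A serial version of the same loop, for comparison. This is only a sketch and reuses the hypothetical `build_inputs` helper and the assumed model/tensor names from the concurrent sketch above.

```python
# Serial variant: wait for each response before sending the next request.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")
for seed in np.random.randint(1, 1_000_000, size=5):
    # client.infer() blocks until the response arrives, so requests never overlap.
    result = client.infer("ensemble", build_inputs(int(seed)))
    print(seed, result.as_numpy("text_output")[0])
client.close()
```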

dyoshida-continua commented 1 month ago

Another interesting detail is that the identical sequences I observe in the concurrent case are the same from run to run, even though I'm sampling the random seed uniformly from 1 to 1,000,000.

For example, with the input `<|begin_of_text|>Hello, my name is`, I saw a continuation of "Ahmed, and I am an experienced Software Engineer with proficiency..." in 3/5 responses on one run, and then in 2/5 responses on the next run. I did not observe this prefix at all when making requests serially.

dyoshida-continua commented 3 weeks ago

@byshiue I incorrectly typed your name when opening this issue originally. Can you comment on whether there's a workaround for this? It's currently making batch inference effectively useless.

chiendb97 commented 3 weeks ago

> @byshiue I incorrectly typed your name when opening this issue originally. Can you comment on whether there's a workaround for this? It's currently making batch inference effectively useless.

@dyoshida-continua I applied the solution described in this pull request: NVIDIA/TensorRT-LLM#1742, and it resolved the issue for me.

byshiue commented 3 weeks ago

Thank you for helping to reply, @chiendb97. Since https://github.com/NVIDIA/TensorRT-LLM/pull/1742 is a fix for the random seed setting, it might be related to your issue, @dyoshida-continua. Could you give it a try?