triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Multiple outputs in sampling #499

Open tonylek opened 3 months ago

tonylek commented 3 months ago

Is there an option to get multiple outputs when sampling? For example, for a specific top_p and temperature, I want to return 2 options.

I know I can get multiple options when using beam_search, but I need it in a sampling setup.

byshiue commented 3 months ago

In that case, you could either:

  1. Send the same request several times together, each with a different random seed. You should then get different results. This approach is more straightforward.
  2. Send the request once first, and then send it several more times with different random seeds. If you enable reuse_kv_cache, the context phase of the later requests should be short, so this approach is recommended.
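
A minimal sketch of both options, assuming the model is served through Triton's HTTP generate endpoint and that the deployed model accepts `text_input`, `max_tokens`, `top_p`, `temperature` and `random_seed` as inputs. The URL, the model name `ensemble`, and the helper names `build_payloads`, `post_json` and `sample_n` are illustrative assumptions, not taken from this thread; adjust them to your deployment.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_payloads(prompt, n, top_p, temperature, max_tokens=64, base_seed=0):
    """Return n request bodies that differ only in their random_seed."""
    return [
        {
            "text_input": prompt,
            "max_tokens": max_tokens,
            "top_p": top_p,
            "temperature": temperature,
            "random_seed": base_seed + i,  # assumed input name; check your model config
        }
        for i in range(n)
    ]


def post_json(url, payload):
    """POST one JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def sample_n(url, payloads, send=post_json, warm_first=False):
    """Send all payloads and return the responses in payload order.

    warm_first=False: option 1, all requests are sent together.
    warm_first=True:  option 2, the first request is sent alone so that
                      later requests can benefit from KV cache reuse
                      (only helps if reuse is enabled on the server).
    """
    if not payloads:
        return []
    results = []
    rest = list(payloads)
    if warm_first:
        results.append(send(url, rest.pop(0)))
    if rest:
        with ThreadPoolExecutor(max_workers=len(rest)) as pool:
            results.extend(pool.map(lambda p: send(url, p), rest))
    return results


# Example usage (hypothetical URL and model name):
# url = "http://localhost:8000/v2/models/ensemble/generate"
# outputs = sample_n(url, build_payloads("Hello", n=2, top_p=0.9, temperature=0.7))
```

The `send` parameter is injectable only so the request logic can be exercised without a running server. Identical seeds would produce identical samples, which is why each payload gets its own seed.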