Closed dyastremsky closed 5 months ago
There is no way to detect this and warn the user during usage correct?
If you are asking about the TRT-LLM case, I don't believe so. I suppose we could check if the output tokens are much greater than the expected output tokens if the user specifies an expected output token count and report it back with the results. However, that seems like overkill and couples GenAi-Perf and TRT-LLM too closely, I think. The user will get their output token count, so they can see for themselves if it matches their expectation.
TRT-LLM may provide the ability to disable echo via a request parameter in the future, at which point we can use that feature in GenAi-Perf to disable it by default.
Is it worthwhile to call this out if we are using TRT-LLM, then? Maybe have GenAI-Perf print a log.info statement?
Thanks for the approval! We spoke offline. If TRT-LLM does not provide this option soon, we can add logging for transparency.
We removed the echoing of the input prompt in the output for the Triton backends in GenAi-Perf. This is no longer a known issue.
Now, vLLM has `exclude_input_in_output` set to true by default. For TRT-LLM, the user must enable or disable `exclude_input_in_output` in their model config. We mention it in the help message for the CLI arg `--output-tokens-mean`.
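For reference, a minimal sketch of how that parameter might look in a TRT-LLM Triton model config (`config.pbtxt`) — the exact placement and surrounding fields depend on your model and backend version, so treat this as illustrative rather than a complete config:

```
# Hypothetical excerpt from a tensorrt_llm model's config.pbtxt.
# Setting exclude_input_in_output to "true" should stop the backend
# from echoing the input prompt tokens in the response.
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```

With this set, the output token counts reported by GenAI-Perf should reflect only generated tokens rather than prompt plus generation.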