Closed dyastremsky closed 5 months ago
There is no way to detect this and warn the user during usage correct?
If you are asking about the TRT-LLM case, I don't believe so. I suppose we could check if the output tokens are much greater than the expected output tokens if the user specifies an expected output token count and report it back with the results. However, that seems like overkill and couples GenAi-Perf and TRT-LLM too closely, I think. The user will get their output token count, so they can see for themselves if it matches their expectation.
TRT-LLM may provide the ability to disable echo via a request parameter in the future, at which point we can use that feature in GenAi-Perf to disable it by default.
Is it worthwhile to call this out if we are using TRT-LLM, then? Maybe have GenAI-Perf print a log.info statement?
Thanks for the approval! We spoke offline. If TRT-LLM does not provide this option soon, we can add logging for transparency.
We removed the echoing of the input prompt in the output for the Triton backends in GenAi-Perf. This is no longer a known issue.
Now, vLLM has `exclude_input_in_output` set to true by default. For TRT-LLM, the user must enable or disable `exclude_input_in_output` in their model config. We mention it in the help message for the CLI arg `--output-tokens-mean`.
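For reference, a minimal sketch of how that parameter might look in a TRT-LLM Triton model config (`config.pbtxt`) — the exact placement and surrounding fields depend on your model and backend version, so treat this as illustrative rather than a complete config:

```
# Hypothetical excerpt from a tensorrt_llm model's config.pbtxt.
# Setting exclude_input_in_output to "true" should stop the backend
# from echoing the input prompt tokens in the response.
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
```

With this set, the output token counts reported by GenAI-Perf should reflect only generated tokens rather than prompt plus generation.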