Open metterian opened 2 months ago
Hi @metterian, thanks for your feedback. Is the performance data you show based on Triton? If so, could you please try using TRT-LLM on its own (not through Triton), preferably with warmup?
We expect the overhead this feature adds within TRT-LLM itself to be limited and acceptable.
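As a rough sketch (placeholder paths, using the example `run.py` script from the repo), timing with a warmup run could look like the following; for steady-state numbers, looping generation inside one process (e.g., the `benchmarks/python` scripts) is more rigorous:

```bash
# Rough timing sketch; paths are placeholders. The first run serves as a
# warmup; the second is timed. Note the timed run still includes engine
# load, so prefer an in-process benchmark loop for precise latency.
cd examples/llama
python3 ../run.py --engine_dir ./llama_engine_dir \
                  --tokenizer_dir /path/to/llama-hf-model \
                  --max_output_len 1 \
                  --input_text "The capital of France is"
time python3 ../run.py --engine_dir ./llama_engine_dir \
                       --tokenizer_dir /path/to/llama-hf-model \
                       --max_output_len 1 \
                       --input_text "The capital of France is"
```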
System Info
- CPU: Intel(R) Xeon(R) CPU @ 2.20GHz
- Architecture: x86_64
- GPU: NVIDIA A100-SXM4-40GB
- OS: Ubuntu
Who can help?
No response
Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I followed the official example for the Llama model: https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama
I've been experiencing significant slowdowns when the `return_context_logits` flag is turned on. For context, I am using the llama example and have specifically enabled the `gather_context_logits` flag during the TensorRT-LLM engine build.
Additionally, I have been passing `return_context_logits` through the Triton client in an attempt to retrieve logits for the request sentences. To accommodate this, I have set `request_output_len` (or `output_len`) to 1; a rough sketch of this request path follows.
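For reference, an equivalent request via Triton's HTTP generate endpoint is sketched below. This is illustrative rather than my exact client code: it assumes a model repository laid out like the tensorrtllm_backend examples, where the `ensemble` model exposes `text_input`, `max_tokens`, and a boolean `return_context_logits` input.

```bash
# Illustrative only: assumes the tensorrtllm_backend "ensemble" model and
# these input names; check your model repository's config.pbtxt.
curl -s -X POST localhost:8000/v2/models/ensemble/generate \
  -d '{"text_input": "The capital of France is", "max_tokens": 1, "return_context_logits": true}'
```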
Expected behavior
The anticipated behavior when enabling `return_context_logits` is a manageable slowdown, without deviating significantly from the throughput with the flag off. Performance should ideally be on par with, or better than, the forward-pass speed of the HuggingFace implementation.
actual behavior
The observed behavior is an almost 8-fold drop in execution speed when retrieving logits with a maximum output length of 1. This is, surprisingly, slower than the forward-pass speed of the comparable HuggingFace model.
Here's a comparative table of performance with and without the `return_context_logits` flag:

additional notes
I executed `trtllm-build` with the following configuration:
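(A representative sketch, assuming the v0.8.0 llama example defaults: the paths below are placeholders, and `--gather_context_logits` is the relevant flag for this report.)

```bash
# Representative build mirroring the v0.8.0 llama example; paths are
# placeholders. --gather_context_logits enables context-logit output.
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
             --output_dir ./llama_engine_dir \
             --gemm_plugin float16 \
             --gather_context_logits
```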
Any insights or assistance in addressing this unexpected slowdown would be greatly appreciated. If there are any further experiments or specific areas you would recommend investigating, please advise.