Closed · RunningLeon closed this 4 months ago
cc @ywang96 if you can help answer
hi @RunningLeon , how did you end up solving this? could you please give some insight?
@geraldstanje hi, you can refer to this comment https://github.com/triton-inference-server/tensorrtllm_backend/issues/453#issuecomment-2111521451
@RunningLeon thanks! Could you kindly post all the parameters you selected to run vLLM with Llama 3 8B?
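While waiting for a reply, here is a minimal sketch of the kind of engine arguments such a run would involve. Everything below is an illustrative assumption (model id and all values included), not the settings RunningLeon actually used; the same knobs map to the corresponding flags of vLLM's OpenAI-compatible server.

```python
# Illustrative only: a plausible set of vLLM engine arguments for
# Llama 3 8B, NOT the parameters actually used in this thread.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model id
    dtype="bfloat16",
    gpu_memory_utilization=0.90,  # fraction of GPU memory for weights + KV cache
    max_num_seqs=256,             # upper bound on concurrently batched sequences
)
```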
Proposal to improve performance
Hi, how do I test TensorRT-LLM serving correctly? I've tested llama2-8b-chat and llama3-8b, and the TTFT is unreasonably high. Could you tell me what is going wrong? Thanks.
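For anyone reproducing this, a minimal TTFT probe against a Triton tensorrtllm_backend server might look like the sketch below. It assumes the usual `ensemble` model name and the `generate_stream` endpoint of Triton's generate extension; adjust the model name, port, and payload fields to match your deployment.

```python
# Minimal TTFT probe: time from sending the request to receiving the
# first streamed event. Assumes the "ensemble" model and default port.
import time
import requests

URL = "http://localhost:8000/v2/models/ensemble/generate_stream"
payload = {"text_input": "Hello, my name is", "max_tokens": 128, "stream": True}

start = time.perf_counter()
with requests.post(URL, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if line:  # the first non-empty SSE event carries the first token
            print(f"TTFT: {(time.perf_counter() - start) * 1000:.1f} ms")
            break
```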
I use the docker image nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 and follow this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md. These are the results for request rate = 7:
related issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/453
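For context on what "request rate = 7" means here: serving benchmarks such as vLLM's benchmark_serving.py usually space requests with exponentially distributed gaps, so arrivals form a Poisson process at the given mean rate. A rough sketch of that pattern (illustrative, not the actual benchmark code):

```python
# Fire num_requests with exponential inter-arrival gaps, giving a
# Poisson arrival process with the requested mean rate.
import asyncio
import random

async def send_one(i: int) -> None:
    # placeholder: issue one streamed request and record TTFT/latency
    await asyncio.sleep(0)

async def run(num_requests: int, request_rate: float) -> None:
    tasks = []
    for i in range(num_requests):
        tasks.append(asyncio.create_task(send_one(i)))
        # mean gap of 1/request_rate seconds; rate=7 -> ~7 requests/s
        await asyncio.sleep(random.expovariate(request_rate))
    await asyncio.gather(*tasks)

asyncio.run(run(num_requests=100, request_rate=7.0))
```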
Report of performance regression
Ran script: