Description
I ran a benchmark of Meta-Llama-3-8B-Instruct on 8×RTX 4090.
With 16 concurrent requests, an input sequence length of 1024, and an output sequence length of 1024, the TTFT (time to first token) is 0.403s, which is acceptable.
However, with 1024 concurrent requests,
the TTFT rises to 379.089s. Is this normal?
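For context, a back-of-envelope queueing sketch (all numbers below are assumptions, not measurements from this setup: a hypothetical `max_batch_size` of 16 and a per-batch prefill time taken from the measured 16-request TTFT) already shows that average TTFT must grow roughly linearly with the number of queued requests, since later requests wait for earlier prefill batches to finish:

```python
import math

def estimated_avg_ttft(num_requests, max_batch_size=16, prefill_time_s=0.403):
    """Rough average TTFT when requests queue behind earlier prefill batches.

    Assumption: the scheduler serves prefills in batches of `max_batch_size`,
    and each batch's prefill takes roughly `prefill_time_s` (the measured
    16-request TTFT). Real systems add KV-cache pressure and scheduling
    overhead on top, so actual TTFT can be considerably higher.
    """
    num_batches = math.ceil(num_requests / max_batch_size)
    # A request in batch i sees i earlier prefills plus its own.
    waits = [(i + 1) * prefill_time_s for i in range(num_batches)]
    return sum(waits) / num_batches

print(estimated_avg_ttft(16))    # single batch: just the prefill time
print(estimated_avg_ttft(1024))  # many batches: queueing dominates
```

Under this toy model the average TTFT at 1024 requests is already tens of seconds; the measured 379s suggests additional slowdown (e.g. longer per-batch prefill once memory is saturated), but the linear growth itself is expected behavior, not a bug.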
Triton Information
TensorRT-LLM: v0.9.0
tensorrtllm_backend: v0.9.0
Are you using the Triton container or did you build it yourself?
Yes, using the Triton container.
To Reproduce
Expected behavior
The TTFT should be lower at high concurrency.