danielchalef opened this issue 6 months ago
Hi @danielchalef,
Is this latency unusual compared with other models you have tested? Is it consistently present across all test runs with this model and with other TensorRT-LLM models?
Closing due to inactivity. Please re-open if you would like to follow up on this issue.
Description We're seeing significant latency, on the order of 300-600 milliseconds, between COMPUTE_END and REQUEST_END on a TensorRT-LLM model. See the OTEL trace image below.
Triton Information What version of Triton are you using? 2.42.0
Are you using the Triton container or did you build it yourself? NGC Container 24.01
To Reproduce
See the config.pbtxt below.
Expected behavior REQUEST_END should occur very shortly after COMPUTE_END, on the order of tens of milliseconds.