Description
I'm running load tests to find the configuration that achieves the highest QPS under a latency constraint. I noticed that increasing the instance count doesn't improve concurrency, even though we have enough CPU cores and memory. A similar issue appears to be reported in #7579.
Triton Information
What version of Triton are you using?
Tested on v24.01 and v24.08
Are you using the Triton container or did you build it yourself?
Docker container
To Reproduce
I'm testing a model on 2 Triton pods, with 3 instances each.
If serving one model only, we can serve at most around 3K QPS. The Triton queue time (nv_inference_queue_summary_us) is long, almost the same as the Triton end-to-end latency, while the inference compute time (nv_inference_compute_infer_summary_us) is almost negligible. Increasing the instance count from 3 to a higher number (6, 9) neither reduces the queue time nor improves the QPS.
If serving multiple (8, 16, etc.) models at the same time, the total QPS can reach as high as 8K.
I have allocated as many CPU cores as possible, so CPU is not a bottleneck.
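For reference, the instance count was varied through the instance_group setting in config.pbtxt. A minimal sketch, assuming a CPU-executed ONNX model (the model name, backend, and other fields are illustrative placeholders, not the actual config used):

```
# Hypothetical config.pbtxt for the load test; only instance_group.count
# was changed between runs (3 -> 6 -> 9).
name: "example_model"
backend: "onnxruntime"
max_batch_size: 0
instance_group [
  {
    count: 3        # raised to 6 and 9 with no change in QPS or queue time
    kind: KIND_CPU
  }
]
```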
Expected behavior
More parallel computation should happen when more instances are created. The scheduler should be able to process more requests with more instances, given that CPU and input/output are not bottlenecks.