Open · jcuquemelle opened this issue 2 months ago
tanmayv25 commented:
The performance degradation might be attributed to interference between the two composing models. From the report, it indeed looks like the requests are not getting blocked. What is the CPU utilization in all three cases?
Can you try running model_analyzer for the ensemble to get the most performant model configuration for your use case? https://github.com/triton-inference-server/model_analyzer/blob/main/docs/ensemble_quick_start.md
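For reference, a minimal Model Analyzer invocation for this setup could look like the sketch below; the repository path, launch mode, and search mode are assumptions, and `ensemble` is the model name used in this issue:

```shell
# Sketch: let Model Analyzer search for the most performant configuration
# of the ensemble and its composing models. Paths and modes are placeholders.
model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models ensemble \
    --run-config-search-mode quick \
    --triton-launch-mode docker \
    --output-model-repository-path /tmp/output_repository
```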
jcuquemelle commented:
Hi @tanmayv25,
The CPU usage is already mentioned for all 3 cases. As a ballpark, the resource usage of the ensemble is the sum of the resource usage of preprocessing + inference, yet we only reach 8500 QPS against 21500 QPS for the slower of the two composing models.
We didn't run model_analyzer on this case because it isn't our final use case, which will be more complex; we'd first like to understand where the bottleneck comes from in a simpler setup.
Original issue description:

Description
We have an ensemble of 2 models chained together (the models are described below).
- Calling only the "preprocessing" model yields a max throughput of 21500 QPS @ 6 CPU cores
- Calling only the "inference" model yields a max throughput of 44000 QPS @ 6 CPU cores + 0.7 GPU
- Calling the "ensemble" model yields a max throughput of less than 8500 QPS @ 10 CPU cores + 0.7 GPU
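For anyone trying to reproduce this, here is a sketch of how such per-model throughput numbers can be measured with perf_analyzer (the model names come from this issue; the endpoint and concurrency range are assumptions to tune):

```shell
# Sketch: saturate each model separately, then the ensemble, and compare
# the reported infer/sec. The concurrency range is a placeholder.
perf_analyzer -m preprocessing -i grpc -u localhost:8001 --concurrency-range 16:256:16
perf_analyzer -m inference -i grpc -u localhost:8001 --concurrency-range 16:256:16
perf_analyzer -m ensemble -i grpc -u localhost:8001 --concurrency-range 16:256:16
```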
Why is there such a performance drop when using the ensemble? Given the figures for the individual models, I'd expect the throughput of the ensemble to be around that of the composing model with the lowest throughput, i.e. 21500 QPS.
The server-side metrics don't show any sign of a blocking step in the inference pipeline, although client-side the latency increases over time.
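One way to double-check for queueing server-side is Triton's Prometheus metrics endpoint; a minimal sketch, assuming the default metrics port 8002:

```shell
# Sketch: per-model cumulative queue and compute times exposed by Triton.
# A queue duration growing much faster than the request count for one of
# the composing models would point to requests piling up in its queue.
curl -s localhost:8002/metrics | \
  grep -E 'nv_inference_(queue_duration_us|compute_infer_duration_us|count)'
```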
Triton Information
Running the official py3 container, version 24.06, on a Kubernetes pod with 16 CPUs and 32 GB RAM.
To Reproduce
Composing model configurations: ensemble.txt, preprocessing.txt, inference.txt
Preprocessing: ONNX model on CPU, dynamic batching
Inference: ONNX model on GPU with the TensorRT execution provider, dynamic batching
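The attached configs aren't inlined here; as a rough sketch of what such a two-step ensemble config looks like (all tensor names, dtypes, and dims below are hypothetical placeholders; the real values are in the attached ensemble.txt):

```protobuf
# Sketch of an ensemble chaining "preprocessing" into "inference".
# Every tensor name, dtype, and shape here is a placeholder; see the
# attached ensemble.txt for the actual configuration.
name: "ensemble"
platform: "ensemble"
max_batch_size: 128
input [
  {
    name: "RAW_INPUT"      # hypothetical input tensor
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
output [
  {
    name: "SCORES"         # hypothetical output tensor
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "INPUT" value: "RAW_INPUT" }
      output_map { key: "OUTPUT" value: "preprocessed_data" }
    },
    {
      model_name: "inference"
      model_version: -1
      input_map { key: "INPUT" value: "preprocessed_data" }
      output_map { key: "OUTPUT" value: "SCORES" }
    }
  ]
}
```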
Expected behavior
The throughput of the ensemble should be around that of the composing model with the lowest throughput, i.e. 21500 QPS.