triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

50k-60k infer/sec limitation #7590

Open v-hyhyniak-crt opened 2 months ago

v-hyhyniak-crt commented 2 months ago

Description

Hello NVIDIA Team,

During tests, we have observed a limit of ~50k-60k infer/sec per tritonserver instance, regardless of model complexity or the number of model instances, and with only partial CPU and GPU utilization.

Triton Information

Setup in Kubernetes:

To Reproduce

Expected behavior

As long as the resources are only partially used (max 5-6 CPUs out of the 20 available), we would expect infer/sec to increase as the instance count (ic) increases. In our tests we measured: 1 ic: ~25k infer/sec | 2 ic: ~50k infer/sec | 4 ic: ~50k infer/sec | 8 ic: ~54k infer/sec.
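For illustration only, here is a minimal sketch of the kind of concurrent gRPC client loop used to measure infer/sec; the endpoint, model name, tensor names, shapes, and concurrency level are placeholders and do not reflect the actual test harness:

```python
import time
import threading
import numpy as np
import tritonclient.grpc as grpcclient

URL = "localhost:8001"       # placeholder gRPC endpoint
MODEL = "example_model"      # placeholder model name
CONCURRENCY = 16             # client-side threads (placeholder)
DURATION_S = 30              # measurement window in seconds

count = 0
lock = threading.Lock()

def worker(stop_at):
    global count
    client = grpcclient.InferenceServerClient(url=URL)
    data = np.zeros((1, 16), dtype=np.float32)            # placeholder input tensor
    inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    while time.time() < stop_at:
        client.infer(model_name=MODEL, inputs=[inp])      # blocking gRPC inference
        with lock:
            count += 1

stop_at = time.time() + DURATION_S
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(CONCURRENCY)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"~{count / DURATION_S:.0f} infer/sec from this client process")
```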

We have also tested different setups and options, including:

:grey_exclamation: Running 2 tritonserver instances in the same container (with --reuse-grpc-port) actually doubles the throughput to 100k infer/sec. This may imply that the bottleneck is not container- or resource-related, but rather a limitation inside tritonserver itself.
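As a rough illustration of that workaround (not the exact launch commands used), two server processes sharing the gRPC port can be started as sketched below; the model repository path is a placeholder, and disabling the HTTP and metrics endpoints is an assumption made here only to avoid port clashes in a gRPC-only test:

```python
import subprocess

# Sketch of the "two tritonserver processes in one container" workaround.
args = [
    "tritonserver",
    "--model-repository=/models",  # placeholder path
    "--reuse-grpc-port=1",         # let both processes bind the same gRPC port
    "--allow-http=false",          # assumption: avoid a clash on the HTTP port
    "--allow-metrics=false",       # assumption: avoid a clash on the metrics port
]

procs = [subprocess.Popen(args) for _ in range(2)]
for p in procs:
    p.wait()
```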

:question: Taking the above into account, I was wondering if you could please provide more information about:

Thank you!