triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

50k-60k infer/sec limitation #7590

Open v-hyhyniak-crt opened 2 months ago

v-hyhyniak-crt commented 2 months ago

Description

Hello NVIDIA Team,

During tests, we have observed a limit of ~50k-60k infer/sec per tritonserver instance, regardless of model complexity or the number of model instances, and with only partial CPU and GPU utilization.

Triton Information

Setup in Kubernetes:

To Reproduce

Expected behavior

As long as the resources are only partially used (max 5-6 CPUs out of the 20 available), we would expect infer/sec to increase as the instance count (ic) increases. In our tests we measured: 1 ic: ~25k infer/sec | 2 ic: ~50k infer/sec | 4 ic: ~50k infer/sec | 8 ic: ~54k infer/sec.
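For illustration only, here is a minimal sketch of the kind of concurrent gRPC client loop used to measure infer/sec; the endpoint, model name, tensor names, shapes, and concurrency level are placeholders and do not reflect the actual test harness:

```python
import time
import threading
import numpy as np
import tritonclient.grpc as grpcclient

URL = "localhost:8001"       # placeholder gRPC endpoint
MODEL = "example_model"      # placeholder model name
CONCURRENCY = 16             # client-side threads (placeholder)
DURATION_S = 30              # measurement window in seconds

count = 0
lock = threading.Lock()

def worker(stop_at):
    global count
    client = grpcclient.InferenceServerClient(url=URL)
    data = np.zeros((1, 16), dtype=np.float32)            # placeholder input tensor
    inp = grpcclient.InferInput("INPUT0", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    while time.time() < stop_at:
        client.infer(model_name=MODEL, inputs=[inp])      # blocking gRPC inference
        with lock:
            count += 1

stop_at = time.time() + DURATION_S
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(CONCURRENCY)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"~{count / DURATION_S:.0f} infer/sec from this client process")
```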

We have also tested different setups and options, including:

:grey_exclamation: Running 2 tritonserver instances in the same container (with --reuse-grpc-port) actually doubles the throughput to 100k infer/sec. This may imply that the bottleneck is not container- or resource-related, but rather a limitation inside tritonserver itself.
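As a rough illustration of that workaround (not the exact launch commands used), two server processes sharing the gRPC port can be started as sketched below; the model repository path is a placeholder, and disabling the HTTP and metrics endpoints is an assumption made here only to avoid port clashes in a gRPC-only test:

```python
import subprocess

# Sketch of the "two tritonserver processes in one container" workaround.
args = [
    "tritonserver",
    "--model-repository=/models",  # placeholder path
    "--reuse-grpc-port=1",         # let both processes bind the same gRPC port
    "--allow-http=false",          # assumption: avoid a clash on the HTTP port
    "--allow-metrics=false",       # assumption: avoid a clash on the metrics port
]

procs = [subprocess.Popen(args) for _ in range(2)]
for p in procs:
    p.wait()
```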

:question: Taking the above into account, I was wondering if you could please provide more information about:

Thank you!