triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

How to maximize single-model inference performance #7706

Open lei1liu opened 1 month ago

lei1liu commented 1 month ago

Description
I'm running load tests to find the configuration that achieves the highest QPS under a latency constraint. I noticed that increasing the instance count does not improve concurrency, even though we have enough CPU cores and memory. A similar issue seems to be reported in #7579.
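
For reference, a minimal load-generation sketch along the lines of this test, assuming the Python tritonclient package and an HTTP endpoint at localhost:8000; the model name "my_model" and the single FP32 input "input" of shape [1, 16] are placeholders rather than the real model signature (Triton's perf_analyzer with --concurrency-range reports comparable QPS/latency figures):

# Minimal load-generation sketch. Placeholders (not from the real model):
# model name "my_model", one FP32 input named "input" of shape [1, 16],
# Triton HTTP endpoint at localhost:8000.
import time

import numpy as np
import tritonclient.http as httpclient

MODEL = "my_model"      # placeholder model name
CONCURRENCY = 16        # number of in-flight requests to sustain
REQUESTS = 1000

client = httpclient.InferenceServerClient(url="localhost:8000",
                                          concurrency=CONCURRENCY)
data = np.random.rand(1, 16).astype(np.float32)

start = time.time()
pending = []
for _ in range(REQUESTS):
    inp = httpclient.InferInput("input", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    pending.append(client.async_infer(MODEL, inputs=[inp]))
    if len(pending) >= CONCURRENCY:
        # block on the oldest request to keep at most CONCURRENCY in flight
        pending.pop(0).get_result()
for req in pending:
    req.get_result()

elapsed = time.time() - start
print(f"QPS: {REQUESTS / elapsed:.1f}")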

Triton Information
What version of Triton are you using? Tested on v24.01 and v24.08.

Are you using the Triton container or did you build it yourself? Docker container.

To Reproduce

I'm testing a model on 2 Triton pods, with 3 model instances each.

Default config.pbtxt:

backend: "tensorflow"
platform: "tensorflow_savedmodel"
max_batch_size: 8
dynamic_batching {
}
instance_group {
  count: 3
  kind: KIND_CPU
}
version_policy: { all { }}
model_operations { op_library_filename: "/home/inference.so" }

Expected behavior
More parallel computation should be performed when more instances are created. With more instances, the scheduler should be able to process more requests, given that CPU and input/output are not the bottleneck.
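
One way to check whether the added instances are actually absorbing load is to compare queue time with compute time from Triton's statistics extension: if average queue time grows with concurrency while compute time per request stays flat, requests are waiting in the scheduler rather than executing in parallel. A minimal sketch, assuming the Python tritonclient package and the placeholder model name "my_model":

# Sketch for reading Triton's per-model statistics (statistics extension).
# Assumptions: placeholder model name "my_model", HTTP endpoint at localhost:8000.
import tritonclient.http as httpclient

def avg_ms(duration):
    # duration is a {"count": ..., "ns": ...} entry from the stats response
    count = duration.get("count", 0)
    return duration.get("ns", 0) / count / 1e6 if count else 0.0

client = httpclient.InferenceServerClient(url="localhost:8000")
stats = client.get_inference_statistics(model_name="my_model")

for model in stats.get("model_stats", []):
    infer = model.get("inference_stats", {})
    print(f"{model.get('name')} v{model.get('version')}: "
          f"avg queue {avg_ms(infer.get('queue', {})):.2f} ms, "
          f"avg compute {avg_ms(infer.get('compute_infer', {})):.2f} ms, "
          f"{infer.get('success', {}).get('count', 0)} successful requests")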