triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Big performance drop when using ensemble model over separate calls #7650

Open jcuquemelle opened 2 months ago

jcuquemelle commented 2 months ago

Description
We have an ensemble of two models chained together (descriptions of the models are given below).

Calling only the "preprocessing" model yields a max throughput of 21500 QPS @ 6 Cpu cores usage Calling only the "inference" model yields a max throughput of 44000 QPS @ 6 Cpu + 0.7 Gpu usage Calling the "ensemble" model yields a max throughput of less than 8500 QPS @ 10 Cpu + 0.7 Gpu usage
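
For illustration, here is a minimal tritonclient sketch of the three calling patterns being compared; the endpoint, tensor names, shapes and dtype are placeholders, not the actual model signatures:

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder endpoint and tensor names -- the real models use their own I/O signatures.
client = grpcclient.InferenceServerClient(url="localhost:8001")

def infer(model_name, array, input_name="INPUT", output_name="OUTPUT"):
    # Wrap a numpy array as a Triton input, request one output, return it as numpy.
    inp = grpcclient.InferInput(input_name, list(array.shape), "FP32")
    inp.set_data_from_numpy(array)
    out = grpcclient.InferRequestedOutput(output_name)
    result = client.infer(model_name=model_name, inputs=[inp], outputs=[out])
    return result.as_numpy(output_name)

raw = np.random.rand(1, 32).astype(np.float32)  # dummy payload

# Separate calls: the client chains the two models itself.
features = infer("preprocessing", raw)
scores_separate = infer("inference", features)

# Ensemble call: the server-side ensemble scheduler does the chaining.
scores_ensemble = infer("ensemble", raw)
```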

Why is there such a performance drop when using the ensemble? Given the figures for the individual models, I'd expect the throughput of the ensemble to be around the same as that of the composing model with the lowest throughput, i.e. 21500 QPS.

The server-side metrics don't show any sign of a blocking step in the inference pipeline, although client-side the latency increases over time:

Request Rate: 8500 inference requests per seconds                                                                                       
  Pass [1] throughput: 8500.57 infer/sec. p75 latency: 3735 usec                                                                        
  Pass [2] throughput: 8499.32 infer/sec. p75 latency: 6115 usec
  Pass [3] throughput: 8500.86 infer/sec. p75 latency: 1298 usec
  Pass [4] throughput: 8282.7 infer/sec. p75 latency: 74570 usec
  Pass [5] throughput: 8384 infer/sec. p75 latency: 133184 usec
  Pass [6] throughput: 8425.3 infer/sec. p75 latency: 165502 usec
  Pass [7] throughput: 8255.55 infer/sec. p75 latency: 191599 usec
  Client: 
    Request count: 92706
    Throughput: 8356.54 infer/sec
    Avg client overhead: 18.98%
    p50 latency: 149491 usec
    p75 latency: 178315 usec
    p90 latency: 190544 usec
    p95 latency: 213366 usec
    p99 latency: 263968 usec
    Avg gRPC time: 153756 usec (marshal 5 usec + response wait 153751 usec + unmarshal 0 usec)
  Server: 
    Inference count: 92705
    Execution count: 92705
    Successful request count: 92705
    Avg request latency: 924 usec (overhead 265 usec + queue 90 usec + compute 569 usec)

  Composing models: 
  inference, version: 1
      Inference count: 92705
      Execution count: 73789
      Successful request count: 92705
      Avg request latency: 562 usec (overhead 107 usec + queue 67 usec + compute input 81 usec + compute infer 278 usec + compute output 27 usec)

  preprocessing, version: 1
      Inference count: 92704
      Execution count: 87071
      Successful request count: 92704
      Avg request latency: 237 usec (overhead 33 usec + queue 23 usec + compute input 96 usec + compute infer 77 usec + compute output 7 usec)

Triton Information
Running the official py3 container, version 24.06, on a Kubernetes pod with 16 CPUs and 32 GB of RAM.

To Reproduce
Composing model configurations: ensemble.txt, preprocessing.txt, inference.txt

preprocessing: ONNX model on CPU, dynamic batching
inference: ONNX model on GPU with the TRT execution provider, dynamic batching
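
The attached ensemble.txt is not inlined here; as a rough sketch, a two-step Triton ensemble config of this shape typically looks like the following (every name, dtype, dim and batch size below is a placeholder, not the real values from the attachments):

```
# Generic two-step ensemble sketch -- placeholders only, not the attached ensemble.txt.
name: "ensemble"
platform: "ensemble"
max_batch_size: 64
input [
  {
    name: "RAW"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
output [
  {
    name: "SCORES"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "INPUT" value: "RAW" }
      output_map { key: "OUTPUT" value: "FEATURES" }
    },
    {
      model_name: "inference"
      model_version: -1
      input_map { key: "INPUT" value: "FEATURES" }
      output_map { key: "OUTPUT" value: "SCORES" }
    }
  ]
}
```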

Expected behavior
The throughput of the ensemble is around the same as that of the composing model with the lowest throughput, i.e. 21500 QPS.

tanmayv25 commented 2 months ago

The performance degradation might be attributed to interference between the two composing models. From the report, it indeed looks like the requests are not getting blocked. What is the CPU utilization in all three cases?

Can you try running model_analyzer for the ensemble to get the most performant model configuration for your use case? https://github.com/triton-inference-server/model_analyzer/blob/main/docs/ensemble_quick_start.md

jcuquemelle commented 1 month ago

Hi @tanmayv25

The CPU usage is already mentioned for all three use cases. As a ballpark, the resource usage of the ensemble is the sum of the resource usage of preprocessing + inference, but we only get 8500 QPS against 21500 QPS for the slowest of the two composing models.

We didn't use model_analyzer on this case because it's not our final use case, which will be more complex; we'd first like to understand where the bottleneck comes from on this simpler setup.