triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

The queue time and time spent processing in the ensemble model are huge #7080

Open ChowXu opened 6 months ago

ChowXu commented 6 months ago

Hi, team,

I'm testing an ensemble model on Triton 24.03-py3. The ensemble has a single step that calls an ONNX model running on a CPU instance. When I request the ONNX model directly, its p99 latency is about 9 ms, but when I request it through the ensemble the p99 latency increases to about 20 ms. Can anyone help?

ChowXu commented 6 months ago

Ensemble config

name: "ensemble_model" platform: "ensemble" max_batch_size: 0 input [ { name: "ABSMModelMergedInput" data_type: TYPE_FP32 dims: [-1,100,39] # Variable batch size } ] output [ { name: "dense_11" data_type: TYPE_FP32 dims: [ -1,1] } ] ensemble_scheduling { step [ { model_name: "ABSMMODEL_V2" model_version: -1 input_map { key: "ABSMModelMergedInput" value: "ABSMModelMergedInput" } output_map { key: "dense_11" value: "dense_11" } } ] }

ChowXu commented 6 months ago

ONNX Model Config

name: "ABSMMODEL_V2" backend: "onnxruntime" max_batch_size : 0 input [ { name: "ABSMModelMergedInput" data_type: TYPE_FP32 dims: [-1,100,39] # Variable batch size } ] output [ { name: "dense_11" data_type: TYPE_FP32 dims: [ -1,1] } ]

ChowXu commented 6 months ago

ONNX model perf

(base) root@aiml-jenkins-worker:/home/xzhou4# env file=test-data.json wrk/wrk -t1 -c1 -d30s --latency -s benchmark.lua http://localhost:8000/v2/models/ABSMMODEL_V3/infer
Running 5m test @ http://localhost:8000/v2/models/ABSMMODEL_V3/infer
  1 threads and 1 connections
^C
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.71ms  818.43us  10.27ms   89.87%
    Req/Sec   210.67     10.60    230.00     53.33%
  Latency Distribution
     50%    4.47ms
     75%    4.81ms
     90%    5.53ms
     99%    8.70ms
  315 requests in 1.50s, 65.52KB read
Requests/sec:    209.73
Transfer/sec:     43.62KB

ChowXu commented 6 months ago

ensemble_model perf

(base) root@aiml-jenkins-worker:/home/xzhou4# env file=test-data.json wrk/wrk -t1 -c1 -d30s --latency -s benchmark.lua http://localhost:8000/v2/models/ensemble_model_1/infer
Running 5m test @ http://localhost:8000/v2/models/ensemble_model_1/infer
  1 threads and 1 connections
^C
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.31ms    2.89ms   48.77ms   93.97%
    Req/Sec   197.49     42.49    242.00     86.06%
  Latency Distribution
     50%    4.49ms
     75%    5.19ms
     90%    6.77ms
     99%   19.51ms
  4943 requests in 25.13s, 1.38MB read
Requests/sec:    196.67
Transfer/sec:     56.08KB
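
To see whether the extra p99 latency at the ensemble comes from queueing or from compute, Triton's per-model statistics can be queried (the same data is also available from the HTTP endpoint /v2/models/<name>/stats). A minimal sketch, assuming tritonclient and the model names from the configs above; the JSON field names follow the statistics extension that Triton implements:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Cumulative per-model statistics since the models were loaded.
for name in ("ensemble_model", "ABSMMODEL_V2"):
    stats = client.get_inference_statistics(model_name=name)
    for model in stats["model_stats"]:
        s = model["inference_stats"]
        count = s["success"]["count"]
        if count == 0:
            continue
        # Durations are reported in nanoseconds; convert to average ms per request.
        queue_ms = s["queue"]["ns"] / count / 1e6
        compute_ms = (s["compute_input"]["ns"]
                      + s["compute_infer"]["ns"]
                      + s["compute_output"]["ns"]) / count / 1e6
        print(f"{model['name']} v{model['version']}: "
              f"avg queue {queue_ms:.2f} ms, avg compute {compute_ms:.2f} ms "
              f"over {count} requests")

If the queue share dominates for the composing ONNX model, adding model instances (instance_group) is one common knob to try; if compute dominates, the gap is more likely per-request overhead between the ensemble step and the composing model.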