triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[question][performance] How to improve ensemble model performance #6422

Closed qihang720 closed 10 months ago

qihang720 commented 11 months ago

Description

Hi, I have two models: one is dali and the other is inceptionresnet_trt. I compose them together with an ensemble model, but the performance is worse than I expected.

Here are the performance metrics for each model measured individually with perf_analyzer.

dali performance

Concurrency: 128, throughput: 901.213 infer/sec, latency 142004 usec
Concurrency: 192, throughput: 904.907 infer/sec, latency 212052 usec
Concurrency: 256, throughput: 885.352 infer/sec, latency 289257 usec

trt performance

Concurrency: 128, throughput: 3373.73 infer/sec, latency 303410 usec
Concurrency: 192, throughput: 3313 infer/sec, latency 463416 usec
Concurrency: 256, throughput: 3310.97 infer/sec, latency 618280 usec

ensemble performance

Concurrency: 128, throughput: 295.212 infer/sec, latency 432599 usec
Concurrency: 192, throughput: 290.369 infer/sec, latency 661192 usec
Concurrency: 256, throughput: 285.376 infer/sec, latency 898961 usec

I have two questions:

  1. Do the models inside the ensemble run in parallel? Based on my understanding, the ensemble should be limited by its slowest stage, so its performance should be roughly consistent with that of the DALI model.

  2. What methods can I use to improve performance?

Triton Information

nvcr.io/nvidia/tritonserver:23.05-py3

To Reproduce

dali pipeline

import nvidia.dali as dali
import nvidia.dali.types as types
from nvidia.dali.plugin.triton import autoserialize

@autoserialize
@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="encoded")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, device="gpu", resize_x=299, resize_y=299)
    images = dali.fn.crop_mirror_normalize(images,
                                           dtype=types.FLOAT,
                                           output_layout="CHW",
                                           crop=(299, 299),
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
    return images

dali configs

name: "dali"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "encoded"
    data_type: TYPE_UINT8
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]

output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [  3, 299, 299 ]
  }
]

parameters: [
  {
    key: "num_threads"
    value: { string_value: "32" }
  },
  {
    key: "batch_size"
    value: { string_value: "32" }
  }
]

dynamic_batching {
  preferred_batch_size: [32]
  max_queue_delay_microseconds: 20000
}

instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0 ]
    }
]

dali performance

**perf_analyzer -i grpc -u localhost:8003 -p100000 -m dali --input-data dataset.json --concurrency-range=128:256:64**
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 100000 msec
  Latency limit: 0 msec
  Concurrency limit: 256 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 128
  Client: 
    Request count: 325108
    Throughput: 901.213 infer/sec
    Avg latency: 142004 usec (standard deviation 4081 usec)
    p50 latency: 61563 usec
    p90 latency: 468976 usec
    p95 latency: 498278 usec
    p99 latency: 575428 usec
    Avg gRPC time: 141979 usec ((un)marshal request/response 47 usec + response wait 141932 usec)
  Server: 
    Inference count: 325084
    Execution count: 10206
    Successful request count: 325084
    **Avg request latency: 160185 usec (overhead 29 usec + queue 23478 usec + compute input 117 usec + compute infer 15999 usec + compute output 120561 usec)**

Request concurrency: 192
  Client: 
    Request count: 326110
    Throughput: 904.907 infer/sec
    Avg latency: 212052 usec (standard deviation 3968 usec)
    p50 latency: 94653 usec
    p90 latency: 532570 usec
    p95 latency: 558345 usec
    p99 latency: 686373 usec
    Avg gRPC time: 212030 usec ((un)marshal request/response 50 usec + response wait 211980 usec)
  Server: 
    Inference count: 326137
    Execution count: 8190
    Successful request count: 326137
    **Avg request latency: 243256 usec (overhead 39 usec + queue 56225 usec + compute input 186 usec + compute infer 22304 usec + compute output 164502 usec)**

Request concurrency: 256
  Client: 
    Request count: 318966
    Throughput: 885.352 infer/sec
    Avg latency: 289257 usec (standard deviation 7119 usec)
    p50 latency: 141051 usec
    p90 latency: 602121 usec
    p95 latency: 637016 usec
    p99 latency: 967791 usec
    Avg gRPC time: 289217 usec ((un)marshal request/response 50 usec + response wait 289167 usec)
  Server: 
    Inference count: 318987
    Execution count: 6254
    Successful request count: 318987
    **Avg request latency: 334483 usec (overhead 50 usec + queue 70125 usec + compute input 266 usec + compute infer 28835 usec + compute output 235205 usec)**

Inferences/Second vs. Client Average Batch Latency
Concurrency: 128, throughput: 901.213 infer/sec, latency 142004 usec
Concurrency: 192, throughput: 904.907 infer/sec, latency 212052 usec
Concurrency: 256, throughput: 885.352 infer/sec, latency 289257 usec

trt configs

name: "inceptionresnetv2_trt"
platform: "tensorrt_plan"
max_batch_size: 256

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 299, 299 ]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

dynamic_batching {
  preferred_batch_size: [32,64]
  max_queue_delay_microseconds: 20000
}

instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0 ]
    }
]

trt performance

perf_analyzer -i grpc -u localhost:8003 -p100000 -m inceptionresnetv2_trt -b8 --concurrency-range=128:256:64 --shared-memory cuda
  Batch size: 8
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 100000 msec
  Latency limit: 0 msec
  Concurrency limit: 256 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 128
  Client: 
    Request count: 151827
    Throughput: 3373.73 infer/sec
    Avg latency: 303410 usec (standard deviation 10678 usec)
    p50 latency: 317383 usec
    p90 latency: 351802 usec
    p95 latency: 374727 usec
    p99 latency: 392750 usec
    Avg gRPC time: 303389 usec ((un)marshal request/response 9 usec + response wait 303380 usec)
  Server: 
    Inference count: 1214616
    Execution count: 9574
    Successful request count: 151827
    Avg request latency: 304009 usec (overhead 3087 usec + queue 50873 usec + compute input 7542 usec + compute infer 107353 usec + compute output 135153 usec)

Request concurrency: 192
  Client: 
    Request count: 149091
    Throughput: 3313 infer/sec
    Avg latency: 463416 usec (standard deviation 8586 usec)
    p50 latency: 476109 usec
    p90 latency: 483732 usec
    p95 latency: 499809 usec
    p99 latency: 529377 usec
    Avg gRPC time: 463394 usec ((un)marshal request/response 9 usec + response wait 463385 usec)
  Server: 
    Inference count: 1192728
    Execution count: 6052
    Successful request count: 149091
    Avg request latency: 464639 usec (overhead 3626 usec + queue 64467 usec + compute input 25158 usec + compute infer 154435 usec + compute output 216952 usec)

Request concurrency: 256
  Client: 
    Request count: 149001
    Throughput: 3310.97 infer/sec
    Avg latency: 618280 usec (standard deviation 6959 usec)
    p50 latency: 613077 usec
    p90 latency: 686041 usec
    p95 latency: 707274 usec
    p99 latency: 740889 usec
    Avg gRPC time: 618258 usec ((un)marshal request/response 9 usec + response wait 618249 usec)
  Server: 
    Inference count: 1192080
    Execution count: 5689
    Successful request count: 149010
    Avg request latency: 619643 usec (overhead 3992 usec + queue 188815 usec + compute input 24961 usec + compute infer 167886 usec + compute output 233988 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 128, throughput: 3373.73 infer/sec, latency 303410 usec
Concurrency: 192, throughput: 3313 infer/sec, latency 463416 usec
Concurrency: 256, throughput: 3310.97 infer/sec, latency 618280 usec
Successfully run inceptionresnetv2_trt

ensemble config

name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "encoded"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map {
        key: "encoded"
        value: "encoded"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "inceptionresnetv2_trt"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "output"
        value: "OUTPUT"
      }
    }
  ]
}

ensemble performance

**perf_analyzer -i grpc -u localhost:8003 -p100000 -m ensemble_dali_inception --input-data dataset.json --concurrency-range=128:256:64**
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 100000 msec
  Latency limit: 0 msec
  Concurrency limit: 256 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 128
  Client: 
    Request count: 106532
    Throughput: 295.212 infer/sec
    Avg latency: 432599 usec (standard deviation 10899 usec)
    p50 latency: 58747 usec
    p90 latency: 924976 usec
    p95 latency: 949175 usec
    p99 latency: 1036743 usec
    Avg gRPC time: 432607 usec ((un)marshal request/response 6 usec + response wait 432601 usec)
  Server: 
    Inference count: 106528
    Execution count: 106528
    Successful request count: 106528
    **Avg request latency: 471530 usec (overhead 166915 usec + queue 48639 usec + compute 255976 usec)**

  Composing models: 
  dali, version: 1
      Inference count: 106614
      Execution count: 3538
      Successful request count: 106614
      **Avg request latency: 207766 usec (overhead 27 usec + queue 5402 usec + compute input 103 usec + compute infer 10564 usec + compute output 191669 usec)**

  inceptionresnetv2_trt, version: 1
      Inference count: 106528
      Execution count: 3552
      Successful request count: 106528
      **Avg request latency: 281695 usec (overhead 184819 usec + queue 43237 usec + compute input 2191 usec + compute infer 14305 usec + compute output 37142 usec)**

Request concurrency: 192
  Client: 
    Request count: 104587
    Throughput: 290.369 infer/sec
    Avg latency: 661192 usec (standard deviation 9616 usec)
    p50 latency: 923858 usec
    p90 latency: 973451 usec
    p95 latency: 998697 usec
    p99 latency: 1224770 usec
    Avg gRPC time: 661189 usec ((un)marshal request/response 6 usec + response wait 661183 usec)
  Server: 
    Inference count: 104569
    Execution count: 104569
    Successful request count: 104569
    **Avg request latency: 713323 usec (overhead 244368 usec + queue 74972 usec + compute 393983 usec)**

  Composing models: 
  dali, version: 1
      Inference count: 104589
      Execution count: 3488
      Successful request count: 104589
      **Avg request latency: 302694 usec (overhead 29 usec + queue 9044 usec + compute input 100 usec + compute infer 11353 usec + compute output 282167 usec)**

  inceptionresnetv2_trt, version: 1
      Inference count: 104569
      Execution count: 3187
      Successful request count: 104569
      **Avg request latency: 438725 usec (overhead 272435 usec + queue 65928 usec + compute input 3886 usec + compute infer 18663 usec + compute output 77812 usec)**

Request concurrency: 256
  Client: 
    Request count: 102787
    Throughput: 285.376 infer/sec
    Avg latency: 898961 usec (standard deviation 4812 usec)
    p50 latency: 960890 usec
    p90 latency: 1089060 usec
    p95 latency: 1832371 usec
    p99 latency: 2062218 usec
    Avg gRPC time: 898927 usec ((un)marshal request/response 6 usec + response wait 898921 usec)
  Server: 
    Inference count: 102751
    Execution count: 102751
    Successful request count: 102751
    **Avg request latency: 966696 usec (overhead 338527 usec + queue 119041 usec + compute 509128 usec)**

  Composing models: 
  dali, version: 1
      Inference count: 102833
      Execution count: 3359
      Successful request count: 102833
      **Avg request latency: 365314 usec (overhead 27 usec + queue 28411 usec + compute input 98 usec + compute infer 11704 usec + compute output 325073 usec)**

  inceptionresnetv2_trt, version: 1
      Inference count: 102750
      Execution count: 2780
      Successful request count: 102752
      **Avg request latency: 638580 usec (overhead 375698 usec + queue 90630 usec + compute input 3352 usec + compute infer 20978 usec + compute output 147921 usec)**

Inferences/Second vs. Client Average Batch Latency
Concurrency: 128, throughput: 295.212 infer/sec, latency 432599 usec
Concurrency: 192, throughput: 290.369 infer/sec, latency 661192 usec
Concurrency: 256, throughput: 285.376 infer/sec, latency 898961 usec
qihang720 commented 11 months ago

I found an interesting thing: if the TRT model's input is not batched on the client side, i.e. running perf_analyzer -i grpc -u localhost:8003 -p100000 -m inceptionresnetv2_trt --concurrency-range=128:256:64 --shared-memory cuda without the -b8 flag used in the previous perf_analyzer run, the throughput is only 783.788 infer/sec.

So how can I batch images of different sizes going into the ensemble model, to improve the performance of the whole pipeline?
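
For context, a minimal client sketch (assuming tritonclient[grpc] is installed and the ensemble above is served at localhost:8003; the image path is a placeholder) of how a single encoded image would be sent to the ensemble, with any batching left to Triton's server-side dynamic batcher:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the same gRPC endpoint used in the perf_analyzer runs above.
client = grpcclient.InferenceServerClient(url="localhost:8003")

# Read one encoded JPEG as raw bytes; "example.jpg" is a placeholder path.
with open("example.jpg", "rb") as f:
    raw = np.frombuffer(f.read(), dtype=np.uint8)

# Shape [1, num_bytes]: a batch dimension of 1 plus the variable-length byte stream.
# Forming larger batches across concurrent requests is left to dynamic batching.
encoded = grpcclient.InferInput("encoded", [1, raw.size], "UINT8")
encoded.set_data_from_numpy(raw.reshape(1, -1))

output = grpcclient.InferRequestedOutput("OUTPUT")
result = client.infer("ensemble_dali_inception", inputs=[encoded], outputs=[output])
print(result.as_numpy("OUTPUT").shape)  # expected: (1, 1000)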

more details:

  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 100000 msec
  Latency limit: 0 msec
  Concurrency limit: 256 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 128
  Client: 
    Request count: 282460
    Throughput: 783.788 infer/sec
    Avg latency: 163087 usec (standard deviation 3172 usec)
    p50 latency: 38878 usec
    p90 latency: 695446 usec
    p95 latency: 731166 usec
    p99 latency: 826079 usec
    Avg gRPC time: 163082 usec ((un)marshal request/response 7 usec + response wait 163075 usec)
  Server: 
    Inference count: 282460
    Execution count: 8232
    Successful request count: 282460
    Avg request latency: 165045 usec (overhead 103572 usec + queue 4570 usec + compute input 3425 usec + compute infer 26264 usec + compute output 27213 usec)

Request concurrency: 192
  Client: 
    Request count: 284853
    Throughput: 790.467 infer/sec
    Avg latency: 242491 usec (standard deviation 7522 usec)
    p50 latency: 58090 usec
    p90 latency: 724592 usec
    p95 latency: 745884 usec
    p99 latency: 863566 usec
    Avg gRPC time: 242492 usec ((un)marshal request/response 7 usec + response wait 242485 usec)
  Server: 
    Inference count: 284847
    Execution count: 7883
    Successful request count: 284847
    Avg request latency: 242548 usec (overhead 132935 usec + queue 7430 usec + compute input 5782 usec + compute infer 27710 usec + compute output 68689 usec)

Request concurrency: 256
  Client: 
    Request count: 286033
    Throughput: 793.492 infer/sec
    Avg latency: 321997 usec (standard deviation 6996 usec)
    p50 latency: 77793 usec
    p90 latency: 756367 usec
    p95 latency: 771339 usec
    p99 latency: 863713 usec
    Avg gRPC time: 321987 usec ((un)marshal request/response 7 usec + response wait 321980 usec)
  Server: 
    Inference count: 286024
    Execution count: 7506
    Successful request count: 286024
    Avg request latency: 322118 usec (overhead 153134 usec + queue 16401 usec + compute input 5955 usec + compute infer 30114 usec + compute output 116513 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 128, throughput: 783.788 infer/sec, latency 163087 usec
Concurrency: 192, throughput: 790.467 infer/sec, latency 242491 usec
Concurrency: 256, throughput: 793.492 infer/sec, latency 321997 usec
Successfully run inceptionresnetv2_trt
tanmayv25 commented 11 months ago

@qihang720 Thanks for sharing all these performance numbers! Regarding your questions:

Are the modules in the Ensemble module running in parallel? Based on my understanding, according to the bottleneck effect, the performance of the Ensemble module should be consistent with that of DALI module.

Yes. Given that you are running concurrent requests, the two stages of the ensemble will be running in parallel on different sets of requests.

What methods do I have to improve performance?

From what you have shared, it seems that GPU resources are the bottleneck here. When the two models run simultaneously, they hurt each other's performance. I see that you have 4 instances of both models, so this is most likely what is happening.

Assuming this is true, you can derive the most performant number of instances of your composing models in the ensemble for your target GPU. You may use Triton Model Analyzer to obtain this recipe.
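
A rough sketch of such a sweep (the repository path is a placeholder, and the exact flags depend on your Model Analyzer version):

model-analyzer profile \
    --model-repository /path/to/model_repository \
    --profile-models dali,inceptionresnetv2_trt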

If even a single instance of each of the two models does not give you the desired results, you can use the rate limiter to run only one of the models on the GPU at a time. This should most likely let you get performance consistent with running DALI alone.
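
A minimal sketch of that idea, assuming a single instance per model and a made-up resource name (see the rate limiter documentation for the exact semantics): start the server with --rate-limit=execution_count and give each model's instance_group a shared resource that only one instance can hold at a time, which serializes their executions.

instance_group [
    {
      count: 1
      kind: KIND_GPU
      gpus: [ 0 ]
      rate_limiter {
        resources [
          {
            name: "exclusive_gpu_slot"   # hypothetical resource name
            count: 1
          }
        ]
      }
    }
]

Adding the same rate_limiter block to both the dali and inceptionresnetv2_trt configs makes their executions mutually exclusive, since by default the available count of a resource is the maximum requested by any single instance (here, 1).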