Closed: qihang720 closed this issue 10 months ago
I found an interesting thing: if the TRT input is not batched, i.e. running `perf_analyzer -i grpc -u localhost:8003 -p100000 -m inceptionresnetv2_trt --concurrency-range=128:256:64 --shared-memory cuda` without adding `-b8`, then compared with the previous perf_analyzer run the throughput is only 783.788 infer/sec.
So how can I batch images of different sizes into the ensemble model, to improve the performance of the whole pipeline? (One possible approach is sketched after the numbers below.)
More details:
Batch size: 1
Service Kind: Triton
Using "time_windows" mode for stabilization
Measurement window: 100000 msec
Latency limit: 0 msec
Concurrency limit: 256 concurrent requests
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 128
Client:
Request count: 282460
Throughput: 783.788 infer/sec
Avg latency: 163087 usec (standard deviation 3172 usec)
p50 latency: 38878 usec
p90 latency: 695446 usec
p95 latency: 731166 usec
p99 latency: 826079 usec
Avg gRPC time: 163082 usec ((un)marshal request/response 7 usec + response wait 163075 usec)
Server:
Inference count: 282460
Execution count: 8232
Successful request count: 282460
Avg request latency: 165045 usec (overhead 103572 usec + queue 4570 usec + compute input 3425 usec + compute infer 26264 usec + compute output 27213 usec)
Request concurrency: 192
Client:
Request count: 284853
Throughput: 790.467 infer/sec
Avg latency: 242491 usec (standard deviation 7522 usec)
p50 latency: 58090 usec
p90 latency: 724592 usec
p95 latency: 745884 usec
p99 latency: 863566 usec
Avg gRPC time: 242492 usec ((un)marshal request/response 7 usec + response wait 242485 usec)
Server:
Inference count: 284847
Execution count: 7883
Successful request count: 284847
Avg request latency: 242548 usec (overhead 132935 usec + queue 7430 usec + compute input 5782 usec + compute infer 27710 usec + compute output 68689 usec)
Request concurrency: 256
Client:
Request count: 286033
Throughput: 793.492 infer/sec
Avg latency: 321997 usec (standard deviation 6996 usec)
p50 latency: 77793 usec
p90 latency: 756367 usec
p95 latency: 771339 usec
p99 latency: 863713 usec
Avg gRPC time: 321987 usec ((un)marshal request/response 7 usec + response wait 321980 usec)
Server:
Inference count: 286024
Execution count: 7506
Successful request count: 286024
Avg request latency: 322118 usec (overhead 153134 usec + queue 16401 usec + compute input 5955 usec + compute infer 30114 usec + compute output 116513 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 128, throughput: 783.788 infer/sec, latency 163087 usec
Concurrency: 192, throughput: 790.467 infer/sec, latency 242491 usec
Concurrency: 256, throughput: 793.492 infer/sec, latency 321997 usec
Successfully run inceptionresnetv2_trt
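For reference, client-side batching in perf_analyzer is requested with `-b 8`, while server-side batching is configured per model in config.pbtxt via the dynamic batcher; for inputs whose shape differs per request (such as variably sized encoded images feeding DALI), Triton additionally supports ragged batching. A minimal sketch, assuming illustrative values, an assumed input tensor name, and a backend that supports ragged inputs:

```
# Hypothetical config.pbtxt fragment for the preprocessing model.
# max_batch_size, queue delay, and tensor name are assumptions for illustration.
max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100   # wait briefly so larger batches can form
}
input [
  {
    name: "DALI_INPUT"                # assumed name
    data_type: TYPE_UINT8
    dims: [ -1 ]                      # encoded image bytes, variable length
    allow_ragged_batch: true          # lets requests with different shapes share a batch
  }
]
```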
@qihang720 Thanks for sharing all these performance numbers! Regarding your questions:
Are the modules in the ensemble running in parallel? Based on my understanding of the bottleneck effect, the performance of the ensemble should be consistent with that of the DALI module.
Yes. Given that you are running concurrent requests, the two stages of the ensemble will be running in parallel on different sets of requests.
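For context, the two stages referred to here are the steps of the ensemble. A sketch of what such an ensemble config.pbtxt typically looks like follows; the model and tensor names are assumptions based on this thread, not the poster's actual config:

```
# Hypothetical ensemble config.pbtxt; names are assumed for illustration.
name: "ensemble"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE" data_type: TYPE_UINT8 dims: [ -1 ] } ]
output [ { name: "CLASS_PROB" data_type: TYPE_FP32 dims: [ 1000 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "dali"                     # stage 1: preprocessing
      model_version: -1
      input_map { key: "DALI_INPUT" value: "RAW_IMAGE" }
      output_map { key: "DALI_OUTPUT" value: "preprocessed" }
    },
    {
      model_name: "inceptionresnetv2_trt"    # stage 2: inference
      model_version: -1
      input_map { key: "input" value: "preprocessed" }
      output_map { key: "output" value: "CLASS_PROB" }
    }
  ]
}
```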
What methods do I have to improve performance?
From what you have shared, it seems that GPU resources are the bottleneck here. When the two models run simultaneously, they hurt each other's performance. I see that you have four instances of each model, so this is most likely the case.
Assuming this to be true, you can derive the most performant number of instances of the ensemble's composing models for your target GPU. You may use Triton Model Analyzer to obtain this recipe.
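For reference, the per-model instance count that Model Analyzer would sweep over is set via instance_group in config.pbtxt. A minimal sketch, where the count of 2 is purely illustrative:

```
# Hypothetical fragment; the count is illustrative, not a recommendation.
instance_group [
  {
    count: 2        # number of concurrent execution instances of this model
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```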
If even a single instance of each of the two models does not give you the desired results, you can use the rate limiter to run only one of the models on the GPU at a time. This should most likely give you performance consistent with the DALI-only case.
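For reference, the rate limiter is enabled by starting the server with `--rate-limit=execution_count` and is driven by named resources declared per instance group. A sketch under the assumption that both models declare the same single-count resource (the resource name here is made up); since by default the available amount of a resource equals the maximum requested by any instance, the two models then cannot execute simultaneously:

```
# Hypothetical fragment, placed in BOTH models' config.pbtxt.
# "gpu_exclusive" is an arbitrary resource name chosen for illustration.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [ { name: "gpu_exclusive" count: 1 } ]
    }
  }
]
```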
Description Hi, I have two models: one is DALI and the other is inceptionresnet_trt. I use an ensemble model to compose them together, but the performance is worse than I expected.
Here are the individual performance metrics for each module, measured with perf_analyzer.
dali performance
trt performance
ensemble performance
I have two questions:
Are the modules in the ensemble running in parallel? Based on my understanding of the bottleneck effect, the performance of the ensemble should be consistent with that of the DALI module.
What methods do I have to improve performance?
Triton Information nvcr.io/nvidia/tritonserver:23.05-py3
To Reproduce dali pipeline
dali configs
dali performance
trt configs
trt performance
ensemble config
ensemble performance