Closed: royinx closed this issue 5 years ago.
@royinx Hello, I am new to using TensorRT Inference Server. For the same model deployed on TensorFlow Serving and on TensorRT Inference Server, is there a difference in speed between them?
For a TensorFlow model there is likely not much difference in performance. TRTIS does have some capabilities, like multi-instance execution, but for TF models those typically don't provide much benefit because of limitations of the TF framework.
For the performance numbers you shared, it is strange that your GPU utilization is lower for a larger batch size. Also, your response wait time is large compared to the compute time. Both of those seem to indicate that perhaps the network is your bottleneck. Are you running perf_client on the same system as the inference server?
I am using 2 containers on the same desktop:
TRTIS: 19.09-py3
TRTIS client: built from the client Dockerfile
CPU: Ryzen 7 3700X, GPU: RTX 2080 Super, RAM: 32 GB
I tested the bandwidth between both containers, and the network bandwidth may not be the bottleneck.
FP32 (4 bytes) * 3 ch * 480 h * 640 w ≈ 3.7 MB per frame, roughly 1.4 GB/s at the observed throughput.
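As a sanity check, the same back-of-the-envelope arithmetic in a few lines of Python (the ~400 infer/sec figure is just the rough throughput reported by perf_client above, and the result counts only the raw input tensor bytes, not gRPC/protobuf framing):

```python
# Estimate the input-tensor traffic generated by the benchmark.
frame_bytes = 4 * 3 * 480 * 640   # FP32 (4 bytes) x 3 ch x 480 h x 640 w = 3,686,400 bytes
throughput = 400                  # approx. inferences/sec reported by perf_client
gb_per_sec = frame_bytes * throughput / 1e9
print(f"{frame_bytes / 1e6:.2f} MB per frame -> ~{gb_per_sec:.2f} GB/s at {throughput} infer/sec")
```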
Docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
81c4c69fc7f1 trtis_perf_tester 0.00% 3.617MiB / 31.42GiB 0.01% 3.61GB / 26.5GB 512kB / 0B 1
a0a57751fbfe trtis 0.03% 6.624GiB / 31.42GiB 21.08% 26.5GB / 3.6GB 63.2MB / 0B 40
perf_client
root@81c4c69fc7f1:/workspace# perf_client -m face_lffd -u XX.XX.XX.XX:8001 -i gRPC -v -p3000 -d -l3000 -t5 -c5 -b64
*** Measurement Settings ***
Batch size: 64
Measurement window: 3000 msec
Latency limit: 3000 msec
Concurrency limit: 5 concurrent requests
Stabilizing using average latency
Request concurrency: 5
Pass [1] throughput: 42 infer/sec. Avg latency: 1098230 usec (std 65211 usec)
Pass [2] throughput: 384 infer/sec. Avg latency: 899362 usec (std 368437 usec)
Pass [3] throughput: 469 infer/sec. Avg latency: 703719 usec (std 124942 usec)
Pass [4] throughput: 426 infer/sec. Avg latency: 760636 usec (std 132688 usec)
Pass [5] throughput: 448 infer/sec. Avg latency: 739009 usec (std 97395 usec)
Client:
Request count: 21
Throughput: 448 infer/sec
Avg latency: 739009 usec (standard deviation 97395 usec)
p50 latency: 749726 usec
p90 latency: 862850 usec
p95 latency: 908628 usec
p99 latency: 984573 usec
Avg gRPC time: 733348 usec (marshal 48497 usec + response wait 683904 usec + unmarshal 947 usec)
Server:
Request count: 25
Avg request latency: 120107 usec (overhead 60 usec + queue 4227 usec + compute 115820 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, 448 infer/sec, latency 739009 usec
iperf network testing
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local XX.XX.XX.XX port 19999 connected with XX.XX.XX.XX port 33546
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 35.0 GBytes 30.1 Gbits/sec
[ 4] local XX.XX.XX.XX port 19999 connected with XX.XX.XX.XX port 33560
[ 4] 0.0-10.0 sec 35.8 GBytes 30.7 Gbits/sec
> @royinx Hello, I am new to using TensorRT Inference Server. For the same model deployed on TensorFlow Serving and on TensorRT Inference Server, is there a difference in speed between them?
I haven't tried the TF model, since if you check the issues here you will see a lot of problems with memory initialization and model optimization.
My work is always on PyTorch and MXNet. I have tried Libtorch (a nightmare), ONNX, TF and TensorRT. I highly recommend TensorRT (with model simplification) > ONNX Runtime > the native framework.
More information:

| # | Model instance | Util. | VRAM | Throughput (infer/sec) |
|---|---|---|---|---|
| 1 | Without TRTIS, 2 x TRT | 100% | 2 x 619 MB | > 1000 (514 + 520) |
| 2 | 1 x TRTIS, 2 x TRT | 50% | / | TRT: 340~400 |
| 3 | 1 x TRTIS, 1 x TRT_A + 1 x TRT_B | ~50% | / | TRT_A: 224, TRT_B: 224 |
I also tested 3 different models together on 1 TRTIS with 3 different perf_client containers; the utilization works around 60-90%.
The only case where I see the limit is with 2 instances of the same TRT model, i.e.
instance_group { count: 2 }
(see the config sketch below).
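For reference, a minimal config.pbtxt sketch of that two-instance setup (only the instance_group count of 2 comes from the report above; the model name, tensor names, dims and max_batch_size are placeholders):

```
name: "face_lffd"
platform: "tensorrt_plan"
max_batch_size: 64
input [
  {
    name: "data"            # placeholder; use the real binding name of the plan
    data_type: TYPE_FP32
    dims: [ 3, 480, 640 ]
  }
]
output [
  {
    name: "prob"            # placeholder
    data_type: TYPE_FP32
    dims: [ 2 ]
  }
]
instance_group [
  {
    count: 2                # two execution instances of the model on one GPU
    kind: KIND_GPU
  }
]
```

With count: 2, TRTIS creates two execution contexts of the plan on the GPU so that two requests can run concurrently.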
Regarding "Both of those seem to indicate that perhaps the network is your bottleneck":
I tried setting N = 1 in the model file and dynamic batch size = 1 in the config.
perf_client -m face_lffd -u xx.xx.xx.xx:8001 -i gRPC -v -p3000 -d -l3000 -t3 -c10
Request concurrency: 10
Pass [1] throughput: 402 infer/sec. Avg latency: 24835 usec (std 6011 usec)
Pass [2] throughput: 403 infer/sec. Avg latency: 24728 usec (std 5848 usec)
Pass [3] throughput: 404 infer/sec. Avg latency: 24780 usec (std 5759 usec)
Client:
Request count: 1212
Throughput: 404 infer/sec
Avg latency: 24780 usec (standard deviation 5759 usec)
p50 latency: 24260 usec
p90 latency: 31920 usec
p95 latency: 35120 usec
p99 latency: 40517 usec
Avg gRPC time: 24790 usec (marshal 1698 usec + response wait 23010 usec + unmarshal 82 usec)
Server:
Request count: 1453
Avg request latency: 8680 usec (overhead 37 usec + queue 34 usec + compute 8609 usec)
Inferences/Second vs. Client Average Batch Latency
Concurrency: 3, 403 infer/sec, latency 7426 usec
Concurrency: 4, 400 infer/sec, latency 9971 usec
Concurrency: 5, 399 infer/sec, latency 12505 usec
Concurrency: 6, 399 infer/sec, latency 15009 usec
Concurrency: 7, 395 infer/sec, latency 17663 usec
Concurrency: 8, 394 infer/sec, latency 20251 usec
Concurrency: 9, 402 infer/sec, latency 22348 usec
Concurrency: 10, 404 infer/sec, latency 24780 usec
I think the network may be the bottleneck of TRTIS with 2 TRT model instances, because the ONNX model is a ResNeXt-50, an expensive model, yet the batch size is only 1.
On the other hand, I tested with iperf and got around 30 Gb/s, i.e. ~3.75 GB/s.
How can I improve the network between containers? For now it is ~1.4-2.0 GB/s according to docker stats.
If your goal is to use the server and client on the same device, you should leverage the shared memory feature to minimize network overhead. This allows you to pass inputs to and read outputs from TRTIS through system shared memory.
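As an illustration only, a minimal sketch of that pattern with the later tritonclient Python package (pip install tritonclient[grpc]); the 19.09-era Python client API differs, and the URL, tensor names, shapes and output size below are placeholders rather than values from this thread:

```python
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient(url="localhost:8001")   # assumed gRPC endpoint

# Placeholder tensor description; use the real names/shapes from the model config.
input_data = np.zeros((1, 3, 480, 640), dtype=np.float32)
input_bytes = input_data.nbytes
output_bytes = 4 * 1000                                           # assumed output size in bytes

# Create system shared memory regions and register them with the server.
client.unregister_system_shared_memory()
ip_handle = shm.create_shared_memory_region("input_shm", "/input_shm", input_bytes)
op_handle = shm.create_shared_memory_region("output_shm", "/output_shm", output_bytes)
shm.set_shared_memory_region(ip_handle, [input_data])
client.register_system_shared_memory("input_shm", "/input_shm", input_bytes)
client.register_system_shared_memory("output_shm", "/output_shm", output_bytes)

# Point the request at the shared memory regions instead of sending tensors over gRPC.
infer_input = grpcclient.InferInput("data", [1, 3, 480, 640], "FP32")
infer_input.set_shared_memory("input_shm", input_bytes)
infer_output = grpcclient.InferRequestedOutput("prob")
infer_output.set_shared_memory("output_shm", output_bytes)

client.infer(model_name="face_lffd", inputs=[infer_input], outputs=[infer_output])

# Read the result directly out of the output region, then clean up.
result = shm.get_contents_as_numpy(op_handle, np.float32, [1, 1000])
client.unregister_system_shared_memory()
shm.destroy_shared_memory_region(ip_handle)
shm.destroy_shared_memory_region(op_handle)
```

Note that for this to work with the client and server in two containers on the same host, both containers must see the same /dev/shm, e.g. by running them with --ipc=host or otherwise sharing the IPC namespace.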
TRTIS Docker image: 19.09-py3
Client: perf_client
OS: Ubuntu 18.04
CUDA driver: 430.34
GPU model and memory: 1x 2080 Super / 1x 2080 Ti / 1x T4
Model used: (TRT) model.plan file and config
I plan to deploy 1 model with 2 instances on TRTIS, but the GPU utilization is limited to about 50%. The situations I have tried are below.
From the results, running 2 TRTIS servers eats up VRAM but brings no improvement: the throughput and the utilization on the single GPU stay the same.
Remarks: I also tried the default and the dynamic scheduler; both give the same ~400 infer/sec.
Situation 1 - 1 instance:
perf_client -m face_lffd -u xx.xx.xx.xx:8001 -i gRPC -v -p3000 -d -l3000 -t1 -c5 -b64
Situation 2 - 2 instances:
perf_client -m face_lffd -u xx.xx.xx.xx:8001 -i gRPC -v -p3000 -d -l3000 -t1 -c5 -b64
I checked both #269 and #239, and used docker-compose to start 2 TRTIS containers and 2 perf_client containers.
Situation 3 - 1 instance on each of 2 TRTIS servers on 1 GPU:
perf_client on Server A:
perf_client on Server B: