triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Multiple instances limited to 50% utilization #768

Closed: royinx closed this issue 5 years ago

royinx commented 5 years ago

TRTIS Docker image: 19.09-py3
Client: perf_client
OS: Ubuntu 18.04
CUDA Driver: 430.34
GPU model and memory: 1x 2080 Super / 1x 2080 Ti / 1x T4
Model used: (TRT) model.plan file and config

I plan to deploy 1 model with 2 instances on TRTIS, but GPU utilization is limited to about 50%. The situations I have tried are below (the 2-instance config is sketched just after this summary).

From the results below, the second instance (or a second TRTIS) eats up VRAM but brings no improvement: same throughput, same utilization on 1 GPU.

Remarks: I also tried the default and dynamic schedulers; both give the same ~400 infer/s.
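For reference, the 2-instance case corresponds to an instance_group in the model's config.pbtxt, roughly like the sketch below (a sketch only: the tensor names, output dims, and max_batch_size are placeholders, not my exact face_lffd config; the input dims follow the 3 x 480 x 640 FP32 input mentioned later in this thread):

name: "face_lffd"
platform: "tensorrt_plan"
max_batch_size: 64        # placeholder
input [
  {
    name: "INPUT"         # placeholder tensor name
    data_type: TYPE_FP32
    dims: [ 3, 480, 640 ]
  }
]
output [
  {
    name: "OUTPUT"        # placeholder tensor name and dims
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  {
    count: 2              # two instances of the model
    kind: KIND_GPU
    gpus: [ 0 ]           # on the same GPU
  }
]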

Situation1 - 1 instance: perf_client -m face_lffd -u xx.xx.xx.xx:8001 -i gRPC -v -p3000 -d -l3000 -t1 -c5 -b64

VRAM(trtserver): 2469 MiB

GPU utilization: ~35% (batch size 64), ~50% (batch size 1)
===================================================
Request concurrency: 5
  Pass [1] throughput: 448 infer/sec. Avg latency: 717295 usec (std 207798 usec)
  Pass [2] throughput: 426 infer/sec. Avg latency: 716477 usec (std 77407 usec)
  Pass [3] throughput: 426 infer/sec. Avg latency: 727705 usec (std 77921 usec)
  Client: 
    Request count: 20
    Throughput: 426 infer/sec
    Avg latency: 727705 usec (standard deviation 77921 usec)
    p50 latency: 737671 usec
    p90 latency: 831071 usec
    p95 latency: 836330 usec
    p99 latency: 864522 usec
    Avg gRPC time: 732391 usec (marshal 47678 usec + response wait 683189 usec + unmarshal 1524 usec)
  Server: 
    Request count: 26
    Avg request latency: 109105 usec (overhead 51 usec + queue 2572 usec + compute 106482 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 298 infer/sec, latency 207470 usec
Concurrency: 2, 384 infer/sec, latency 322270 usec
Concurrency: 3, 426 infer/sec, latency 438554 usec
Concurrency: 4, 448 infer/sec, latency 558617 usec
Concurrency: 5, 426 infer/sec, latency 727705 usec

Situation2 - 2 instances: perf_client -m face_lffd -u xx.xx.xx.xx:8001 -i gRPC -v -p3000 -d -l3000 -t1 -c5 -b64

VRAM(trtserver): 3931 MiB

GPU utilization: ~35% (batch size 64), ~50% (batch size 1)
===================================================
Request concurrency: 5
  Pass [1] throughput: 448 infer/sec. Avg latency: 729112 usec (std 143922 usec)
  Pass [2] throughput: 469 infer/sec. Avg latency: 754392 usec (std 129833 usec)
  Pass [3] throughput: 448 infer/sec. Avg latency: 747341 usec (std 117667 usec)
  Client: 
    Request count: 21
    Throughput: 448 infer/sec
    Avg latency: 747341 usec (standard deviation 117667 usec)
    p50 latency: 711100 usec
    p90 latency: 899084 usec
    p95 latency: 983171 usec
    p99 latency: 1014742 usec
    Avg gRPC time: 747195 usec (marshal 46338 usec + response wait 699715 usec + unmarshal 1142 usec)
  Server: 
    Request count: 25
    Avg request latency: 127904 usec (overhead 53 usec + queue 1490 usec + compute 126361 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 277 infer/sec, latency 222772 usec
Concurrency: 2, 384 infer/sec, latency 319762 usec
Concurrency: 3, 448 infer/sec, latency 443085 usec
Concurrency: 4, 469 infer/sec, latency 571926 usec
Concurrency: 5, 448 infer/sec, latency 747341 usec

I checked both #269 and #239, and used docker-compose to start 2 TRTIS servers and 2 perf_clients.

Situation3 - 1 instance each on 2 TRTIS servers on 1 GPU:

VRAM(trtserver1): 2469 MiB
VRAM(trtserver2): 2469 MiB
GPU utilization ~ 50%

perf_client on Server A:

Request concurrency: 5
  Pass [1] throughput: 192 infer/sec. Avg latency: 1462076 usec (std 452607 usec)
  Pass [2] throughput: 213 infer/sec. Avg latency: 1645591 usec (std 285035 usec)
  Pass [3] throughput: 213 infer/sec. Avg latency: 1338029 usec (std 169647 usec)
  Pass [4] throughput: 213 infer/sec. Avg latency: 1336935 usec (std 238249 usec)
  Pass [5] throughput: 192 infer/sec. Avg latency: 1468982 usec (std 243846 usec)
  Client: 
    Request count: 9
    Throughput: 192 infer/sec
    Avg latency: 1468982 usec (standard deviation 243846 usec)
    p50 latency: 1429994 usec
    p90 latency: 1699857 usec
    p95 latency: 1905205 usec
    p99 latency: 1905205 usec
    Avg gRPC time: 1449012 usec (marshal 93166 usec + response wait 1349345 usec + unmarshal 6501 usec)
  Server: 
    Request count: 13
    Avg request latency: 176906 usec (overhead 95 usec + queue 1204 usec + compute 175607 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 213 infer/sec, latency 327844 usec
Concurrency: 2, 213 infer/sec, latency 601838 usec
Concurrency: 3, 277 infer/sec, latency 760015 usec
Concurrency: 4, 213 infer/sec, latency 1161671 usec
Concurrency: 5, 192 infer/sec, latency 1468982 usec

perf_client on Server B:

Request concurrency: 5
  Pass [1] throughput: 192 infer/sec. Avg latency: 1345482 usec (std 334982 usec)
  Pass [2] throughput: 170 infer/sec. Avg latency: 1595961 usec (std 171651 usec)
  Pass [3] throughput: 192 infer/sec. Avg latency: 1658977 usec (std 385140 usec)
  Pass [4] throughput: 213 infer/sec. Avg latency: 1446637 usec (std 183416 usec)
  Pass [5] throughput: 384 infer/sec. Avg latency: 868516 usec (std 261350 usec)
  Pass [6] throughput: 448 infer/sec. Avg latency: 756875 usec (std 101381 usec)
  Pass [7] throughput: 426 infer/sec. Avg latency: 747722 usec (std 83247 usec)
  Client: 
    Request count: 20
    Throughput: 426 infer/sec
    Avg latency: 747722 usec (standard deviation 83247 usec)
    p50 latency: 739064 usec
    p90 latency: 839012 usec
    p95 latency: 891422 usec
    p99 latency: 940661 usec
    Avg gRPC time: 753008 usec (marshal 48091 usec + response wait 703167 usec + unmarshal 1750 usec)
  Server: 
    Request count: 25
    Avg request latency: 111378 usec (overhead 57 usec + queue 4076 usec + compute 107245 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, 192 infer/sec, latency 312246 usec
Concurrency: 2, 192 infer/sec, latency 687711 usec
Concurrency: 3, 192 infer/sec, latency 939533 usec
Concurrency: 4, 192 infer/sec, latency 1216420 usec
Concurrency: 5, 426 infer/sec, latency 747722 usec
leo-XUKANG commented 5 years ago

@royinx Hello, I am new to TensorRT Inference Server. For the same model deployed on TensorFlow Serving and on TensorRT Inference Server, is there a difference in speed between them?

deadeyegoodwin commented 5 years ago

For a tensorflow model there is likely not much difference in performance. TRTIS does have some capabilities, like multi-instance support, but for TF models those don't (typically) provide much benefit because of limitations of the TF framework.

For the performance numbers you shared, it is strange that your GPU utilization is lower for a larger batch size. Also, your response wait time is large compared to the compute time. Both of those seem to indicate that perhaps the network is your bottleneck. Are you running perf_client on the same system as the inference server?

royinx commented 5 years ago

I am using 2 containers on the same desktop: TRTIS 19.09-py3 and a client built from the TRTIS client Dockerfile.

CPU: 3700X, GPU: 2080 Super, RAM: 32GB

I tested the bandwidth between the two containers, and the network bandwidth is probably not the bottleneck.

FP32 (4 bytes) * 3 ch * 480 h * 640 w * ~400 infer/s ≈ 1.37 GB/s
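The same arithmetic as a quick check (the ~400 infer/s is taken from the perf_client runs above; this only counts the raw input payload, ignoring protocol overhead):

# Rough input-payload bandwidth estimate for the observed throughput.
bytes_per_image = 4 * 3 * 480 * 640          # FP32 * channels * height * width ≈ 3.7 MB
infer_per_sec = 400                          # approximate throughput from perf_client above
print(bytes_per_image * infer_per_sec / 1024**3, "GiB/s")   # ≈ 1.37, the figure above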

Docker stats

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
81c4c69fc7f1        trtis_perf_tester   0.00%               3.617MiB / 31.42GiB   0.01%               3.61GB / 26.5GB     512kB / 0B          1
a0a57751fbfe        trtis               0.03%               6.624GiB / 31.42GiB   21.08%              26.5GB / 3.6GB      63.2MB / 0B         40

perf_client

root@81c4c69fc7f1:/workspace# perf_client -m face_lffd -u  XX.XX.XX.XX:8001 -i gRPC -v -p3000 -d -l3000 -t5 -c5 -b64
*** Measurement Settings ***
  Batch size: 64
  Measurement window: 3000 msec
  Latency limit: 3000 msec
  Concurrency limit: 5 concurrent requests
  Stabilizing using average latency

Request concurrency: 5
  Pass [1] throughput: 42 infer/sec. Avg latency: 1098230 usec (std 65211 usec)
  Pass [2] throughput: 384 infer/sec. Avg latency: 899362 usec (std 368437 usec)
  Pass [3] throughput: 469 infer/sec. Avg latency: 703719 usec (std 124942 usec)
  Pass [4] throughput: 426 infer/sec. Avg latency: 760636 usec (std 132688 usec)
  Pass [5] throughput: 448 infer/sec. Avg latency: 739009 usec (std 97395 usec)
  Client: 
    Request count: 21
    Throughput: 448 infer/sec
    Avg latency: 739009 usec (standard deviation 97395 usec)
    p50 latency: 749726 usec
    p90 latency: 862850 usec
    p95 latency: 908628 usec
    p99 latency: 984573 usec
    Avg gRPC time: 733348 usec (marshal 48497 usec + response wait 683904 usec + unmarshal 947 usec)
  Server: 
    Request count: 25
    Avg request latency: 120107 usec (overhead 60 usec + queue 4227 usec + compute 115820 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 5, 448 infer/sec, latency 739009 usec

iperf network testing

TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local XX.XX.XX.XX port 19999 connected with XX.XX.XX.XX port 33546
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  35.0 GBytes  30.1 Gbits/sec
[  4] local XX.XX.XX.XX port 19999 connected with XX.XX.XX.XX port 33560
[  4]  0.0-10.0 sec  35.8 GBytes  30.7 Gbits/sec
royinx commented 5 years ago

@royinx Hello, I am new to TensorRT Inference Server. For the same model deployed on TensorFlow Serving and on TensorRT Inference Server, is there a difference in speed between them?

I haven't tried a TF model, since if you check the issues there are a lot of problems with memory initialization and model optimization.

My work is mostly in PyTorch and MXNet. I have tried LibTorch (a nightmare), ONNX, TF, and TensorRT. I highly recommend TensorRT (with model simplification) > ONNX Runtime > the native framework.

royinx commented 5 years ago

More information

| # | Model Instance | UTIL | VRAM | Throughput |
| --- | --- | --- | --- | --- |
| 1 | WITHOUT TRTIS, 2 x TRT | 100% | 2 x 619MB | > 1000 (514 + 520) |
| 2 | 1 x TRTIS, 2 x TRT | 50% | / | TRT - 340~400 |
| 3 | 1 x TRTIS, 1 x TRT_A + 1 x TRT_B | ~50% | / | TRT_A - 224, TRT_B - 224 |

I also tested 3 different models together on 1 TRTIS, each with its own perf_client container. The utilization then works out to around 60-90%.

So I only see the problem when running 2 instances of the same TRT model.


Network

Regarding "Both of those seem to indicate that perhaps the network is your bottleneck":

I tried setting N = 1 in the model file and a dynamic batch size of 1 in the config (roughly the config sketched below).
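Roughly, in config.pbtxt terms, that is something like this (a sketch with illustrative field values, not my verbatim config):

max_batch_size: 1
dynamic_batching {
  preferred_batch_size: [ 1 ]   # dynamic batch size of 1
}
instance_group [
  {
    count: 1                    # N = 1: a single model instance
    kind: KIND_GPU
  }
]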

perf_client -m face_lffd -u xx.xx.xx.xx:8001 -i gRPC -v -p3000 -d -l3000 -t3 -c10

Request concurrency: 10
  Pass [1] throughput: 402 infer/sec. Avg latency: 24835 usec (std 6011 usec)
  Pass [2] throughput: 403 infer/sec. Avg latency: 24728 usec (std 5848 usec)
  Pass [3] throughput: 404 infer/sec. Avg latency: 24780 usec (std 5759 usec)
  Client: 
    Request count: 1212
    Throughput: 404 infer/sec
    Avg latency: 24780 usec (standard deviation 5759 usec)
    p50 latency: 24260 usec
    p90 latency: 31920 usec
    p95 latency: 35120 usec
    p99 latency: 40517 usec
    Avg gRPC time: 24790 usec (marshal 1698 usec + response wait 23010 usec + unmarshal 82 usec)
  Server: 
    Request count: 1453
    Avg request latency: 8680 usec (overhead 37 usec + queue 34 usec + compute 8609 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 3, 403 infer/sec, latency 7426 usec
Concurrency: 4, 400 infer/sec, latency 9971 usec
Concurrency: 5, 399 infer/sec, latency 12505 usec
Concurrency: 6, 399 infer/sec, latency 15009 usec
Concurrency: 7, 395 infer/sec, latency 17663 usec
Concurrency: 8, 394 infer/sec, latency 20251 usec
Concurrency: 9, 402 infer/sec, latency 22348 usec
Concurrency: 10, 404 infer/sec, latency 24780 usec

I think the network may be the bottleneck for TRTIS with 2 x TRT models, because the ONNX model is ResNeXt-50, an expensive model, yet here batch_size is only 1.

I also tested with iperf; the result is around 30 Gbit/s (~3.75 GB/s).

How can I improve the network between the containers? For now it is ~1.4 - 2.0 GB/s according to docker stats.

CoderHam commented 5 years ago

If your goal is to use the server and client on the same device, you should leverage the shared memory feature to minimize network overhead. This allows you to pass inputs to and store outputs from TRTIS in system shared memory.
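A rough sketch of what that looks like from the Python client (for illustration this uses the current tritonclient gRPC API rather than the 19.09-era tensorrtserver client, and the input tensor name and shape are placeholders):

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder input tensor: FP32, batch of 1, 3 x 480 x 640.
input_data = np.random.rand(1, 3, 480, 640).astype(np.float32)
byte_size = input_data.nbytes

# Create a system shared memory region, copy the input into it,
# and register the region with the server.
shm_handle = shm.create_shared_memory_region("input_shm", "/input_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])
client.register_system_shared_memory("input_shm", "/input_shm", byte_size)

# Point the request input at the shared memory region instead of
# sending the tensor bytes over gRPC.
infer_input = grpcclient.InferInput("INPUT", list(input_data.shape), "FP32")
infer_input.set_shared_memory("input_shm", byte_size)

result = client.infer(model_name="face_lffd", inputs=[infer_input])

# Clean up the region when done.
client.unregister_system_shared_memory("input_shm")
shm.destroy_shared_memory_region(shm_handle)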