triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.

Triton-ONNXRT-TRT performance issue #30

Open · mayani-nv opened this issue 3 years ago

mayani-nv commented 3 years ago

Description: I downloaded the YOLOv3 model weights from here. Then, using the TensorRT sample scripts, I was able to get the corresponding ONNX model file. The resulting ONNX model is similar to the one downloaded from the ONNX model zoo (which uses the same weights but was converted using keras2onnx).

Next, I ran perf_analyzer on this ONNX model using different backends (an example invocation is sketched after the list below) and got the following:

  1. Triton-ONNXRT-CUDA: Used the .onnx model file, ran it with the onnxruntime backend, and got the following output:

    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 0.6 infer/sec, latency 1498616 usec
    Concurrency: 2, throughput: 0.8 infer/sec, latency 2237485 usec
    Concurrency: 3, throughput: 0.6 infer/sec, latency 3406846 usec
    Concurrency: 4, throughput: 0.6 infer/sec, latency 4570913 usec
  2. Triton-ONNXRT-TRT: Used the same .onnx model file but added the GPU accelerator tensorrt (still ran with the onnxruntime backend; a config sketch is included at the end of this comment) and got the following output:

    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 1.2 infer/sec, latency 854637 usec
    Concurrency: 2, throughput: 2 infer/sec, latency 1011748 usec
    Concurrency: 3, throughput: 1.8 infer/sec, latency 1516845 usec
    Concurrency: 4, throughput: 1.8 infer/sec, latency 2023850 usec
  3. Triton-TRT: Converted the .onnx file to a .trt engine and ran it with the tensorrt backend, getting the following output:

    Inferences/Second vs. Client Average Batch Latency
    Concurrency: 1, throughput: 34.4 infer/sec, latency 29134 usec
    Concurrency: 2, throughput: 66 infer/sec, latency 30218 usec
    Concurrency: 3, throughput: 64.6 infer/sec, latency 46344 usec
    Concurrency: 4, throughput: 70.8 infer/sec, latency 56346 usec

    Why is the performance of the Triton-ONNXRT-TRT path so much slower than the Triton-TRT backend? I used a Quadro RTX 8000 (same Turing architecture as the T4) for this experiment.
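
For reference, an example perf_analyzer invocation for sweeping these concurrency levels would look roughly like the one below; the model name yolov3_onnx and the gRPC endpoint are placeholders, not necessarily the exact values used for these numbers.

    perf_analyzer -m yolov3_onnx -u localhost:8001 -i grpc --concurrency-range 1:4

This sweeps client concurrency from 1 to 4 and reports the Inferences/Second vs. Client Average Batch Latency lines shown above.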

Triton Information: NGC container v20.12
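
For reference, the TensorRT accelerator in case 2 is enabled through the model's config.pbtxt rather than by changing the model file. A minimal sketch following the onnxruntime_backend documentation (the model name is a placeholder):

    name: "yolov3_onnx"
    platform: "onnxruntime_onnx"
    optimization {
      execution_accelerators {
        gpu_execution_accelerator : [ { name : "tensorrt" } ]
      }
    }

Case 1 uses the same model without the optimization block, and case 3 uses a serialized TensorRT engine served directly by the tensorrt backend instead.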

ppyun commented 3 years ago

By the way, @mayani-nv communicated to me that all experimental results are in FP32.

askhade commented 3 years ago

@mayani-nv: PR https://github.com/triton-inference-server/onnxruntime_backend/pull/42, which enables I/O binding, should help with performance. Can you run your tests again once this is checked in?

I have not done any perf profiling for this model, so I can't say for sure whether this PR will bring performance on par, but it should definitely help.

Did you mean container version 21.02?

mayani-nv commented 2 years ago

I tried running the above tests with the Triton v21.09 container using ORT-TRT in Triton with FP32 enabled, and got the following:

Concurrency: 1, throughput: 0.8 infer/sec, latency 1252700 usec
Concurrency: 2, throughput: 1.1 infer/sec, latency 1842821 usec
Concurrency: 3, throughput: 1 infer/sec, latency 2780213 usec
Concurrency: 4, throughput: 1 infer/sec, latency 3710178 usec

I also tried ORT-TRT in Triton with FP16 enabled (a config sketch is included at the end of this comment) and got the following:

Concurrency: 1, throughput: 1.42857 infer/sec, latency 718673 usec
Concurrency: 2, throughput: 2.42857 infer/sec, latency 817400 usec
Concurrency: 3, throughput: 2.42857 infer/sec, latency 1229651 usec
Concurrency: 4, throughput: 2.42857 infer/sec, latency 1644229 usec

Whether with FP16 or FP32, ORT-TRT in Triton is still considerably slower than inferencing with the pure TRT model. Is this expected behavior?
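
For reference, FP16 for the ORT-TRT path is requested through parameters on the TensorRT accelerator in config.pbtxt, roughly as in the sketch below (parameter names follow the onnxruntime_backend README; the workspace size is just an example value):

    optimization {
      execution_accelerators {
        gpu_execution_accelerator : [ {
          name : "tensorrt"
          parameters { key: "precision_mode" value: "FP16" }
          parameters { key: "max_workspace_size_bytes" value: "1073741824" }
        } ]
      }
    }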

rgov commented 2 years ago

Noting that I also see much lower performance from ORT-TRT than TRT outside of Triton.