mayani-nv opened this issue 3 years ago
By the way, @mayani-nv communicated to me that all experimental results are on FP32.
@mayani-nv : This PR https://github.com/triton-inference-server/onnxruntime_backend/pull/42 to enable io binding should help with perf. Can you run your tests again once this is checked in?
I have not done any perf profiling for this model, so I can't say for sure whether this PR will bring perf on par, but it should definitely help.
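For context on why that PR should help: I/O binding keeps a model's inputs and outputs in GPU memory so ONNX Runtime doesn't copy tensors to and from the host on every request. A minimal standalone sketch of the same idea in the ONNX Runtime Python API (model path, input name, and shape are placeholders, and it assumes a single image input):

```python
import numpy as np
import onnxruntime as ort

# Placeholder single-input model; the zoo yolov3 also takes an image_shape input.
sess = ort.InferenceSession(
    "yolov3.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

io = sess.io_binding()

# Copy the input to GPU memory once and bind it, instead of feeding a host array per run.
x = np.random.rand(1, 3, 416, 416).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)
io.bind_ortvalue_input(sess.get_inputs()[0].name, x_gpu)

# Ask ORT to allocate the outputs on the GPU so results stay on-device.
for out in sess.get_outputs():
    io.bind_output(out.name, "cuda", 0)

sess.run_with_iobinding(io)
results = io.copy_outputs_to_cpu()  # explicit copy back only when the host needs the data
```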
Did you mean container version 21.02?
I tried running the above tests with the Triton v21.09 container, using ORT-TRT in Triton with FP32 enabled, and got the following:
Concurrency: 1, throughput: 0.8 infer/sec, latency 1252700 usec
Concurrency: 2, throughput: 1.1 infer/sec, latency 1842821 usec
Concurrency: 3, throughput: 1 infer/sec, latency 2780213 usec
Concurrency: 4, throughput: 1 infer/sec, latency 3710178 usec
I also tried ORT-TRT in Triton with FP16 enabled and got the following:
Concurrency: 1, throughput: 1.42857 infer/sec, latency 718673 usec
Concurrency: 2, throughput: 2.42857 infer/sec, latency 817400 usec
Concurrency: 3, throughput: 2.42857 infer/sec, latency 1229651 usec
Concurrency: 4, throughput: 2.42857 infer/sec, latency 1644229 usec
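For reference, the TensorRT accelerator is enabled under the onnxruntime backend through the model's config.pbtxt; a sketch of the relevant optimization block (the precision_mode parameter is what switches between the FP32 and FP16 runs above, and the workspace size here is just an example value):

```
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key : "precision_mode" value : "FP16" }  # or "FP32"
        parameters { key : "max_workspace_size_bytes" value : "1073741824" }
      }
    ]
  }
}
```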
Even with FP16 (vs. FP32), ORT-TRT in Triton still seems considerably slower than inference with the pure TRT model. Is this expected behavior?
Noting that I also see much lower performance from ORT-TRT than TRT outside of Triton.
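In case it helps others reproduce that standalone comparison, a rough sketch of the ORT-TRT side of it in the ONNX Runtime Python API (model path, input shape, and iteration count are placeholders, and it assumes a single image input):

```python
import time
import numpy as np
import onnxruntime as ort

# TensorRT execution provider with FP16 enabled, falling back to CUDA for unsupported ops.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True, "trt_max_workspace_size": 1 << 30}),
    "CUDAExecutionProvider",
]
sess = ort.InferenceSession("yolov3.onnx", providers=providers)

# Assumes a single image input; adjust the feed for models with extra inputs.
x = np.random.rand(1, 3, 416, 416).astype(np.float32)
feed = {sess.get_inputs()[0].name: x}

sess.run(None, feed)  # first run triggers the TRT engine build, so keep it out of the timing

n = 20
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)
print("avg latency: %.1f ms" % ((time.perf_counter() - start) / n * 1e3))
```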
Description
I downloaded the yolov3 model weights from here. Then, using the TensorRT sample scripts, I was able to get the corresponding ONNX model file. The obtained ONNX model file is similar to the one downloaded from the ONNX model zoo (which uses the same weights but is converted using keras2onnx).
Next, I ran the perf analyzer on this ONNX model using different backends and got the following:

- Triton-ONNXRT-CUDA: used the .onnx model file and ran it with the onnxruntime backend; got the output below.
- Triton-ONNXRT-TRT: used the .onnx model file but added the GPU accelerator as tensorrt (still ran with the onnxruntime backend); got the output below.
- Triton-TRT: converted the .onnx file to a .trt file and ran it with the tensorrt backend; got the output below.

[Inferences/Second vs. Client Average Batch Latency plot]

Why is the performance of the Triton-ONNXRT-TRT backend slow compared to the Triton-TRT backend? I used a Quadro RTX 8000 (same Turing architecture as the T4) for this experiment.

Triton Information
NGC container v20.12
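For anyone trying to reproduce these numbers, a rough sketch of the commands involved (model name, endpoint, and engine-build flags are assumptions, not necessarily what was used here):

```
# Concurrency sweep against the model served by Triton (gRPC endpoint assumed on port 8001)
perf_analyzer -m yolov3_onnx -u localhost:8001 -i grpc --concurrency-range 1:4

# One way to build a standalone TensorRT engine from the same ONNX file
trtexec --onnx=yolov3.onnx --saveEngine=yolov3.trt --workspace=4096
```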