microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] version 1.17.1 causes performance regression over 1.16.3 both with TRT EP and Cuda EP on Faster-RCNN model inference #19955

Open jcdatin opened 7 months ago

jcdatin commented 7 months ago

Describe the issue

I observed that inference performance has decreased by at least 25% with the TRT EP and 18% with the CUDA EP on my Faster-RCNN model: from 2860ms to 3785ms and from 3015ms to 3680ms respectively. (Interestingly, with ORT 1.17.1 the CUDA EP becomes faster than the TRT EP!)

To reproduce

My model is private, but https://pytorch.org/vision/main/models/faster_rcnn.html should do, since my model is also based on torchvision. A minimal export-and-run sketch follows below.
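
Below is a minimal sketch (not the reporter's actual code) of exporting torchvision's fasterrcnn_resnet50_fpn to ONNX and timing one inference with onnxruntime; the input size, opset, and input/output names are assumptions:

```python
# Minimal repro sketch, assuming an 800x800 input and opset 11; names are placeholders.
import time

import torch
import torchvision
import onnxruntime as ort

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
dummy = [torch.randn(3, 800, 800)]  # torchvision detection models take a list of 3D tensors

torch.onnx.export(
    model, (dummy,), "faster_rcnn.onnx",
    opset_version=11,
    input_names=["images"],
    output_names=["boxes", "labels", "scores"],
)

# Prefer TensorRT, fall back to CUDA, then CPU.
providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
sess = ort.InferenceSession("faster_rcnn.onnx", providers=providers)

x = dummy[0].numpy()
start = time.perf_counter()
outputs = sess.run(None, {"images": x})
print(f"inference took {time.perf_counter() - start:.3f}s")
```

As noted further down in the thread, the TensorRT EP may still reject this export until symbolic shape inference has been run on the model; see the shape-inference snippet below.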

Urgency

medium: need to fix within 3 months

Platform

Linux

OS Version

SLES15 SP4 and SLES15 SP5

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.17.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA, TensorRT

Execution Provider Library Version

CUDA 12.2 and TRT 8.6.1.6

Model File

Can't share. Note that I am using an RTX 8000 and inference is done with NVIDIA TF32 with both the TRT and CUDA EPs (can't use FP16). ONNX Runtime and its CUDA modules were compiled with gcc 11.

Is this a quantized model?

No

jcdatin commented 7 months ago

How can we make progress here?

jcdatin commented 7 months ago

I tried on Turing and Ada GPUs, and I tried different cuDNN versions; 1.17.1 inference on the Faster-RCNN is always slower by at least 30%. Note that the TRT version used is the same between the two ONNX Runtime versions. I guess I need to build a demonstrator I can share, but is there any guidance on determining which ONNX operator is taking longer in 1.17.1 with my production model?
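
One common way to get per-operator timings is onnxruntime's built-in profiler. A minimal sketch, assuming the model and input name from the export sketch above (file names are placeholders):

```python
# Profiling sketch: enables ORT's profiler and writes a Chrome-trace style JSON.
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True                 # record per-node timings
opts.profile_file_prefix = "ort_1_17_1"      # prefix of the output JSON

sess = ort.InferenceSession(
    "faster_rcnn.onnx",
    sess_options=opts,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.rand(3, 800, 800).astype(np.float32)
for _ in range(10):                          # several runs so per-node times stabilize
    sess.run(None, {"images": x})

print("profile written to", sess.end_profiling())  # e.g. ort_1_17_1_<timestamp>.json
```

The C++ API exposes the same switch through Ort::SessionOptions::EnableProfiling, so the production (C++) application can produce the same JSON.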

jcdatin commented 7 months ago

I did not succeed in executing inference of the torchvision fasterrcnn_resnet50_fpn open-source model with the onnxruntime TensorRT Execution Provider (I used torchvision's https://github.com/pytorch/vision/blob/main/test/test_onnx.py TestONNXExporter::test_faster_rcnn).

I did succeed in running it with the onnxruntime CUDA Execution Provider, but not with the TensorRT Execution Provider.

The torchvision test falls back to the CUDA Execution Provider because the exported Faster-RCNN model is missing the inferred dimensions that TensorRT requires.

Unfortunately, the onnxruntime/tools/symbolic_shape_infer.py tool also crashes when converting the exported ONNX Faster-RCNN model for TensorRT (I needed to run this tool on my model too in order to run it on the TensorRT Execution Provider).
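
For reference, a sketch of how symbolic shape inference is usually applied before handing a model to the TensorRT EP; the auto_merge/guess_output_rank settings are guesses that sometimes help on detection models, not a known workaround for the crash described above:

```python
# Shape-inference sketch; file names are placeholders.
import onnx
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference

model = onnx.load("faster_rcnn.onnx")
inferred = SymbolicShapeInference.infer_shapes(model, auto_merge=True, guess_output_rank=True)
onnx.save(inferred, "faster_rcnn_shaped.onnx")

# CLI equivalent:
#   python -m onnxruntime.tools.symbolic_shape_infer \
#       --input faster_rcnn.onnx --output faster_rcnn_shaped.onnx --auto_merge
```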

So I am stuck on providing non-IP demonstration code of Faster-RCNN for the onnxruntime TensorRT Execution Provider, and hence on helping to pinpoint which ONNX operator regresses in ONNX Runtime 1.17.1 versus 1.16.3.

I need instructions for determining myself which ONNX operator regresses on my model, so I can report it to the onnxruntime team. Any help?
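
One way to narrow down the regressing operator without sharing the model is to collect a profiler JSON (see the profiling sketch above) under both 1.16.3 and 1.17.1 and diff the per-operator totals. A sketch, assuming the default ORT profile format (a JSON array of trace events where node events carry op_name and dur in microseconds); file names are placeholders:

```python
# Compare per-operator time between two ORT profiler outputs.
import json
from collections import defaultdict

def per_op_time_us(path):
    totals = defaultdict(int)
    with open(path) as f:
        for ev in json.load(f):
            # Node-level events carry the operator type in their args.
            if ev.get("cat") == "Node" and "op_name" in ev.get("args", {}):
                totals[ev["args"]["op_name"]] += ev.get("dur", 0)
    return totals

old = per_op_time_us("ort_1_16_3_profile.json")
new = per_op_time_us("ort_1_17_1_profile.json")
for op in sorted(set(old) | set(new), key=lambda o: new.get(o, 0) - old.get(o, 0), reverse=True)[:20]:
    print(f"{op:30s} 1.16.3={old.get(op, 0):>10d}us  1.17.1={new.get(op, 0):>10d}us")
```

Caveat: with the TensorRT EP most of the graph is fused into a single TensorRT node, so the per-operator breakdown is mainly meaningful for the CUDA EP run (or for the nodes that fall back to CUDA/CPU under the TRT EP).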