[Performance] CUDAExecutionProvider without RoiAlign (opset 16 version)

YuriGao commented 2 months ago

Describe the issue

I'm using a Cascade Mask R-CNN model from detectron2. When exported to ONNX, the model file contains RoiAlign (opset 16 version). When running on ONNX Runtime with the CUDA EP, inference is too slow because RoiAlign runs on the CPU EP. Could anyone provide RoiAlign (opset 16 version) for the CUDA EP?

To reproduce

1. Export Cascade Mask R-CNN from detectron2.
2. Run the model in ONNX Runtime with the CUDA EP.

Urgency

No response

Platform

Windows

OS Version

Win10

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8 and CUDA 12.2

Model File

No response

Is this a quantized model?

No

YuriGao commented 2 months ago

To run fast on the CUDA EP, I have to use RoiAlign (opset 10 version) and insert a Sub op before RoiAlign's rois input. Note that the Sub value depends on RoiAlign's spatial_scale attribute: it should be 0.5 / RoiAlign["spatial_scale"]. It would be good for everyone if someone could upgrade the current RoiAlign CUDA EP implementation.
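
For anyone wondering where 0.5 / spatial_scale comes from, here is a minimal sketch of the arithmetic in plain Python. It assumes the opset 16 node uses the half-pixel coordinate mode, which as far as I understand maps a roi coordinate x to x * spatial_scale - 0.5, while the opset 10 version maps it to x * spatial_scale. The function names and sample values are made up for illustration and are not part of onnx or onnxruntime.

```python
def shift_rois(rois, spatial_scale):
    """Mimic the inserted Sub op: subtract 0.5 / spatial_scale from every coordinate."""
    offset = 0.5 / spatial_scale
    return [[c - offset for c in box] for box in rois]


def to_feature_coords(rois, spatial_scale, half_pixel):
    """Map roi coordinates to feature-map coordinates.

    half_pixel=True  -> assumed opset 16 behaviour (extra -0.5 shift)
    half_pixel=False -> assumed opset 10 behaviour (no shift)
    """
    shift = 0.5 if half_pixel else 0.0
    return [[c * spatial_scale - shift for c in box] for box in rois]


# Hypothetical values: one box in [x1, y1, x2, y2] form, feature map at 1/4 resolution.
spatial_scale = 0.25
rois = [[8.0, 16.0, 40.0, 72.0]]

opset16 = to_feature_coords(rois, spatial_scale, half_pixel=True)
opset10 = to_feature_coords(shift_rois(rois, spatial_scale), spatial_scale, half_pixel=False)

print(opset16)  # [[1.5, 3.5, 9.5, 17.5]]
print(opset10)  # [[1.5, 3.5, 9.5, 17.5]]
```

Since (x - 0.5 / spatial_scale) * spatial_scale equals x * spatial_scale - 0.5, both paths land on the same feature-map coordinates. This only checks the coordinate shift; any other behavioural difference between the two opset versions is not covered here.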

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
