microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

[Performance] ScatterND / GridSample operators are on CPU instead of GPU / CUDA #20297

Open tikr7 opened 7 months ago

tikr7 commented 7 months ago

Describe the issue

We exported the Hugging Face transformer model OneFormer to ONNX.

Exporting with opset 20 failed with the following errors:

OnnxExporterWarning: Exporting to ONNX opset version 20 is not supported. by 'torch.onnx.export()'. The highest opset version supported is 17.

ValueError: Unsupported ONNX opset version: 20

With opset 19 we were able to export to ONNX, but ONNX Runtime places the ScatterND / GridSample operators on the CPU instead of the GPU / CUDA. This reduces performance by a factor of at least 4.
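
For context, the export was done roughly like this (a minimal sketch, not our exact script; the checkpoint name, input shapes, and input/output names are assumptions):

```python
import torch
from transformers import OneFormerForUniversalSegmentation

# Load an OneFormer checkpoint (checkpoint name here is only an example).
model = OneFormerForUniversalSegmentation.from_pretrained(
    "shi-labs/oneformer_ade20k_swin_tiny"
).eval()
model.config.return_dict = False  # plain tuple outputs export more cleanly

# Dummy inputs roughly matching the processor output (shapes are assumptions).
pixel_values = torch.randn(1, 3, 512, 512)
task_inputs = torch.randint(0, 49408, (1, 77))

torch.onnx.export(
    model,
    (pixel_values, task_inputs),
    "oneformer.onnx",
    opset_version=19,  # opset 20 fails with the exporter error above
    input_names=["pixel_values", "task_inputs"],
    output_names=["class_queries_logits", "masks_queries_logits"],
    do_constant_folding=True,
)
```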

The first screenshot shows an Nvidia Nsight trace of the pure PyTorch model in Python: [screenshot: unnamed (1)]

The second screenshot shows the same run with ONNX Runtime in Python: [screenshot: unnamed (2)]

With PyTorch, GPU utilization is high and inference is fast, while ONNX Runtime makes heavy use of the CPU and frequently transfers data between VRAM and RAM, which lowers GPU utilization and model inference speed.

Relevant logs from ONNX Runtime:

2024-04-11 19:01:52.425810000 [V:onnxruntime:, session_state.cc:1152 VerifyEachNodeIsAssignedToAnEp]  Node(s) placed on [CPUExecutionProvider]. Number of nodes: 2310

2024-04-11 19:01:51.327168362 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: ScatterND node name: /model/pixel_level_module/encoder/encoder/layers.1/blocks.1/ScatterND

2024-04-11 19:01:51.333633901 [I:onnxruntime:, cuda_execution_provider.cc:2397 GetCapability] CUDA kernel not found in registries for Op type: GridSample node name: /model/pixel_level_module/decoder/encoder/layers.2/self_attn/GridSample_2

2024-04-11 19:56:12.910005749 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.

According to the documentation, the ScatterND / GridSample operators should be supported on CUDA since opset 18.
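
The placement logs above can be reproduced with verbose session logging (a minimal sketch; the model path is an assumption):

```python
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 0  # 0 = VERBOSE; prints the per-node EP assignment lines quoted above

sess = ort.InferenceSession(
    "oneformer.onnx",  # path is an assumption
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Providers actually active for this session (CUDAExecutionProvider should be listed first).
print(sess.get_providers())
```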

Further information

pytorch 2.2.2
onnx 1.16.0
onnxruntime-gpu 1.17.1
cuda 11.8 (also tried 12.3)
python 3.9
opset 19

To reproduce

If you need more details on how to reproduce this, we can provide the model and everything else required.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.17.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

cuda 11.8 (also tried 12.3)

Model File

179 MB zipped is too big for GitHub

Is this a quantized model?

Unknown

xadupre commented 7 months ago

This PR should solve it: https://github.com/microsoft/onnxruntime/pull/19540.

tikr7 commented 6 months ago

In the meantime we converted the ONNX model to TensorRT, which is even 2x faster than pure PyTorch.
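
For anyone who wants a similar path without leaving ONNX Runtime, its TensorRT execution provider can take over most of the graph (a sketch only; this is not necessarily the conversion path we used, and the provider options are illustrative):

```python
import onnxruntime as ort

# TensorRT EP first, with CUDA/CPU as fallbacks for any nodes TensorRT cannot take.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,           # optional FP16 for extra speed
        "trt_engine_cache_enable": True,   # cache built engines between runs
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

sess = ort.InferenceSession("oneformer.onnx", providers=providers)
```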

1059692261 commented 5 months ago

I am still seeing the warning "CUDA kernel not found in registries for Op type: GridSample" and experiencing severe performance degradation.

Further information:
pytorch 2.4.0.dev20240515+cu121
onnx 1.16.0
onnxruntime-gpu 1.18.0
cuda 12.2 (also tried 11.8)
python 3.8
opset 20

I also notice the GridSample CUDA kernel has already been implemented and is waiting to be merged (see #18958). May I ask when this PR will be merged? @xadupre

tianlinzx commented 5 months ago

Is there any estimate of the timeline? @xadupre