microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] #19479

Open sallamander317 opened 4 months ago

sallamander317 commented 4 months ago

Describe the issue

I have a script that runs inference with an ONNX model compiled from a PyTorch model. When running inference on the GPU, it intermittently fails with the error:

return self._sess.run(output_names, input_feed, run_options)

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Expand node. Name:'/Expand_591' Status Message: /Expand_591: left operand cannot broadcast on dim 0 LeftShape: {243}, RightShape: {267}.

Some additional notes:

  1. In the script (see below), I'm running inference 10x (via a for loop). When it fails, it fails on the first iteration of the for loop and crashes the script. But if I re-run the script, it sometimes doesn't fail on that first iteration and completes successfully. So the intermittency seems to be across runs of the script, not across iterations of the for loop.
  2. Each time it hits the error, the same Expand_591 node is called out and the RightShape {267} stays the same; however, the LeftShape (243 in the error example above) changes.
  3. When we compiled the model on the GPU but ran it on the CPU, it ran successfully every time. However, it did not produce the same results as the underlying PyTorch model (see the parity-check sketch after this list).
  4. When we compiled this same model on CPU and tested using the CPUExecutionProvider, we ran into this error 100% of the time:
2024-02-08 22:53:05.966710901 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Gather node. 
Name:'/Gather_2452' Status Message: indices element out of data bounds, 
idx=264 must be within the inclusive range [-264,263] 

Traceback (most recent call last):
  File "/home/ec2-user/projects/onnx/test_onnx_model.py", line 60, in <module>
    ort_outs = ort_session.run(None, ort_inputs)
  File "/home/ec2-user/anaconda3/envs/onnx/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 220, in run
    return self._sess.run(output_names, input_feed, run_options)
  5. We got several different flavors of warnings when compiling.
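
For reference, the kind of parity check behind note 3 can be sketched like this (the model constructor and file name are placeholders; the real model can't be shared):

import numpy as np
import onnxruntime as ort
import torch

# Hypothetical model object and file name; the real model can't be shared.
model = build_retinanet().eval()                 # placeholder for the PyTorch model
x = torch.randn(1, 3, 1280, 896)

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
ort_out = sess.run(None, {sess.get_inputs()[0].name: x.numpy()})[0]

with torch.no_grad():
    torch_out = model(x)

# Raises if the ONNX output drifts from the PyTorch output beyond tolerance.
# (Assumes a single-tensor output; a detection head may return several tensors.)
np.testing.assert_allclose(torch_out.numpy(), ort_out, rtol=1e-3, atol=1e-5)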

To reproduce

Script I'm using to test (with private details removed):

import onnx
import onnxruntime
import torch
import numpy as np

device = torch.device("cuda")
input_tensor = torch.randn(1, 3, 1280, 896)
input_tensor = input_tensor.to(device)

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

onnx_model = onnx.load("exp_06_aug_stacked_strong_v5_step_50_epoch_69.onnx")
onnx.checker.check_model(onnx_model)

ort_session = onnxruntime.InferenceSession(
    "exp_06_aug_stacked_strong_v5_step_50_epoch_69.onnx",
    providers=['CUDAExecutionProvider']
)
# compute ONNX Runtime output prediction
ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(input_tensor)}
for idx in range(10):
    ort_outs = ort_session.run(None, ort_inputs)

Urgency

Urgent on a one-month timeframe: we're blocked from deploying an ONNX model by this.

Platform

Linux

OS Version

Amazon Linux 2

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

12.2

Model File

Can't share it for privacy reasons.

Is this a quantized model?

No

tianleiwu commented 4 months ago

I guess your modeling script has some tensor shapes that are dynamic based on the input tensor. You can take an input tensor that has triggered the ORT error, use torch to export a new ONNX model with that input tensor, and then use the Netron tool to view the graph (focus on the nodes ORT reports errors for, like Expand_591 and Gather_2452).

You can also take a look at the TracerWarnings emitted during torch ONNX export and modify your PyTorch modeling script to resolve those warnings.
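
For illustration, a minimal re-export sketch along those lines (the model object, input, and file names here are placeholders, not taken from this issue):

import torch

# Hypothetical model and input; substitute the exact tensor that triggered the ORT error.
model = build_retinanet().eval()                 # placeholder for the PyTorch model
failing_input = torch.randn(1, 3, 1280, 896)

torch.onnx.export(
    model,
    failing_input,
    "debug_export.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=17,
    # Without dynamic_axes, shapes seen at trace time get baked into the graph.
    # dynamic_axes={"input": {0: "batch"}},
)
# Open debug_export.onnx in Netron and inspect the nodes ORT complained about
# (e.g. Expand_591, Gather_2452) to see which shapes were fixed at export time.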

sallamander317 commented 4 months ago

Hi @tianleiwu,

I ended up separating out the code that was causing the errors: our model is a RetinaNet, and we had the decoding / NMS step wrapped into the model architecture itself. Once I separated that out, the model compiled. I'm now confused by the timing stats I'm seeing: there are no speed gains for any combination of compilation method (FP32, FP16, quantization), hardware (GPU or CPU), or provider (CPU, CUDA, or TensorRT). I've looked through the performance documentation and it doesn't seem like there's anything left for me to try. Should I be seeing speed gains here?

For reference, here's a table with the relative timings that show no difference (these times were taken using the script pasted below):

Model                                  Prediction time per input (seconds)
Torch FP32                             0.0905
ONNX FP32 on GPU                       0.1105
ONNX FP32 on TensorRT                  0.1055
ONNX FP32 on CPU                       0.6733
Torch FP16                             0.0456
ONNX FP16 on GPU                       0.0567
ONNX FP16 on CPU                       0.8547
ONNX FP16 on TensorRT                  0.0878
ONNX FP32 Converted to FP16 on GPU     0.0576
ONNX FP32 Quantized via ONNX on GPU    0.9828
ONNX FP32 Quantized via ONNX on CPU    0.6912

One note on reading the table: as you can see, the native PyTorch Torch FP32 run is the fastest of the FP32 options, and the native PyTorch Torch FP16 run is the fastest of the FP16 options.

Are there other tips and / or tricks that I'm missing here that would help us see speed gains, or is it to be expected that we might not see them here?
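
For reference, a typical per-input timing loop looks roughly like the sketch below (placeholder file name and dtype; this is an illustration, not the exact script mentioned above):

import time
import numpy as np
import onnxruntime as ort

# Placeholder model file and dtype.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
x = np.random.randn(1, 3, 1280, 896).astype(np.float32)
feed = {sess.get_inputs()[0].name: x}

for _ in range(5):        # warm-up runs pay one-off initialization / tuning costs
    sess.run(None, feed)

n = 100
start = time.perf_counter()
for _ in range(n):
    sess.run(None, feed)  # run() copies outputs back to the host, so it blocks until done
avg = (time.perf_counter() - start) / n
print(f"avg per input: {avg:.4f} s")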

tianleiwu commented 4 months ago

@sallamander317, RetinaNet might need the NHWC memory format to speed up. NHWC support in the CUDA provider is still in progress; hopefully we can enable it in the next release, v1.18.

For now, you might try optimizing the FP32 ONNX model into an optimized FP16 model like the following (GPU only):

python -m onnxruntime.transformers.optimizer --input fp32_model.onnx --output opt.onnx --model_type unet --use_gpu --opt_level 0 --float16

To compare with Torch, you might need to use I/O binding for a fair comparison. If the input shape is fixed, you can use CUDA graph to speed things up.

For I/O binding and cuda graph example, you can refer to the following class: https://github.com/microsoft/onnxruntime/blob/ed550b5fe5aa41e182db84d2b2f2fb768121fd7a/onnxruntime/python/tools/transformers/io_binding_helper.py#L211
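
For illustration, a minimal I/O binding sketch with the generic ORT Python API (file name, shape, and dtype are placeholders) that keeps the input and output on the GPU so host-device copies are not paid on every run:

import numpy as np
import onnxruntime as ort

# Placeholder model file, input shape, and dtype.
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Copy the input to the GPU once, outside any timing loop.
x = np.random.randn(1, 3, 1280, 896).astype(np.float32)
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = sess.io_binding()
binding.bind_ortvalue_input(sess.get_inputs()[0].name, x_gpu)
binding.bind_output(sess.get_outputs()[0].name, "cuda")  # let ORT allocate the output on the GPU

sess.run_with_iobinding(binding)
outputs = binding.copy_outputs_to_cpu()  # copy results back only when they are needed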

For the TensorRT EP, you might follow our stable diffusion demo for model optimization and provider option settings.
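
For example, a sketch of TensorRT EP session creation with a few common provider options (the option values here are illustrative, not settings taken from the demo):

import onnxruntime as ort

# Placeholder model file; option values are illustrative.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_fp16_enable": True,          # build FP16 TensorRT engines
        "trt_engine_cache_enable": True,  # cache built engines so later sessions start faster
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",              # fallback for nodes TensorRT does not take
]
sess = ort.InferenceSession("model.onnx", providers=providers)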

sallamander317 commented 4 months ago

@tianleiwu thanks for the suggestions here! Any idea when the v1.18 release is coming out?

I tried the python -m onnxruntime.transformers.optimizer command and got the following error when trying to load opt.onnx into an ort.InferenceSession:

[screenshot of the error attached]

Re: the CudaSession, this looks great; I can give that a shot. Do you have any documentation showing examples of how to use it? It looks pretty straightforward, but examples are always helpful if you have them.

I'll take a look at the stable diffusion demo too - thank you!