sallamander317 opened 4 months ago
I guess your modeling script has some tensor whose shape is dynamic based on the input tensor. You can take an input tensor that triggered the ORT error, use torch to export a new ONNX model with that input, and then use the Netron tool to view the graph (focusing on the nodes that ORT reports errors on, like Expand_591 and Gather_2452).
You can also take a look at the TracerWarning messages emitted during torch ONNX export and modify your PyTorch modeling script to resolve those warnings.
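For illustration, a minimal re-export sketch along those lines (hypothetical names: `model` is the PyTorch module and `failing_input` is an input that triggered the ORT error; the `dynamic_axes` entries are assumptions about which dimensions actually vary):

```python
import torch

# Assumed placeholders: `model` is the PyTorch module, `failing_input` is an
# input tensor that triggered the ORT shape error at inference time.
model.eval()
with torch.no_grad():
    torch.onnx.export(
        model,
        failing_input,                      # export with the problematic input
        "debug_model.onnx",
        input_names=["input"],
        output_names=["output"],
        # Mark dimensions that genuinely vary; anything not listed is baked in
        # as a constant, which is a common source of Expand/Gather shape errors.
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
        opset_version=17,
    )
# Open debug_model.onnx in Netron and inspect the nodes ORT reported
# (e.g. Expand_591, Gather_2452) to see where the constant shape comes from.
```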
Hi @tianleiwu,
I ended up separating out the code that was causing the errors and was able to compile. Our model is a RetinaNet, and we had the decoding / NMS step wrapped into the actual model architecture; once I separated those out, the model compiled. I'm now confused about the timing stats that I'm seeing: there are no speed gains for any combination of compilation method (FP32, FP16, quantization), hardware (GPU or CPU), or provider (CPU, CUDA, or TensorRT). I've looked through the performance documentation and it doesn't seem like there's anything left for me to try. Should I be seeing speed gains here?
For reference, here's a table with the relative timings that show no difference (these times were taken using the script pasted below):
| Model | Prediction time per input (seconds) |
|---|---|
| Torch FP32 | 0.0905 |
| ONNX FP32 on GPU | 0.1105 |
| ONNX FP32 on TensorRT | 0.1055 |
| ONNX FP32 on CPU | 0.6733 |
| Torch FP16 | 0.0456 |
| ONNX FP16 on GPU | 0.0567 |
| ONNX FP16 on CPU | 0.8547 |
| ONNX FP16 on TensorRT | 0.0878 |
| ONNX FP32 Converted to FP16 on GPU | 0.0576 |
| ONNX FP32 Quantized via ONNX on GPU | 0.9828 |
| ONNX FP32 Quantized via ONNX on CPU | 0.6912 |
Some readability notes:
- `Torch FP32` / `Torch FP16` are the native `pytorch` FP32 and FP16 models.
- `ONNX FP32` / `ONNX FP16` are the `pytorch` FP32 or FP16 model compiled into ONNX (but with the precision as it was in `pytorch`).
- `ONNX FP32 Converted to FP16` is the `pytorch` FP32 model compiled to ONNX in FP32, but then converted to FP16 using ONNX's conversion pipeline.
- All ONNX timings were taken with an `onnxruntime.InferenceSession` session.

As you can see, the `pytorch` native `Torch FP32` is the fastest among the FP32 options, and the `pytorch` native `Torch FP16` is the fastest among the FP16 options.
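For context on how these numbers were gathered: the original timing script isn't reproduced here, but a minimal per-input timing loop along these lines would look like the sketch below (model path, provider, and input shape are placeholders, not the actual model's):

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder model path, provider, and input shape; not the original script.
session = ort.InferenceSession("model_fp32.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
sample = np.random.rand(1, 3, 800, 800).astype(np.float32)

# Warm up so one-time CUDA initialization is not counted in the average.
for _ in range(10):
    session.run(None, {input_name: sample})

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    session.run(None, {input_name: sample})  # run() blocks until outputs are ready
per_input = (time.perf_counter() - start) / n_runs
print(f"Prediction time per input: {per_input:.4f} s")
```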
Are there other tips and / or tricks that I'm missing here that would help us see speed gains, or is it to be expected that we might not see them here?
@sallamander317, RetinaNet might need the NHWC memory format to speed up. Currently, NHWC for the CUDA provider is still in progress, and hopefully we can enable NHWC in the next release, v1.18.
For now, you might try optimizing the FP32 ONNX model into an optimized FP16 model like the following (GPU only):
```
python -m onnxruntime.transformers.optimizer --input fp32_model.onnx --output opt.onnx --model_type unet --use_gpu --opt_level 0 --float16
```
To compare with Torch, you might need to use I/O binding for a fair comparison. If the input shape is fixed, you can use CUDA graph to speed things up.
For an I/O binding and CUDA graph example, you can refer to the following class: https://github.com/microsoft/onnxruntime/blob/ed550b5fe5aa41e182db84d2b2f2fb768121fd7a/onnxruntime/python/tools/transformers/io_binding_helper.py#L211
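For illustration, here is a minimal I/O-binding sketch using the plain `onnxruntime` Python API rather than the helper class above (model path and input shape are placeholders):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name

# Put the input on the GPU up front so the timed call does not include the
# host-to-device copy; let ORT allocate the output on the GPU as well.
x = np.random.rand(1, 3, 800, 800).astype(np.float32)   # placeholder shape
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input(input_name, x_gpu)
binding.bind_output(output_name, "cuda", 0)

session.run_with_iobinding(binding)
outputs = binding.copy_outputs_to_cpu()  # copy back to host only when needed
```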
For the TensorRT EP, you might follow our stable diffusion demo for model optimization and provider option settings.
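As a rough sketch of the provider-option side (these particular values are illustrative assumptions, not the demo's exact settings):

```python
import onnxruntime as ort

# Example TensorRT EP options: the cache settings avoid rebuilding engines on
# every process start, and trt_fp16_enable lets TensorRT run in FP16 where possible.
providers = [
    ("TensorrtExecutionProvider", {
        "device_id": 0,
        "trt_fp16_enable": True,
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "./trt_cache",   # placeholder path
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
```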
@tianleiwu thanks for the suggestions here! Any idea when the v1.18 release is coming out?

I tried the `python -m onnxruntime.transformers.optimizer` command and ended up getting the following error when trying to load the `opt.onnx` into an `ort.InferenceSession`:
RE the `CudaSession` - this looks great, I can give that a shot. Do you have any documentation showing examples of how to use it? It looks pretty straightforward, but examples are always helpful if you have them.
I'll take a look at the stable diffusion demo too - thank you!
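(In case it helps later readers: from reading the linked `io_binding_helper.py`, usage appears to be roughly the sketch below; the constructor and method signatures should be double-checked against the pinned commit, and the input name and shape here are placeholders.)

```python
import torch
import onnxruntime as ort
from io_binding_helper import CudaSession  # from onnxruntime/python/tools/transformers

# Rough sketch based on reading the linked helper; verify signatures against
# the pinned commit before relying on this.
device = torch.device("cuda:0")
ort_session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
session = CudaSession(ort_session, device, enable_cuda_graph=False)

# Pre-allocate device buffers for a fixed input shape (placeholder name/shape).
session.allocate_buffers({"input": (1, 3, 800, 800)})

# Feed torch tensors that are already on the GPU; outputs come back as torch tensors.
outputs = session.infer({"input": torch.rand(1, 3, 800, 800, device=device)})
```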
Describe the issue
I have a script that compiles a `pytorch` model to ONNX and then runs inference with the ONNX model. When running inference on the GPU, it intermittently fails with the error:

Some additional notes:
- The error always has the `Expand_591` node called out, and the `RightShape {267}` remains the same. However, the `LeftShape` (243 in the error example above) changes.
- When running with the `CPUExecutionProvider`, we ran into this error 100% of the time:

To reproduce
Script I'm using to test (with private details removed):
Urgency
Urgent at the 1 month time frame - we're blocked in deploying an ONNX model by this.
Platform
Linux
OS Version
Amazon Linux 2
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16.3
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
12.2
Model File
Can't share it for privacy reasons
Is this a quantized model?
No