[TensorRT EP] OOM (RAM) when loading ONNX model #21219

Closed: MiroPsota closed this issue 19 hours ago

MiroPsota commented 1 week ago

Describe the issue

Out of memory (RAM) when loading the model; even 50 GiB is not enough.

To reproduce

The model is RTMDet from here. The exported ONNX model has a TensorRT-specific NonMaximumSuppression node from here (probably slightly changed).

The mmdetection model without the TensorRT-specific NMS op runs without problems on the CPU EP, CUDA EP, and OpenVINO EP (OpenVINO tested with the 1.17.0 PyPI package).
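For reference, a cross-EP sanity check of the plain model might look like the sketch below. This is only an illustration: the file name `rtmdet_plain.onnx` and the 1x3x640x640 input shape are assumptions, not taken from the attached scripts.

```python
# Sanity-check sketch: run the plain (no TRT NMS) export on several EPs.
# "rtmdet_plain.onnx" and the 1x3x640x640 input shape are assumptions.
import numpy as np
import onnxruntime as ort

dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)

for providers in (["CPUExecutionProvider"],
                  ["CUDAExecutionProvider", "CPUExecutionProvider"],
                  ["OpenVINOExecutionProvider", "CPUExecutionProvider"]):
    sess = ort.InferenceSession("rtmdet_plain.onnx", providers=providers)
    feed = {sess.get_inputs()[0].name: dummy}
    outs = sess.run(None, feed)
    print(providers[0], [o.shape for o in outs])
```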

- CUDA 11.8.0: https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
- cuDNN 8.9.2.26: https://developer.nvidia.com/downloads/compute/cudnn/secure/8.9.2/local_installers/11.x/cudnn-linux-x86_64-8.9.2.26_cuda11-archive.tar.xz/
- TensorRT 10.0.1.6: https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.0.1/tars/TensorRT-10.0.1.6.Linux.x86_64-gnu.cuda-11.8.tar.gz

Code and model are included in the zip file, together with the compiled custom op (Ubuntu 24.04, GCC 11.4). Tested with Python 3.10. Run run.py with LD_LIBRARY_PATH pointing at the correct paths (the ld_library.py script can help); a rough sketch of the session setup follows below.
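The sketch below shows roughly what the session creation looks like; the file names `libtrt_nms_op.so` and `rtmdet_trt_nms.onnx` are hypothetical, and the actual invocation is in the attached run.py. The OOM happens while the InferenceSession is being constructed.

```python
# Sketch of loading the TRT-NMS model with the TensorRT EP.
# File names are placeholders; the real script is run.py in the zip.
import onnxruntime as ort

so = ort.SessionOptions()
# Register the compiled custom-op shared library before creating the session.
so.register_custom_ops_library("./libtrt_nms_op.so")  # hypothetical name

# The OOM occurs here, during session construction.
sess = ort.InferenceSession(
    "rtmdet_trt_nms.onnx",  # hypothetical name
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider",
               "CPUExecutionProvider"],
)
```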

If needed, I could try to make a docker image that reproduces the bug.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 24.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

TensorRT

Execution Provider Library Version

CUDA 11.8.0, CUDNN 8.9.2.26, TensorRT 10.0.1.6

yf711 commented 6 days ago

Hi @MiroPsota, thanks for bringing up this issue! Could the model with the NMS op run on a previous version of ONNX Runtime + TensorRT? Also, could you share the standard model (without NMS) that you tested on the CPU/CUDA/OpenVINO EPs?

MiroPsota commented 6 days ago

Here is a zip with all the models and an updated run.py.

With 1.18.0 and 1.18.1, the mentioned OOM problem occurs.

For the 1.17.1 tests (ORT GPU PyPI package) I used TensorRT 8.6.1.6 from here. The OOM doesn't occur, but another error does (op not implemented), and ORT adds unwanted copy operations to the host and back. See the log. I will investigate further.
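In case it helps triage, those inserted copies can be made visible with verbose logging; a sketch (file names are placeholders, as above):

```python
# Sketch: verbose logging prints per-node EP assignment, which shows the
# MemcpyToHost/MemcpyFromHost nodes ORT inserts around unsupported ops.
import onnxruntime as ort

ort.set_default_logger_severity(0)  # 0 = VERBOSE

so = ort.SessionOptions()
so.register_custom_ops_library("./libtrt_nms_op.so")  # hypothetical name

sess = ort.InferenceSession(
    "rtmdet_trt_nms.onnx",  # hypothetical name
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider",
               "CPUExecutionProvider"],
)
```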

The ONNX model for TensorRT runs without problems with mmdeploy, which uses TensorRT directly. One difference is that TRTBatchedNMS originally had a different domain, which I changed to trt.plugins according to the docs so it can run in ORT.
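The domain change itself is mechanical; a minimal sketch with the onnx Python API (input/output file names are placeholders):

```python
# Sketch: move TRTBatchedNMS into the trt.plugins domain so the TensorRT EP
# can treat it as a plugin op. File names are placeholders.
import onnx
from onnx import helper

model = onnx.load("rtmdet_mmdeploy.onnx")

for node in model.graph.node:
    if node.op_type == "TRTBatchedNMS":
        node.domain = "trt.plugins"

# Declare the custom domain in the model's opset imports if it is missing.
if not any(op.domain == "trt.plugins" for op in model.opset_import):
    model.opset_import.append(helper.make_opsetid("trt.plugins", 1))

onnx.save(model, "rtmdet_trt_nms.onnx")
```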

MiroPsota commented 19 hours ago

I think the problem is with TensorRT 10.0, probably the same as mentioned here.

I solved it by compiling ORT 1.18.1 with TRT 8.6.1 and --use_tensorrt_oss_parser. More info here.