namtr92 opened this issue 1 year ago
@yf711 can you help take a look?
The issue is confirmed and I can repro it in my local env.
I also tried trtexec --int8 --onnx=model.onnx --saveEngine=model.trt and it passed.
Will check why this chooseHigherPrecision was executed when trt_int8_enable was selected.
Hi @namtr92, could you try this workaround of disabling ORT graph optimization when creating the session?
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0',
                       'trt_int8_enable': True,
                       'trt_engine_cache_enable': True
                       }]
)
According to the context shared by Nvidia, there may be an overlap between ORT's graph optimization and TRT's QDQ optimization. I will check whether the QDQ graph optimizations can be skipped when the TRT EP is selected.
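If disabling all ORT optimizations turns out to be heavier than needed, one variant to experiment with (untested on my side, so treat it as an assumption rather than a verified fix) is keeping only the basic optimization level:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
# Untested alternative to ORT_DISABLE_ALL: keep only the basic, provider-independent
# optimizations; whether this is enough to avoid the overlap with TRT's QDQ handling
# is an open question.
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0',
                       'trt_int8_enable': True,
                       'trt_engine_cache_enable': True}]
)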
Thanks for your help, I can now use the TensorRT EP normally!
Describe the issue
Hi, I am using ONNX Runtime with the TensorRT Execution Provider for a quantized model (YOLO-NAS). While the TensorRT CLI (trtexec.exe) successfully builds an engine from the ONNX model, ONNX Runtime with the TensorRT Execution Provider cannot build the engine file. Here is the output of the TensorRT Execution Provider:
2023-08-28 14:32:23.7685241 [W:onnxruntime:Default, tensorrt_execution_provider.h:75 onnxruntime::TensorrtLogger::log] [2023-08-28 07:32:23 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
2023-08-28 14:32:29.8989822 [W:onnxruntime:Default, tensorrt_execution_provider.h:75 onnxruntime::TensorrtLogger::log] [2023-08-28 07:32:29 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
2023-08-28 14:32:29.9544855 [W:onnxruntime:Default, tensorrt_execution_provider.cc:1934 onnxruntime::TensorrtExecutionProvider::Compile] [TensorRT EP] Try to use DLA core, but platform doesn't have any DLA core
2023-08-28 14:32:30.0863651 [E:onnxruntime:Default, tensorrt_execution_provider.h:73 onnxruntime::TensorrtLogger::log] [2023-08-28 07:32:30 ERROR] 2: [graphOptimizer.cpp::nvinfer1::builder::`anonymous-namespace'::chooseHigherPrecision::7174] Error Code 2: Internal Error (Assertion dt1 == DataType::kFLOAT || dt1 == DataType::kHALF failed. )
To reproduce
import onnxruntime

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0', 'trt_int8_enable': True}]
)
Link to download the model: https://drive.google.com/file/d/1ZxT2wCU0bIjYrKmhDuFgKRew1kR9hbRd/view?usp=sharing
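For completeness, a fuller repro sketch that also triggers engine build via an inference call; the 1x3x640x640 float32 input shape is an assumption for this YOLO-NAS export, so verify it against get_inputs():

import numpy as np
import onnxruntime

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0', 'trt_int8_enable': True}]
)

# Assumed input shape/dtype for the quantized YOLO-NAS export; check the real
# name and shape with ort_session.get_inputs().
inp = ort_session.get_inputs()[0]
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = ort_session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])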
Urgency
Very Urgent
Platform
Windows
OS Version
11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 11.6, CUDNN 8.9.0, TENSORRT 8.6.1