namtr92 opened this issue 1 year ago
@yf711 can you help take a look?
The issue is confirmed and I can repro it in my local env.
I also tried trtexec --int8 --onnx=model.onnx --saveEngine=model.trt and it passed.
Will check why this chooseHigherPrecision was executed when trt_int8_enable was selected.
Hi @namtr92, could you try this workaround of disabling ORT graph optimization when creating the session?
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0',
                       'trt_int8_enable': True,
                       'trt_engine_cache_enable': True
                       }]
)
According to the context shared by Nvidia, there may be an overlap between ORT's graph optimization and TRT's QDQ optimization. I will check whether the QDQ graph optimizations can be skipped when the TRT EP is selected.
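If disabling all ORT optimizations turns out to be heavier than needed, one variant to experiment with (untested on my side, so treat it as an assumption rather than a verified fix) is keeping only the basic optimization level:

import onnxruntime

sess_options = onnxruntime.SessionOptions()
# Untested alternative to ORT_DISABLE_ALL: keep only the basic, provider-independent
# optimizations; whether this is enough to avoid the overlap with TRT's QDQ handling
# is an open question.
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    sess_options,
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0',
                       'trt_int8_enable': True,
                       'trt_engine_cache_enable': True}]
)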
Thanks for your help, I can now use the TensorRT EP normally!
Describe the issue
Hi, I am using ONNX Runtime with the TensorRT Execution Provider for a quantized model (YOLO-NAS). While the TensorRT CLI (trtexec.exe) successfully builds an engine from the ONNX model, ONNX Runtime with the TensorRT Execution Provider cannot build the engine file. Here is the output of the TensorRT Execution Provider:
2023-08-28 14:32:23.7685241 [W:onnxruntime:Default, tensorrt_execution_provider.h:75 onnxruntime::TensorrtLogger::log] [2023-08-28 07:32:23 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
2023-08-28 14:32:29.8989822 [W:onnxruntime:Default, tensorrt_execution_provider.h:75 onnxruntime::TensorrtLogger::log] [2023-08-28 07:32:29 WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
2023-08-28 14:32:29.9544855 [W:onnxruntime:Default, tensorrt_execution_provider.cc:1934 onnxruntime::TensorrtExecutionProvider::Compile] [TensorRT EP] Try to use DLA core, but platform doesn't have any DLA core
2023-08-28 14:32:30.0863651 [E:onnxruntime:Default, tensorrt_execution_provider.h:73 onnxruntime::TensorrtLogger::log] [2023-08-28 07:32:30 ERROR] 2: [graphOptimizer.cpp::nvinfer1::builder::`anonymous-namespace'::chooseHigherPrecision::7174] Error Code 2: Internal Error (Assertion dt1 == DataType::kFLOAT || dt1 == DataType::kHALF failed. )
To reproduce
import onnxruntime

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0', 'trt_int8_enable': True}]
)
Link to download the model: https://drive.google.com/file/d/1ZxT2wCU0bIjYrKmhDuFgKRew1kR9hbRd/view?usp=sharing
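For completeness, a fuller repro sketch that also triggers engine build via an inference call; the 1x3x640x640 float32 input shape is an assumption for this YOLO-NAS export, so verify it against get_inputs():

import numpy as np
import onnxruntime

ort_session = onnxruntime.InferenceSession(
    "model.onnx",
    providers=['TensorrtExecutionProvider'],
    provider_options=[{'device_id': '0', 'trt_int8_enable': True}]
)

# Assumed input shape/dtype for the quantized YOLO-NAS export; check the real
# name and shape with ort_session.get_inputs().
inp = ort_session.get_inputs()[0]
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
outputs = ort_session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])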
Urgency
Very Urgent
Platform
Windows
OS Version
11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.15.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
TensorRT
Execution Provider Library Version
CUDA 11.6, CUDNN 8.9.0, TENSORRT 8.6.1