mengniwang95 opened this issue 2 years ago
TRT quantization on GPU is a little different from CPU quantization. Please refer to these examples:
https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/object_detection/trt/yolov3/e2e_user_yolov3_example.py
https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/image_classification/trt/resnet50/e2e_tensorrt_resnet_example.py
For CNN models, the calibration approach should be good enough and no QDQ model is needed, as shown in the examples.
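For reference, the calibration-based flow in those examples ends with the TensorRT EP consuming a calibration table directly. A minimal sketch of that last step is below; the model path and calibration table file name are placeholders, and passing the options as TensorRT EP provider options is one way to do it (the linked examples use the equivalent ORT_TENSORRT_* environment variables):

import onnxruntime as ort

# Sketch: run an INT8-calibrated (non-QDQ) model on the TensorRT EP.
# "resnet50.onnx" and "calibration.flatbuffers" are placeholder names.
trt_options = {
    "trt_int8_enable": True,  # enable INT8 in the TensorRT builder
    "trt_int8_calibration_table_name": "calibration.flatbuffers",
    "trt_engine_cache_enable": True,
}
session = ort.InferenceSession(
    "resnet50.onnx",
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)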
Hi, I also tried the examples you shared and they run successfully. I want to learn how to generate a QDQ model that can run with the TRT EP. Does it have any restrictions?
Describe the issue
2022-10-20 09:21:09.531367276 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2022-10-20 09:21:09 ERROR] 4: [network.cpp::validate::2891] Error Code 4: Internal Error (Int8 precision has been set for a layer or layer output, but int8 is not configured in the builder)
Exception
Traceback (most recent call last):
  File "/workspace/mengniwa/test/tools/transformers/benchmark_helper.py", line 135, in create_onnxruntime_session
    session = InferenceSession(onnx_model_path, sess_options, providers=providers)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 347, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 395, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.EPFail: [ONNXRuntimeError] : 11 : EP_FAIL : TensorRT EP could not build engine for fused node: TensorrtExecutionProvider_TRTKernel_graph_mxnet_converted_model_6241742266258321209_75_0
To reproduce
https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/image_classification/cpu
Generate a QDQ ResNet-50 model following the above link (the generated model is too large to upload). In the quantize_static API I use the QDQ quant format, per-tensor quantization, and int8 weights.
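The quantization call described above would look roughly like this. It is a minimal sketch: the model paths are placeholders and the ResNet50DataReader class is a hypothetical stand-in for the calibration data reader used in the linked example.

from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

class ResNet50DataReader(CalibrationDataReader):
    """Hypothetical calibration reader; yields {input_name: ndarray} dicts."""
    def __init__(self, batches):
        self._iter = iter(batches)
    def get_next(self):
        return next(self._iter, None)

# QDQ format, per-tensor (per_channel=False), int8 weights, as described above.
# Real preprocessed image batches must be supplied to the data reader.
quantize_static(
    model_input="resnet50_fp32.onnx",        # placeholder path
    model_output="resnet50_qdq_int8.onnx",   # placeholder path
    calibration_data_reader=ResNet50DataReader(batches=[]),
    quant_format=QuantFormat.QDQ,
    per_channel=False,
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
)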
Then I build ORT with the TRT EP enabled from the latest main branch, and use https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/profiler.py to run the test, with all optimizations disabled.
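The reproduction step amounts to something like the following sketch; the model path is a placeholder, and disabling graph optimizations mirrors the "disable all optimization" setting described above. This session creation is where the EP_FAIL error in the traceback is raised.

import onnxruntime as ort

sess_options = ort.SessionOptions()
# Disable all ORT graph optimizations, as in the profiler run described above.
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL

# Load the QDQ model with the TensorRT EP first in the provider list.
session = ort.InferenceSession(
    "resnet50_qdq_int8.onnx",  # placeholder path
    sess_options,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)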
Urgency
urgent
Platform
Other / Unknown
OS Version
https://docs.nvidia.com/deeplearning/tensorrt/container-release-notes/rel_22-07.html#rel_22-07
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
67150baa8d50e74f3a8a7b8e679a8d31eae4c0ed
ONNX Runtime API
Python
Architecture
X86
Execution Provider
TensorRT
Execution Provider Library Version
No response