microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

TensorRT EP failed to set INT8 dynamic range. #13071

Open piedras77 opened 2 years ago

piedras77 commented 2 years ago

Describe the issue

I followed the tutorial at https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt to generate an int8 model. However, whenever I run inference, I get the following error:

2022-09-23 18:23:45.522261434 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2022-09-23 18:23:45 ERROR] Setting dynamic range is only allowed when there are no Q/DQ layers in the Network.
2022-09-23 18:23:45.522306939 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running TRTKernel_graph_torch-jit-export_16130601706149353436_1 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_16130601706149353436_1_0' Status Message: TensorRT EP failed to set INT8 dynamic range.
EP Error: [ONNXRuntimeError] : 11 : EP_FAIL : Non-zero status code returned while running TRTKernel_graph_torch-jit-export_16130601706149353436_1 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_16130601706149353436_1_0' Status Message: TensorRT EP failed to set INT8 dynamic range. using ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']

Running on TensorRT is crucial for our application, due to performance requirements.

Calibration files: calibration.zip

I know these are two errors, but I would imagine the first one is causing the second one. I am not explicitly setting any dynamic range myself, so I am not sure what the issue is.

To reproduce

Tutorial on: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt

My scripts to generate the QDQ model follow the tutorial above. These are my own scripts (draft): calibration scripts.zip

I cannot share the calibration data
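For reference, here is a rough sketch of the kind of QDQ-generation flow my scripts follow. The tutorial's script builds its quantizer with lower-level calibration helpers; this sketch only approximates the same flow with the public quantize_static API, and the model paths, input names, and calibration reader below are placeholders, not my real code:

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)

class BertCalibrationReader(CalibrationDataReader):
    """Feeds calibration batches to the quantizer (placeholder implementation)."""

    def __init__(self, batches):
        # batches: list of {input_name: np.ndarray} dicts built from calibration data
        self._iter = iter(batches)

    def get_next(self):
        return next(self._iter, None)

# Placeholder calibration batch; the real ones come from the (private) dataset.
calibration_batches = [
    {"input_ids": np.zeros((1, 128), dtype=np.int64),
     "attention_mask": np.ones((1, 128), dtype=np.int64),
     "token_type_ids": np.zeros((1, 128), dtype=np.int64)}
]

quantize_static(
    "bert_fp32.onnx",                 # placeholder input model path
    "bert_qdq_int8.onnx",             # placeholder output model path
    calibration_data_reader=BertCalibrationReader(calibration_batches),
    quant_format=QuantFormat.QDQ,     # emit QuantizeLinear/DequantizeLinear nodes
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```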

Urgency

Blocking release, since better int8 performance is required

Platform

Linux

OS Version

18.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

Python

Architecture

X86

Execution Provider

TensorRT

Execution Provider Library Version

Build cuda_11.3.r11.3/compiler.29920130_0

jywu-msft commented 2 years ago

@chilo-ms, can you help take a look? Thanks.

stevenlix commented 2 years ago

When the model is a QDQ model, you should not use a calibration table during inference, because the calibration info is already embedded in the Q/DQ nodes.
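For example, a minimal sketch of creating the session that way, with INT8 enabled on the TensorRT EP and no calibration table. Provider options are shown here as an alternative to the ORT_TENSORRT_* environment variables, and the model path is a placeholder:

```python
import onnxruntime as ort

# Run a QDQ model on the TensorRT EP with INT8 enabled and no calibration table.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_int8_enable": True,
        "trt_fp16_enable": True,
        # Do NOT set trt_int8_calibration_table_name for a QDQ model;
        # the Q/DQ nodes already carry the scales.
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

session = ort.InferenceSession("bert_qdq_int8.onnx", providers=providers)  # placeholder path
```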

piedras77 commented 2 years ago

If I don't use a calibration table during inference, I get this error:

2022-09-26 17:34:09.396046595 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-09-26 17:34:09 WARNING] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
2022-09-26 17:34:09.416590695 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2022-09-26 17:34:09 ERROR] 4: [standardEngineBuilder.cpp::initCalibrationParams::1420] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
Segmentation fault (core dumped)

chilo-ms commented 2 years ago

Hi, quick question here: is the QDQ model generated successfully? Once the QDQ model is there, set only ORT_TENSORRT_FP16_ENABLE and ORT_TENSORRT_INT8_ENABLE and use this QDQ model to run the inference.
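A minimal sketch of that setup, assuming the environment variables are set in the same process before the session is created (the model path is a placeholder):

```python
import os

# The TensorRT EP reads the ORT_TENSORRT_* variables when the session is created,
# so set them before constructing the InferenceSession.
os.environ["ORT_TENSORRT_FP16_ENABLE"] = "1"
os.environ["ORT_TENSORRT_INT8_ENABLE"] = "1"
# For a QDQ model, do not point ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME at a table.

import onnxruntime as ort

session = ort.InferenceSession(
    "bert_qdq_int8.onnx",  # placeholder path to the QDQ model
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
```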

piedras77 commented 2 years ago

Hi,

Yes, the QDQ model is generated successfully. I am using the uint8_calibration.py script I shared above to generate this model.

Once the model is successfully generated, I only set the variables you mentioned, but get the Segmentation fault error I shared above.

chilo-ms commented 2 years ago

I saw the script you shared, and it looks good. Is it possible to share the original model and the QDQ model so that I can reproduce it on our side? Perhaps use random fixed weights if there is a privacy concern.

stevenlix commented 2 years ago

Could you check your QDQ model to see if there are any INT8 tensors? A QDQ model should not contain any INT8 ops other than Q/DQ.
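For example, a quick sketch using the onnx Python package that lists any nodes other than QuantizeLinear/DequantizeLinear producing INT8 tensors (the model path is a placeholder):

```python
import onnx

# Scan a QDQ model for INT8 tensors produced by ops other than Q/DQ.
model = onnx.load("bert_qdq_int8.onnx")               # placeholder path
model = onnx.shape_inference.infer_shapes(model)

int8_value_names = {
    vi.name
    for vi in list(model.graph.value_info) + list(model.graph.output)
    if vi.type.tensor_type.elem_type == onnx.TensorProto.INT8
}

for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        continue
    int8_outputs = [o for o in node.output if o in int8_value_names]
    if int8_outputs:
        print(f"{node.op_type} ({node.name}) produces INT8 tensors: {int8_outputs}")
```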

piedras77 commented 2 years ago

Shared the models offline. Also, I checked the models, and I don't see any specific INT8 tensors.

lolwarmaze commented 1 year ago

> Hi,
>
> Yes, the QDQ model is generated successfully. I am using the uint8_calibration.py script I shared above to generate this model.
>
> Once the model is successfully generated, I only set the variables you mentioned, but get the Segmentation fault error I shared above.

Looking at your script, when you quantize the model in QDQ mode for TensorRT, you also need to provide an additional parameter for symmetric activations in the form of a dictionary. Check the quantization function in https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/nlp/bert/trt/e2e_tensorrt_bert_example.py
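For reference, a minimal sketch of what that looks like with the public quantize_static API; the example script linked above constructs its quantizer somewhat differently, and the paths and data_reader below are placeholders:

```python
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

# Force symmetric activation quantization, which TensorRT expects for INT8 Q/DQ models.
quantize_static(
    "bert_fp32.onnx",                             # placeholder input model path
    "bert_qdq_int8.onnx",                         # placeholder output model path
    calibration_data_reader=data_reader,          # a CalibrationDataReader instance
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    extra_options={"ActivationSymmetric": True},  # the symmetric-activation flag
)
```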