piedras77 opened this issue 2 years ago
@chilo-ms, can you help take a look? Thanks.
When the model is a QDQ model, you should not use a calibration table during inference, because the calibration info is already embedded in the Q/DQ nodes.
If I don't use a calibration table during inference, I get this error:
2022-09-26 17:34:09.396046595 [W:onnxruntime:Default, tensorrt_execution_provider.h:60 log] [2022-09-26 17:34:09 WARNING] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
2022-09-26 17:34:09.416590695 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2022-09-26 17:34:09 ERROR] 4: [standardEngineBuilder.cpp::initCalibrationParams::1420] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
Segmentation fault (core dumped)
Hi, quick question: is the QDQ model being generated successfully? Once the QDQ model is there, set only ORT_TENSORRT_FP16_ENABLE and ORT_TENSORRT_INT8_ENABLE and use this QDQ model to run inference.
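For reference, a minimal inference sketch assuming the environment-variable style of TensorRT EP configuration (the model path and input name/shape are placeholders):

```python
import os
import numpy as np
import onnxruntime as ort

# Enable INT8 (and optionally FP16) in the TensorRT EP. For a QDQ model,
# do NOT set ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME: the scales already
# live in the Q/DQ nodes. These variables must be set before the session
# is created.
os.environ["ORT_TENSORRT_INT8_ENABLE"] = "1"
os.environ["ORT_TENSORRT_FP16_ENABLE"] = "1"

sess = ort.InferenceSession(
    "model_qdq.onnx",  # placeholder: path to the generated QDQ model
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Placeholder input name/shape; use the model's real inputs.
outputs = sess.run(None, {"input_ids": np.zeros((1, 128), dtype=np.int64)})
```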
Hi,
Yes, the QDQ model is generated successfully. I am using the uint8_calibration.py script I shared above to generate this model.
Once the model is successfully generated, I set only the variables you mentioned, but I get the segmentation fault error I shared above.
I saw the script you shared, and it looks good. Is it possible to share the model and QDQ model so that I can reproduce on our side? You could use random fixed weights if privacy is a concern.
Could you check your QDQ model to see whether it contains any INT8 tensors? A QDQ model should not contain any INT8 ops other than Q/DQ.
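One quick way to check, as a minimal sketch using the `onnx` Python package (the model path is a placeholder):

```python
import onnx

model = onnx.load("model_qdq.onnx")  # placeholder: path to the QDQ model

# Names of INT8 initializers in the graph.
int8_tensors = {t.name for t in model.graph.initializer
                if t.data_type == onnx.TensorProto.INT8}

# Flag any non-Q/DQ node that consumes an INT8 initializer directly.
for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        continue
    bad = [name for name in node.input if name in int8_tensors]
    if bad:
        print(f"{node.op_type} ({node.name}) consumes INT8 tensor(s): {bad}")
```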
Shared the models offline. Also, I checked the models, and I don't see any specific INT8 tensors.
Looking at your script: when you quantize the model in QDQ mode for TensorRT, you also need to pass an additional option for symmetric activations, in the form of a dictionary. See the quantization call in https://github.com/microsoft/onnxruntime-inference-examples/blob/main/quantization/nlp/bert/trt/e2e_tensorrt_bert_example.py
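For reference, a minimal sketch of that call (the file paths and the `calibration_data_reader` object are placeholders; the key detail is `extra_options`):

```python
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

# `calibration_data_reader` is assumed to be the CalibrationDataReader
# implementation from your calibration script.
quantize_static(
    "model.onnx",       # placeholder: float32 input model
    "model_qdq.onnx",   # placeholder: quantized QDQ output model
    calibration_data_reader,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    # TensorRT requires symmetric quantization, hence this extra option:
    extra_options={"ActivationSymmetric": True},
)
```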
Describe the issue
I followed the tutorial at https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt to generate an int8 model. However, whenever I run inference, I get the following error:
2022-09-23 18:23:45.522261434 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2022-09-23 18:23:45 ERROR] Setting dynamic range is only allowed when there are no Q/DQ layers in the Network.
2022-09-23 18:23:45.522306939 [E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running TRTKernel_graph_torch-jit-export_16130601706149353436_1 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_16130601706149353436_1_0' Status Message: TensorRT EP failed to set INT8 dynamic range.
EP Error: [ONNXRuntimeError] : 11 : EP_FAIL : Non-zero status code returned while running TRTKernel_graph_torch-jit-export_16130601706149353436_1 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch-jit-export_16130601706149353436_1_0' Status Message: TensorRT EP failed to set INT8 dynamic range. using ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider']
Running on TensorRT is crucial for our application, due to performance requirements.
Calibration files: calibration.zip
I know these are two errors, but I would imagine the first one is causing the second. I am not explicitly setting any dynamic range, so I am not sure what the issue is.
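For context, a sketch of the kind of configuration that produces this pair of errors, namely a QDQ model run with a calibration table still supplied (the environment-variable names are from the TensorRT EP docs; the paths are placeholders):

```python
import os
import onnxruntime as ort

os.environ["ORT_TENSORRT_INT8_ENABLE"] = "1"
# Supplying a calibration table while the model already contains Q/DQ
# nodes makes the EP try to set dynamic ranges, which TensorRT rejects.
os.environ["ORT_TENSORRT_INT8_CALIBRATION_TABLE_NAME"] = "calibration.flatbuffers"  # placeholder

sess = ort.InferenceSession(
    "model_qdq.onnx",  # placeholder: QDQ model path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)
```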
To reproduce
Tutorial on: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/nlp/bert/trt
My scripts to generate the QDQ model follow the tutorial above. These are my own scripts (draft): calibration scripts.zip
I cannot share the calibration data.
Urgency
Blocking release, since better int8 performance is required
Platform
Linux
OS Version
18.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.12.1
ONNX Runtime API
Python
Architecture
X86
Execution Provider
TensorRT
Execution Provider Library Version
Build cuda_11.3.r11.3/compiler.29920130_0