microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Failed to run inference session on 8bit quantized onnx model #6430

Open · shairoz-deci opened this issue 3 years ago

shairoz-deci commented 3 years ago

Describe the bug
Created an 8-bit quantized model following https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb and got

onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for the node Conv_0_quant:ConvInteger(10)

when trying to run an inference session.

System information

To Reproduce

  1. Convert a vanilla ResNet model to ONNX.
  2. Run quantize_onnx_model from the notebook above -> the quantized models are created properly and their file sizes are as expected (see the sketch after these steps).
  3. Run:

import onnxruntime

# Create a session over the quantized model with all graph optimizations enabled.
sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 1
session = onnxruntime.InferenceSession(model_path, sess_options)

This crashes with the exception above.
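For reference, step 2 likely reduces to a call like the one below, assuming the notebook's quantize_onnx_model wraps onnxruntime.quantization.quantize_dynamic; the paths here are hypothetical. Int8 weights combined with the default uint8 activations give exactly the u8s8 ConvInteger combination named in the error.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths. QInt8 weights + uint8 activations -> u8s8 ConvInteger,
# the combination reported as NOT_IMPLEMENTED above.
quantize_dynamic(
    "resnet.onnx",
    "resnet_quant.onnx",
    weight_type=QuantType.QInt8,
)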

Expected behavior
Running inference on the quantized model succeeds.

yufenglee commented 3 years ago

@shairoz-deci, for ConvInteger we have not yet added u8s8 (activation: uint8, weight: int8); currently only u8u8 is supported. In general, for CNN models it is recommended to use static quantization. Here is an example: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization/E2E_example_model/image_classification/cpu
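For concreteness, a minimal sketch of the static-quantization flow from that example, assuming the onnxruntime.quantization API; the model paths, input name, and shape are hypothetical, and a real calibration reader would feed representative images rather than random data.

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    # Supplies a few batches of calibration inputs; random data for brevity.
    def __init__(self, input_name, shape, num_batches=8):
        self.input_name = input_name
        self.shape = shape
        self.remaining = num_batches

    def get_next(self):
        if self.remaining == 0:
            return None  # no more calibration batches
        self.remaining -= 1
        return {self.input_name: np.random.rand(*self.shape).astype(np.float32)}

# Hypothetical paths and input name.
quantize_static(
    "resnet.onnx",
    "resnet_int8.onnx",
    RandomCalibrationReader("input", (1, 3, 224, 224)),
)

Static quantization precomputes activation scales from the calibration data instead of computing them at run time, which is why it is the recommended path for CNNs.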

xizexi commented 3 years ago

> @shairoz-deci, for ConvInteger we have not yet added u8s8 (activation: uint8, weight: int8); currently only u8u8 is supported. In general, for CNN models it is recommended to use static quantization. Here is an example: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization/E2E_example_model/image_classification/cpu

So, for BERT or Transformer models, is it recommended to use dynamic quantization?

yufenglee commented 3 years ago

> > @shairoz-deci, for ConvInteger we have not yet added u8s8 (activation: uint8, weight: int8); currently only u8u8 is supported. In general, for CNN models it is recommended to use static quantization. Here is an example: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization/E2E_example_model/image_classification/cpu
>
> So, for BERT or Transformer models, is it recommended to use dynamic quantization?

Dynamic quantization is easy to use. For most cases, dynamic quantization gives good accuracy for Transformer-based models and you don't need to retrain. You can also retrain with QAT (quantization-aware training) and then use static quantization.
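As a rough sketch of that dynamic-quantization path for a Transformer model, again assuming onnxruntime.quantization.quantize_dynamic with hypothetical paths; uint8 weights are chosen here to stay on the u8u8 combination mentioned above.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths. Dynamic quantization needs no calibration data:
# activation scales are computed at run time.
quantize_dynamic(
    "bert.onnx",
    "bert_quant.onnx",
    weight_type=QuantType.QUInt8,
)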

xizexi commented 3 years ago

Thanks! @yufenglee

bxing-groq commented 3 years ago

Is there a reason we don't support int8 activations? If I'd like to contribute, where should I start looking?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.