microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Failed to run inference session on 8bit quantized onnx model #6430

Open · shairoz-deci opened this issue 3 years ago

shairoz-deci commented 3 years ago

Describe the bug
Created an 8-bit quantized model following https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/quantization/notebooks/Bert-GLUE_OnnxRuntime_quantization.ipynb and got

onnxruntime.capi.onnxruntime_pybind11_state.NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for the node Conv_0_quant:ConvInteger(10)

when trying to run an inference session.

System information

To Reproduce

  1. Convert a vanilla ResNet model to ONNX.
  2. Run quantize_onnx_model from the notebook above -> the quantized models are created properly and their file sizes are as expected (see the sketch after these steps).
  3. Run:

import onnxruntime

# Create a session over the quantized model with all graph optimizations enabled.
sess_options = onnxruntime.SessionOptions()
sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.intra_op_num_threads = 1
session = onnxruntime.InferenceSession(model_path, sess_options)

This crashes with the exception above.
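For reference, step 2 likely reduces to a call like the one below, assuming the notebook's quantize_onnx_model wraps onnxruntime.quantization.quantize_dynamic; the paths here are hypothetical. Int8 weights combined with the default uint8 activations give exactly the u8s8 ConvInteger combination named in the error.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths. QInt8 weights + uint8 activations -> u8s8 ConvInteger,
# the combination reported as NOT_IMPLEMENTED above.
quantize_dynamic(
    "resnet.onnx",
    "resnet_quant.onnx",
    weight_type=QuantType.QInt8,
)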

Expected behavior
Running inference on the quantized model succeeds.

yufenglee commented 3 years ago

@shairoz-deci, for ConvInteger we have not yet added u8s8 (activation: uint8, weight: int8); currently only u8u8 is supported. In general, for CNN models it is recommended to use static quantization. Here is an example: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization/E2E_example_model/image_classification/cpu
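For concreteness, a minimal sketch of the static-quantization flow from that example, assuming the onnxruntime.quantization API; the model paths, input name, and shape are hypothetical, and a real calibration reader would feed representative images rather than random data.

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    # Supplies a few batches of calibration inputs; random data for brevity.
    def __init__(self, input_name, shape, num_batches=8):
        self.input_name = input_name
        self.shape = shape
        self.remaining = num_batches

    def get_next(self):
        if self.remaining == 0:
            return None  # no more calibration batches
        self.remaining -= 1
        return {self.input_name: np.random.rand(*self.shape).astype(np.float32)}

# Hypothetical paths and input name.
quantize_static(
    "resnet.onnx",
    "resnet_int8.onnx",
    RandomCalibrationReader("input", (1, 3, 224, 224)),
)

Static quantization precomputes activation scales from the calibration data instead of computing them at run time, which is why it is the recommended path for CNNs.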

xizexi commented 3 years ago

> @shairoz-deci, for ConvInteger we have not yet added u8s8 (activation: uint8, weight: int8); currently only u8u8 is supported. In general, for CNN models it is recommended to use static quantization. Here is an example: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization/E2E_example_model/image_classification/cpu

So, for BERT or Transformer models, is it recommended to use dynamic quantization?

yufenglee commented 3 years ago

> > @shairoz-deci, for ConvInteger we have not yet added u8s8 (activation: uint8, weight: int8); currently only u8u8 is supported. In general, for CNN models it is recommended to use static quantization. Here is an example: https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/quantization/E2E_example_model/image_classification/cpu
>
> So, for BERT or Transformer models, is it recommended to use dynamic quantization?

Dynamic quantization is easy to use. For most cases, dynamic quantization gives good accuracy for Transformer-based models and you don't need to retrain. You can also retrain with QAT (quantization-aware training) and then use static quantization.
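As a rough sketch of that dynamic-quantization path for a Transformer model, again assuming onnxruntime.quantization.quantize_dynamic with hypothetical paths; uint8 weights are chosen here to stay on the u8u8 combination mentioned above.

from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths. Dynamic quantization needs no calibration data:
# activation scales are computed at run time.
quantize_dynamic(
    "bert.onnx",
    "bert_quant.onnx",
    weight_type=QuantType.QUInt8,
)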

xizexi commented 3 years ago

Thanks! @yufenglee

bxing-groq commented 3 years ago

Is there a reason we don't support int8 activations? If I'd like to contribute, where should I start looking?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale due to inactivity and will be closed in 7 days if no further activity occurs. If further support is needed, please provide an update and/or more details.