microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

The effect of turning optimization on and off on quantized model performance #11576

Open yeliang2258 opened 2 years ago

yeliang2258 commented 2 years ago

Describe the bug

I have a quantized model. When all optimizations are turned on, the accuracy drops by 5 points, but when all optimizations are turned off, the accuracy does not drop. What could be the cause? Looking forward to your reply.

System information

To Reproduce

When all optimizations are turned on, the accuracy drops by 5 points:

import onnxruntime as rt

onnx_model = "mobilenet_onnx_quant_model.onnx"
sess_options = rt.SessionOptions()  # default: all graph optimizations enabled
sess_options.optimized_model_filepath = "./optimize_model.onnx"
sess = rt.InferenceSession(onnx_model, sess_options, providers=['CPUExecutionProvider'])

When all optimizations are turned off, the accuracy does not drop:

import onnxruntime as rt

onnx_model = "mobilenet_onnx_quant_model.onnx"
sess_options = rt.SessionOptions()
sess_options.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL  # turn off all optimizations
sess_options.optimized_model_filepath = "./optimize_model.onnx"
sess = rt.InferenceSession(onnx_model, sess_options, providers=['CPUExecutionProvider'])
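To quantify the gap between the two configurations, a minimal comparison sketch (not part of the original report) that runs one random input through both sessions is shown below; the input shape is an assumption for a MobileNet-style model.

# Sketch only (assumed input shape): compare outputs with and without optimization.
import numpy as np
import onnxruntime as rt

onnx_model = "mobilenet_onnx_quant_model.onnx"

opt_on = rt.SessionOptions()  # default: all optimizations enabled
opt_off = rt.SessionOptions()
opt_off.graph_optimization_level = rt.GraphOptimizationLevel.ORT_DISABLE_ALL

sess_on = rt.InferenceSession(onnx_model, opt_on, providers=['CPUExecutionProvider'])
sess_off = rt.InferenceSession(onnx_model, opt_off, providers=['CPUExecutionProvider'])

x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed MobileNet-style input
input_name = sess_on.get_inputs()[0].name

out_on = sess_on.run(None, {input_name: x})[0]
out_off = sess_off.run(None, {input_name: x})[0]
print("max abs diff:", np.abs(out_on - out_off).max())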

Additional context

My quantized model: mobilenet_onnx_quant_model.onnx.zip

yeliang2258 commented 2 years ago

I found that almost all quantized models show this phenomenon: with optimization turned off, the results align with the float model; with optimization turned on, the accuracy drops.

yufenglee commented 2 years ago

What kind of CPU do you run the model on? Could you please try quantizing the model with the u8u8 format (both activation and weight uint8)? See https://onnxruntime.ai/docs/performance/quantization.html#when-and-why-do-i-need-to-try-u8u8

yeliang2258 commented 2 years ago

My CPU type is: Intel(R) Xeon(R) Gold 6271C

yufenglee commented 2 years ago

My CPU type is: Intel(R) Xeon(R) Gold 6271C

It doesn't have VNNI instructions. How did you generate the model, with the ORT quantization tool or tf2onnx? If you are using the ORT quantization tool, could you please try quantizing the model with the u8u8 format and see if the accuracy improves?
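For reference, a hedged sketch of u8u8 static quantization with ORT's quantize_static follows; the calibration reader, input name, and model paths below are placeholders and would need to be replaced with real preprocessed data.

# Sketch only: u8u8 static quantization with a placeholder calibration reader.
import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantType, quantize_static

class RandomCalibReader(CalibrationDataReader):
    """Stand-in calibration reader; feed real preprocessed images in practice."""
    def __init__(self, input_name="input", n=8):
        self.data = iter(
            [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)} for _ in range(n)]
        )

    def get_next(self):
        return next(self.data, None)

quantize_static(
    "mobilenet_float_model.onnx",        # assumed float model path
    "mobilenet_onnx_quant_u8u8.onnx",    # output path
    RandomCalibReader(),
    activation_type=QuantType.QUInt8,    # uint8 activations
    weight_type=QuantType.QUInt8,        # uint8 weights -> u8u8
)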

yeliang2258 commented 2 years ago

Thank you for your reply. I ran it again and found that there is indeed no precision drop on the VNNI machine. Also, may I ask, can a symmetrically quantized model be converted to a u8u8 format ONNX quantized model?

yufenglee commented 2 years ago

Thanks for your confirmation! So you converted the quantized model from TFLite. Yes, it can be converted to u8u8. You can do it by replacing the int8 zero point in Q/DQ with a uint8 one by adding 128, and applying a similar process to the weights of Conv and Gemm/MatMul.
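A minimal sketch of that rewrite, assuming a QDQ-format model whose int8 zero points and weights live in graph initializers (a real converter would need more validity checks):

# Sketch: shift int8 zero points and int8 weight initializers of Q/DQ nodes to uint8 (+128).
import numpy as np
import onnx
from onnx import numpy_helper, TensorProto

model = onnx.load("mobilenet_onnx_quant_model.onnx")
inits = {t.name: t for t in model.graph.initializer}

def to_uint8(tensor):
    # Rewrite an int8 initializer in place as uint8 with all values shifted by 128.
    arr = numpy_helper.to_array(tensor).astype(np.int16) + 128
    tensor.CopyFrom(numpy_helper.from_array(arr.astype(np.uint8), tensor.name))

for node in model.graph.node:
    if node.op_type in ("QuantizeLinear", "DequantizeLinear"):
        # input[2] is the zero point; input[0] of DequantizeLinear may be an int8 weight
        for idx in (0, 2):
            if idx < len(node.input) and node.input[idx] in inits:
                t = inits[node.input[idx]]
                if t.data_type == TensorProto.INT8:
                    to_uint8(t)

onnx.save(model, "mobilenet_onnx_quant_u8u8_converted.onnx")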

We will add an option in ORT to run an s8s8 model with u8u8 kernels natively on x64 for this kind of case.