microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Quantized model much slower than full precision model #5865

Open · snippler opened this issue 3 years ago

snippler commented 3 years ago

Describe the bug: I had a full-precision onnxruntime session. Then I loaded my network and quantized it with:

```python
from onnxruntime.quantization import quantize, QuantizationMode

# Dynamic (weight-only integer) quantization via the older quantize() API
quantized_model = quantize(model, quantization_mode=QuantizationMode.IntegerOps)
```
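For reference, later onnxruntime releases replaced this `quantize()`/`QuantizationMode` API with `quantize_dynamic`. A minimal sketch of the equivalent call; the file paths here are placeholders:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization in the newer API: reads an .onnx file and
# writes a quantized copy to disk (paths are assumed examples).
quantize_dynamic(
    model_input="model_fp32.onnx",   # original full-precision model
    model_output="model_int8.onnx",  # quantized model written here
    weight_type=QuantType.QInt8,     # quantize weights to signed int8
)
```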

The original model needs 1 s (0.03 s) per inference on CPU, while the quantized model needs 10 s (0.3 s).

What could be the reason, and how can I fix it?

I tested on both ARM and Intel processors (the Intel times are the ones in brackets above).
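For anyone who wants to reproduce this comparison with a recent onnxruntime build, a minimal CPU latency-measurement sketch; the model file names and the 112×112 input shape are assumptions:

```python
import time
import numpy as np
import onnxruntime as ort

def bench(path, runs=50):
    """Return the mean per-inference latency in seconds for one model file."""
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    # MobileFaceNet-style input; adjust to the model's actual shape.
    x = np.random.rand(1, 3, 112, 112).astype(np.float32)
    sess.run(None, {inp.name: x})  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {inp.name: x})
    return (time.perf_counter() - start) / runs

print("fp32:", bench("model_fp32.onnx"))
print("int8:", bench("model_int8.onnx"))
```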

zhanghuanrong commented 3 years ago

This depends entirely on the model. Could you share more details?

snippler commented 3 years ago

I tested MobileFaceNet as implemented in this repository: https://github.com/wujiyang/Face_Pytorch. It is essentially a MobileNetV2.
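MobileNetV2-style models are dominated by depthwise separable convolutions, and on some platforms the integer kernels for those ops can be much slower than their fp32 counterparts. One way to check whether that is the bottleneck is onnxruntime's built-in profiler; a minimal sketch, with the model path and input shape assumed:

```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # write per-operator timings to a JSON trace

sess = ort.InferenceSession("model_int8.onnx", sess_options=opts)
inp = sess.get_inputs()[0]
x = np.random.rand(1, 3, 112, 112).astype(np.float32)  # assumed input shape
sess.run(None, {inp.name: x})

# end_profiling() returns the path of the chrome-trace JSON file; open it
# in a trace viewer to see which operators dominate the runtime.
trace_file = sess.end_profiling()
print("profile written to:", trace_file)
```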

jay-karan commented 3 years ago

Hi @snippler, I had similar problems with ResNet50 models. Did you figure out the reason behind it? @zhanghuanrong, any help is highly appreciated. Thank you!

snippler commented 3 years ago

No progress on my side, @jay-karan. No idea what the reason could be.