microsoft / onnxruntime-inference-examples

Examples for using ONNX Runtime for machine learning inferencing.

My quantized model is not running faster than the unquantized model. #301

Open srikar242 opened 11 months ago

srikar242 commented 11 months ago

Hello @edgchen1 @wejoncy, I tried to quantize the mars model used in DeepSORT tracking. Using the example in image_classification/cpu, I was able to quantize my mars model. The size of the model has been reduced after quantization, but the inference speed of the quantized model has not increased; it is very close to that of the unquantized model. What could be the problem here? I will describe the steps I took to quantize below.

First, the mars model used in the deepsort repo is a TensorFlow .pb model. I took that model and converted it to ONNX format using the tf2onnx utility. Then I applied static quantization to this ONNX model as described in the example under quantization/image_classification/cpu. I successfully got a quantized ONNX model that is smaller in size, but the issue is the inference speed, which has not increased. Any help is appreciated.
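For reference, a minimal sketch of the steps described above. The file names, the input/output node names, and the input shape are assumptions (the real names come from inspecting the .pb graph); the quantization call follows the image_classification/cpu example's use of `quantize_static`:

```python
# Step 1 (shell): convert the TensorFlow frozen graph to ONNX with tf2onnx.
# Node names below are placeholders; inspect the .pb to find the real ones.
#   python -m tf2onnx.convert --graphdef mars-small128.pb \
#       --inputs images:0 --outputs features:0 --output mars.onnx

# Step 2: static quantization, as in quantization/image_classification/cpu.
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)


class MarsDataReader(CalibrationDataReader):
    """Feeds a handful of representative inputs for calibration."""

    def __init__(self, input_name, samples):
        # samples: list of numpy arrays shaped like the model input
        self.data = iter([{input_name: s} for s in samples])

    def get_next(self):
        return next(self.data, None)


# Placeholder calibration data; replace with real crops from the tracking dataset.
samples = [np.random.rand(1, 128, 64, 3).astype(np.float32) for _ in range(16)]

quantize_static(
    "mars.onnx",
    "mars_quant.onnx",
    MarsDataReader("images:0", samples),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```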

tianleiwu commented 10 months ago

Not every model gets a speedup from quantization. For example, it requires that the model spends most of its computation in MatMul or Convolution nodes.
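As a rough first check of whether that is the case, you can count the operator types in the exported graph (a sketch, assuming the converted model is saved as mars.onnx):

```python
from collections import Counter

import onnx

# Count operator types in the ONNX graph. If MatMul/Conv/Gemm make up only a
# small share of the nodes (and of the runtime, per profiling), INT8 gains
# will be limited.
model = onnx.load("mars.onnx")
print(Counter(node.op_type for node in model.graph.node).most_common())
```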

At least you get some benefits, like a smaller model size and possibly lower memory consumption. You also need to take a look at accuracy.
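A minimal sketch for comparing latency and output agreement between the float and quantized models (the file names, input shape, and provider are assumptions carried over from above):

```python
import time

import numpy as np
import onnxruntime as ort

x = np.random.rand(1, 128, 64, 3).astype(np.float32)  # placeholder input


def bench(path, runs=100):
    # Time average latency over `runs` inferences and return the last output.
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    name = sess.get_inputs()[0].name
    sess.run(None, {name: x})  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        out = sess.run(None, {name: x})[0]
    return (time.perf_counter() - start) / runs, out


t_fp32, y_fp32 = bench("mars.onnx")
t_int8, y_int8 = bench("mars_quant.onnx")
print(f"fp32: {t_fp32 * 1e3:.2f} ms   int8: {t_int8 * 1e3:.2f} ms")
print("max abs diff:", np.abs(y_fp32 - y_int8).max())  # rough accuracy check
```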