microsoft / Windows-Machine-Learning

Samples and Tools for Windows ML.
https://docs.microsoft.com/en-us/windows/ai/
MIT License

Can I get inference speed gains using quantization and pruning? #278

Closed WilliamZhaoz closed 5 years ago

WilliamZhaoz commented 5 years ago

Hi,

  1. Can I get an inference run-time speed gain using quantization in WinMLTools?
  2. Can I get an inference run-time speed gain after pruning my model? (Weight pruning, i.e. setting more of the weights to zero.)

Environment:

  * OS: Windows Server 2019
  * Processor: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
  * Python env: an Anaconda virtual environment

Thanks.
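To make the second question concrete, here is a minimal sketch of magnitude-based weight pruning. The function names are illustrative, not from WinMLTools. It also shows why pruning alone does not speed anything up: a dense kernel still multiplies every element, zeros included, so the operation count is unchanged unless the runtime has sparse kernels.

```python
def magnitude_prune(weights, threshold):
    """Zero out weights whose magnitude falls below the threshold."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

weights = [0.9, -0.05, 0.02, -0.7, 0.01, 0.4, -0.03, 0.6]
pruned = magnitude_prune(weights, threshold=0.1)

print(pruned)            # small weights replaced by 0.0
print(sparsity(pruned))  # 0.5 -- half the weights are now zero
```

The pruned list still has the same length and shape as the original, which is why a dense matmul over it costs exactly as much as before.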
martinb35 commented 5 years ago

Hi @WilliamZhaoz , thanks for your questions.

  1. Quantization in WinMLTools is primarily for reducing the size of the model; it can actually make inference run slightly slower. Please verify with WinMLRunner, though.
  2. I think you're asking about sparse tensor support, which is not currently in WinML.

-Brian
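The size-versus-speed trade-off Brian describes can be sketched with plain linear 8-bit quantization (an illustrative sketch, not WinMLTools internals): each float32 weight is stored as one 8-bit code plus a shared scale and offset, cutting storage roughly 4x. Speed only improves if the runtime executes the math in int8; dequantizing back to float32 before each op adds overhead instead.

```python
def quantize(values):
    """Map floats onto the 8-bit code range [0, 255] with a shared scale."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # avoid zero scale for constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float values from the 8-bit codes."""
    return [code * scale + lo for code in codes]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, offset = quantize(weights)
approx = dequantize(codes, scale, offset)
print(codes)   # integer codes, all within [0, 255]
print(approx)  # close to the originals, within one quantization step
```

Each weight now takes 1 byte instead of 4, but every value must be reconstructed (or the kernel must operate on the codes directly) before it can be used, which is where the potential slowdown comes from.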

nlml commented 3 years ago

It seems that in TensorFlow, quantization can deliver speed-ups on CPU (https://www.tensorflow.org/lite/performance/post_training_quantization). I have also seen this in the academic literature.

Are there any plans to improve inference of quantized models on CPU to achieve such speedups?

Cheers,
Liam