openvinotoolkit / nncf

Neural Network Compression Framework for enhanced OpenVINO™ inference
Apache License 2.0

Quantized model acceleration #65

Closed blueskywwc closed 4 years ago

blueskywwc commented 4 years ago

Hello, how does the quantized model (INT8) compare with the original model (FP32) in terms of inference speed? Thank you!

vshampor commented 4 years ago

@kchechil

kchechil commented 4 years ago

here are some benchmarks you can expect if you convert to ONNX and then to OpenVINO IR and run inference on Intel hardware with OpenVINO: https://docs.openvinotoolkit.org/latest/openvino_docs_performance_benchmarks.html. For hardware that supports INT8 natively (e.g. the Cascade Lake or Ice Lake families), the performance boost is up to 4x.
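
For a rough comparison on your own model, you can also time the FP32 and INT8 IRs directly (the benchmark_app tool that ships with OpenVINO is the more rigorous option). A minimal sketch, assuming the pre-2022 OpenVINO Python API (IECore) and placeholder IR paths / input shape:

```python
import time
import numpy as np
from openvino.inference_engine import IECore  # pre-2022 OpenVINO Python API

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder IR paths
input_blob = next(iter(net.input_info))
exec_net = ie.load_network(network=net, device_name="CPU")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
exec_net.infer({input_blob: data})  # warm-up run

start = time.perf_counter()
for _ in range(100):
    exec_net.infer({input_blob: data})
print("average latency, ms:", (time.perf_counter() - start) / 100 * 1000)
```

Running the same loop over the FP32 and the INT8 IR gives a ballpark speedup figure you can compare against the published benchmarks.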

blueskywwc commented 4 years ago

Thank you very much for your reply. I have another question: can I convert the model to ONNX and then convert the ONNX model to ncnn? Thank you!

vshampor commented 4 years ago

Like PyTorch, NNCF only supports exporting models to ONNX. ncnn's README states that it supports ONNX models, so you should be good to go. However, extended quantization functionality such as non-INT8 quantization and mixed-precision quantization is currently only propagated to ONNX via OpenVINO-specific, non-ONNX-standard FakeQuantize nodes, so checkpoints with non-INT8 quantization will probably not be loadable into ncnn.
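
For reference, a minimal export sketch assuming the NNCF 1.x PyTorch API (exact import paths and names may differ between versions; resnet18 and the config values are just placeholders):

```python
import torch
from torchvision.models import resnet18
from nncf import NNCFConfig, create_compressed_model  # NNCF 1.x (nncf_pytorch) API

model = resnet18(pretrained=True)  # placeholder model

# Minimal INT8 quantization config; sample_size is the model's expected input shape
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"}
})

# Wrap the model with quantization operations (fine-tune compressed_model as usual afterwards)
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# Export goes through ONNX only; by default quantizers are emitted as FakeQuantize nodes
compression_ctrl.export_model("quantized_model.onnx")
```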

If you only use INT8 quantization for compression, or no quantization at all (i.e. only the sparsity or filter pruning algorithms), you can set "export_to_onnx_standard_ops": true in the quantization section of the NNCF config file (as described at the bottom of https://github.com/openvinotoolkit/nncf_pytorch/blob/develop/docs/compression_algorithms/Quantization.md), and the resulting ONNX model will then contain ONNX-standard QuantizeLinear/DequantizeLinear node pairs instead of FakeQuantize nodes. This configuration has a better chance of being loadable into ncnn; a sketch of where the flag goes is shown below.
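
As a sketch of where the flag goes (same assumed NNCF 1.x config schema as in the example above; only the compression section changes):

```python
from nncf import NNCFConfig

nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {
        "algorithm": "quantization",
        # emit ONNX-standard QuantizeLinear/DequantizeLinear pairs instead of FakeQuantize
        "export_to_onnx_standard_ops": True
    }
})
```

With this config, compression_ctrl.export_model(...) should produce an ONNX file that ncnn's ONNX importer has a better chance of accepting.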

blueskywwc commented 4 years ago

Thank you very much. My goal is to accelerate inference of a PyTorch model via INT8 quantization and then deploy it on mobile devices. I have tried the quantization tutorial on the official PyTorch website, but I can't convert the quantized model to ONNX, so I want to find out whether a PyTorch model can achieve acceleration on mobile through NNCF to ONNX and OpenVINO. Thank you!