intel / webml-polyfill

Deprecated: the Web Neural Network Polyfill project has been moved to https://github.com/webmachinelearning/webnn-polyfill

[example] Use Post-Training Optimization Toolkit to quantize deeplabv3 model #1239

Closed mingmingtasd closed 4 years ago

mingmingtasd commented 4 years ago

Starting with the 2020.1 version, OpenVINO™ toolkit delivers the Post-Training Optimization Tool designed to accelerate the inference of DL models by converting them into a more hardware-friendly representation by applying specific methods that do not require re-training, for example, post-training quantization. For more details about the low-precision flow in OpenVINO™, refer to the Low Precision Optimization Guide.
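For reference, the quantization is driven by a POT configuration along these lines. This is a minimal sketch written as a Python dict for readability, not the exact config used here: the model and engine paths, the `stat_subset_size` value, and the preset are placeholders, and the schema may differ slightly between POT releases.

```python
import json

# Hypothetical paths; a real config points at the IR produced by the Model Optimizer.
pot_config = {
    "model": {
        "model_name": "deeplabv3",
        "model": "deeplabv3.xml",    # IR topology (placeholder path)
        "weights": "deeplabv3.bin",  # IR weights (placeholder path)
    },
    "engine": {
        # Describes how POT runs the model to collect activation statistics
        # (and accuracy, when AccuracyAwareQuantization is used).
        "config": "accuracy_checker_config.yml",  # placeholder
    },
    "compression": {
        "target_device": "CPU",
        "algorithms": [
            {
                # DefaultQuantization quantizes everything the target device supports;
                # AccuracyAwareQuantization additionally reverts layers whose
                # quantization hurts accuracy too much.
                "name": "DefaultQuantization",
                "params": {"preset": "performance", "stat_subset_size": 300},
            }
        ],
    },
}

with open("deeplabv3_quantization.json", "w") as f:
    json.dump(pot_config, f, indent=4)
# The saved JSON is then passed to the POT command line, e.g. `pot -c <config>`.
```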

mingmingtasd commented 4 years ago

Modified the segmentation demo code to collect only the inference time, and ran this demo on ICL. As the screenshot below shows, on CPU the original model costs 99.4 ms, while the models quantized with the two algorithms (DefaultQuantization and AccuracyAwareQuantization) cost 68.3 ms and 69.4 ms respectively, so inference speeds up by about 1.4x. @huningxin @ibelem

[Screenshot from 2020-05-19 16-04-51]
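Roughly the kind of timing that was collected, as a minimal sketch with the OpenVINO 2020.x Python Inference Engine API (assuming the `IECore.read_network`/`load_network` API of that era) rather than the demo's actual code; the IR paths, input blob name, input shape, and iteration count below are assumptions.

```python
import time
import numpy as np
from openvino.inference_engine import IECore  # OpenVINO 2020.x Python API

ie = IECore()
# Placeholder IR paths; swap in the FP32 or quantized deeplabv3 IR.
net = ie.read_network(model="deeplabv3.xml", weights="deeplabv3.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = "ImageTensor"  # assumed input blob name; take it from the IR in practice
dummy = np.random.rand(1, 3, 513, 513).astype(np.float32)  # assumed deeplabv3 input shape

exec_net.infer({input_name: dummy})  # warm-up, excluded from the measurement
times_ms = []
for _ in range(50):
    start = time.perf_counter()
    exec_net.infer({input_name: dummy})  # time only the inference call
    times_ms.append((time.perf_counter() - start) * 1000)
print("average inference time: %.1f ms" % (sum(times_ms) / len(times_ms)))
```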

ibelem commented 4 years ago

Link to https://github.com/intel/webml-polyfill/issues/790

mingmingtasd commented 4 years ago

To analyze the performance optimization, I tried to compare the per-layer performance of the model before and after quantization. However, since the native segmentation demo uses MKLDNNPlugin 2.1 rather than oneDNN, Verbose Mode can't be used to collect basic statistics such as execution time and primitive parameters. I will try another approach using the OpenVINO DL Workbench tool.
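For context, if the plugin were built on oneDNN, the per-primitive statistics could be dumped simply by enabling verbose mode before launching the demo. A hedged sketch follows; the demo binary and its arguments are placeholders, and older oneDNN/MKL-DNN builds used `MKLDNN_VERBOSE`/`DNNL_VERBOSE` instead.

```python
import os
import subprocess

env = dict(os.environ)
# With verbose mode on, oneDNN prints one line per executed primitive:
# kind, implementation, shapes and execution time.
env["ONEDNN_VERBOSE"] = "1"

# Placeholder command line; the real segmentation demo binary and flags differ.
subprocess.run(["./segmentation_demo", "-m", "deeplabv3.xml", "-d", "CPU"], env=env)
```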

mingmingtasd commented 4 years ago

In this mixed quantized deeplabv3 model, only three types of ops have been quantized: GroupConvolution, Convolution and Add.

Why have only some of the ops been quantized? I investigated and found two reasons so far:

  1. OpenVINO quantization depends on specific libraries and devices. MKL-DNN currently supports only a few int8 primitives: convolution, pooling, eltwise, sum, concat, and reorder. Please see this page1 and this page2 for more details.
  2. The OpenVINO POT quantization tool needs to guarantee the accuracy of the quantized model. If quantizing an op would cause a significant accuracy drop, that op is not quantized.

How to quantize these ops to int8 via OpenVINO? You can use the Intel® Post-Training Optimization Toolkit to quantize the model. The quantization process adds FakeQuantize layers on the activations and weights of most layers. FakeQuantize is an element-wise linear quantization of floating-point input values into a discrete set of floating-point values. The "Fake" in FakeQuantize means the output tensor has the same floating-point type as the input tensor, not an integer type.
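To make the FakeQuantize semantics concrete, here is a minimal NumPy sketch following the element-wise formula in the OpenVINO operation spec; the ranges in the usage line are made up, and `levels=256` corresponds to 8-bit quantization.

```python
import numpy as np

def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Element-wise linear 'fake' quantization: the output stays floating point,
    but only `levels` distinct values are possible inside the output range."""
    x = np.asarray(x, dtype=np.float32)
    below = x <= input_low
    above = x > input_high
    # Quantize to one of `levels` steps inside [input_low, input_high] ...
    q = np.round((x - input_low) / (input_high - input_low) * (levels - 1))
    # ... then map the step index back to a float in [output_low, output_high].
    y = q / (levels - 1) * (output_high - output_low) + output_low
    y = np.where(below, output_low, y)
    y = np.where(above, output_high, y)
    return y

# Made-up ranges, just to show the result is float but takes discrete values:
# nearby inputs 0.1 and 0.1004 collapse onto the same quantized value.
print(fake_quantize([-2.0, 0.1, 0.1004, 3.5], -1.0, 1.0, -1.0, 1.0))
```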

Which ops need to be supported in our example? We need to support two ops to enable the quantized deeplabv3 model: FakeQuantize and GroupConvolution.
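For reference, the semantics the polyfill needs for GroupConvolution can be sketched as a naive NumPy loop (NCHW layout, no padding, dilation, or bias). This is an illustration of the operation, not how it should be implemented in the polyfill.

```python
import numpy as np

def group_convolution(x, w, groups, stride=1):
    """Naive GroupConvolution reference: split the channels into `groups`,
    run an independent convolution per group, and concatenate the outputs.
    x: (N, C_in, H, W), w: (C_out, C_in // groups, kH, kW)."""
    n, c_in, h, w_in = x.shape
    c_out, c_per_group, kh, kw = w.shape
    assert c_in % groups == 0 and c_out % groups == 0
    assert c_per_group == c_in // groups
    out_h = (h - kh) // stride + 1
    out_w = (w_in - kw) // stride + 1
    y = np.zeros((n, c_out, out_h, out_w), dtype=np.float32)
    cig, cog = c_in // groups, c_out // groups
    for g in range(groups):
        xg = x[:, g * cig:(g + 1) * cig]   # this group's input channels
        wg = w[g * cog:(g + 1) * cog]      # this group's filters
        for i in range(out_h):
            for j in range(out_w):
                patch = xg[:, :, i * stride:i * stride + kh, j * stride:j * stride + kw]
                # Contract over (channels, kH, kW): (N, cig, kh, kw) x (cog, cig, kh, kw) -> (N, cog)
                y[:, g * cog:(g + 1) * cog, i, j] = np.tensordot(
                    patch, wg, axes=([1, 2, 3], [1, 2, 3]))
    return y

# Depthwise convolution is the special case groups == C_in with one filter per
# group, which is how GroupConvolution typically appears in deeplabv3.
x = np.random.rand(1, 4, 8, 8).astype(np.float32)
w = np.random.rand(4, 1, 3, 3).astype(np.float32)
print(group_convolution(x, w, groups=4).shape)  # (1, 4, 6, 6)
```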

The two pictures attached below compare the FP32 deeplabv3 model and the mixed quantized deeplabv3 model. You can see that above every op that needs to be quantized there is a FakeQuantize op. In addition, for Convolution and GroupConvolution, some extra ops such as Const, Convert, and Reshape are added.

FP32 deeplabv3 model: [Screenshot from 2020-05-26 14-56-47]
Mixed quantized deeplabv3 model: [Screenshot from 2020-05-26 14-56-26]

@ibelem @huningxin

mingmingtasd commented 4 years ago

Please note that OpenVINO has been upgraded to 2020.3 with some bug fixes. Based on this version, I re-ran the DL Workbench. I tested the inference time of the original and quantized deeplabv3 models on an i7-8700K CPU: quantization achieves about a 1.23x speedup. I also verified the same result with the Benchmark C++ App sample. Please see the details below. I think we also need to test the performance on ICL, because the inference performance depends on the CPU device.

DL Workbench results: [screenshots]

Benchmark C++ App results: [screenshots]
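For reproducibility, the Benchmark C++ App comparison can be driven along these lines; a hedged sketch where the IR paths and the iteration count are placeholders.

```python
import subprocess

# Placeholder IR paths for the FP32 and quantized models.
for model in ["deeplabv3_fp32.xml", "deeplabv3_int8.xml"]:
    # -d picks the device plugin; -niter fixes the iteration count so the
    # FP32 and INT8 runs are directly comparable.
    subprocess.run(["./benchmark_app", "-m", model, "-d", "CPU",
                    "-niter", "100", "-api", "sync"])
```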

mingmingtasd commented 4 years ago
  1. Current Inference Engine solution for low-precision inference uses Intel MKL-DNN and supports inference of the following layers in 8-bit integer computation mode:

Convolution, FullyConnected, ReLU, ReLU6, Reshape, Permute, Pooling, Squeeze, Eltwise, Concat, Resample, MVN

This means that 8-bit inference can only be performed with the CPU plugin on the layers listed above. All other layers are executed in the format supported by the CPU plugin: 32-bit floating point format (fp32).

  2. For the default quantized deeplabv3 runtime graph on CPU, the layers and their precisions are listed below:
| Precision | Layer Type |
| --- | --- |
| FP32/I32/I8 | Const |
| FP32 | Reorder |
| FP32 | Interp |
| FP32/I32 | Output |
| FP32 | Topk |
| FP32/U8 | Quantize |
| FP32 | Power |
| I32 | Reshape |
| I32 | Permute |
| FP32/I8/U8 | Reorder |
| U8 | Input |
| FP32/U8 | Convolution |
| U8 | Concatenation |