Closed mingmingtasd closed 4 years ago
Modify the segmentation demo code to collect only the inference time, and run this demo on ICL. As the screenshot below shows, on CPU the original model took 99.4 ms, while the models quantized with the two algorithms (DefaultQuantization and AccuracyAwareQuantization) took 68.3 ms and 69.4 ms. So inference speeds up by about 1.4X. @huningxin @ibelem
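The ~1.4X figure follows directly from the measured latencies above (speedup = FP32 latency / quantized latency):

```python
# Timings taken from the screenshot numbers quoted above (ms).
fp32_ms = 99.4
default_quant_ms = 68.3    # DefaultQuantization
accuracy_aware_ms = 69.4   # AccuracyAwareQuantization

print(round(fp32_ms / default_quant_ms, 2))   # 1.46
print(round(fp32_ms / accuracy_aware_ms, 2))  # 1.43
```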
To analyze the performance optimization, I tried to compare the per-layer performance of the model before and after quantization. However, since the native segmentation demo uses MKLDNNPlugin 2.1 rather than oneDNN, Verbose Mode can't be used to collect basic statistics such as execution time and primitive parameters. I will try another way using the OpenVINO DL Workbench tool.
In this mixed quantized deeplabv3 model, only three types of ops have been quantized: GroupConvolution, Convolution and Add.
Why have only some of the ops been quantized? I investigated and have found two reasons so far:
How are these ops quantized to int8 via OpenVINO? You can use the Intel® Post-Training Optimization Toolkit to quantize the model. The quantization process adds FakeQuantize layers on activations and weights for most layers. FakeQuantize is an element-wise linear quantization of floating-point input values into a discrete set of floating-point values. The "Fake" in FakeQuantize means the output tensor has the same floating-point type as the input tensor, not an integer type.
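For reference, a POT run is driven by a JSON config; a minimal sketch for the DefaultQuantization algorithm looks roughly like the one below. The paths and dataset settings are placeholders (not the exact ones used here), and the schema can differ slightly between OpenVINO releases:

```json
{
  "model": {
    "model_name": "deeplabv3",
    "model": "<path>/deeplabv3.xml",
    "weights": "<path>/deeplabv3.bin"
  },
  "engine": {
    "type": "simplified",
    "data_source": "<path>/calibration_images"
  },
  "compression": {
    "target_device": "CPU",
    "algorithms": [
      {
        "name": "DefaultQuantization",
        "params": {
          "preset": "performance",
          "stat_subset_size": 300
        }
      }
    ]
  }
}
```

Swapping `"DefaultQuantization"` for `"AccuracyAwareQuantization"` (with an accuracy-checker engine section) selects the second algorithm compared above.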
Which ops need to be supported in our example? We need to support two ops to enable the quantized deeplabv3 model: FakeQuantize and GroupConvolution.
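To make FakeQuantize concrete, here is a minimal scalar sketch of its semantics (clamp to the input range, then a linear round-trip through `levels` discrete grid points); the function and variable names are mine, not from the OpenVINO op set:

```python
def fake_quantize(x, input_low, input_high, output_low, output_high, levels=256):
    """Element-wise FakeQuantize on a single value.

    The output stays floating point ("fake"), but only `levels`
    distinct values are possible inside the output range.
    """
    if x <= min(input_low, input_high):
        return output_low
    if x > max(input_low, input_high):
        return output_high
    # Linear quantization: snap to one of `levels` evenly spaced grid points.
    q = round((x - input_low) / (input_high - input_low) * (levels - 1))
    return q / (levels - 1) * (output_high - output_low) + output_low

# Values outside the input range clamp to the output range bounds.
print(fake_quantize(-5.0, 0.0, 6.0, 0.0, 6.0))  # 0.0
print(fake_quantize(10.0, 0.0, 6.0, 0.0, 6.0))  # 6.0
```

With `levels=256` this mirrors 8-bit quantization: downstream int8 kernels can consume the snapped values while the graph itself still flows FP32 tensors.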
The two pictures below compare the FP32 deeplabv3 model and the mixed quantized deeplabv3 model. You will see that above every op that needs to be quantized there is a FakeQuantize op. In addition, for Convolution and GroupConvolution, some ops such as Const, Convert, and Reshape are added.
FP32 deeplabv3 model:
mixed quantized deeplabv3 model:
@ibelem @huningxin
Please note that OpenVINO has been upgraded to 2020.3 with some bugs fixed. Based on this version, I re-ran the DL Workbench. I tested the inference times of the original and quantized deeplabv3 models on an i7-8700K CPU. It shows that quantization can achieve 1.23X performance. I also verified this with the Benchmark C++ App sample and got the same result. Please see the details below. I think we also need to test the performance on ICL because the inference performance depends on the CPU device.
DL Workbench Result:
Benchmark C++ App:
Convolution, FullyConnected, ReLU, ReLU6, Reshape, Permute, Pooling, Squeeze, Eltwise, Concat, Resample, MVN
This means that 8-bit inference can only be performed with the CPU plugin on the layers listed above. All other layers are executed in the format supported by the CPU plugin: 32-bit floating point format (fp32).
Precision | Layer Type |
---|---|
FP32/I32/I8 | Const |
FP32 | Reorder |
FP32 | Interp |
FP32/I32 | Output |
FP32 | Topk |
FP32/U8 | Quantize |
FP32 | Power |
I32 | Reshape |
I32 | Permute |
FP32/I8/U8 | Reorder |
U8 | Input |
FP32/U8 | Convolution |
U8 | Concatenation |
Starting with the 2020.1 version, the OpenVINO™ toolkit delivers the Post-Training Optimization Tool, designed to accelerate inference of DL models by converting them into a more hardware-friendly representation using methods that do not require re-training, such as post-training quantization. For more details about the low-precision flow in OpenVINO™, refer to the Low Precision Optimization Guide.