The AccuracyAwareQuantization algorithm is aimed at accurate quantization and keeps the model's accuracy within a pre-defined range. This may cause a performance degradation compared to the DefaultQuantization algorithm, because some layers can be reverted back to their original precision.
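For reference, an accuracy-constrained setup with the POT Python API looks roughly like the sketch below. The file names, thresholds, and the `TdnnDataLoader`/`WerMetric` classes are illustrative placeholders, not the exact configuration used in this issue:

```python
# Minimal sketch of AccuracyAwareQuantization via the POT Python API
# (OpenVINO 2022.x). Paths, thresholds, and the loader/metric classes
# are placeholders for illustration only.
from openvino.tools.pot import IEEngine, create_pipeline, load_model, save_model

model = load_model({
    "model_name": "tdnn",
    "model": "310.base.xml",      # IR produced by Model Optimizer
    "weights": "310.base.bin",
})

algorithms = [{
    "name": "AccuracyAwareQuantization",
    "params": {
        "target_device": "CPU",
        "preset": "performance",
        "stat_subset_size": 300,  # samples used to collect activation statistics
        "maximal_drop": 0.005,    # keep reverting layers until accuracy drop <= 0.5%
    },
}]

# TdnnDataLoader and WerMetric stand in for the user-defined subclasses of
# pot.DataLoader and pot.Metric that the accuracy-aware pipeline requires.
engine = IEEngine(config={"device": "CPU"},
                  data_loader=TdnnDataLoader(),
                  metric=WerMetric())
pipeline = create_pipeline(algorithms, engine)
save_model(pipeline.run(model), save_path="optimized")
```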
Thanks for the explanation. I tried default quantization first and got terrible accuracy results, so I am using accuracy-aware quantization now. Comparing the XML after accuracy-aware quantization with the XML after default quantization, the tuning is mainly focused on the model's input part, which I suspect is why performance has not changed. Is there any other method that helps the model keep good performance with little accuracy loss after quantization? I found a Q&A that talked about quantization-aware training.
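(For later readers: quantization-aware training for OpenVINO models is typically done with NNCF. A minimal PyTorch sketch follows; `model`, `train_loader`, and the input shape are assumed to come from the user's own training code:)

```python
# Minimal QAT sketch with NNCF for PyTorch. `model` and `train_loader`
# are assumed to exist; the input shape is a placeholder.
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 1, 40]},     # placeholder input shape
    "compression": {"algorithm": "quantization"},  # insert fake-quantize ops
})
nncf_config = register_default_init_args(nncf_config, train_loader)

# Wraps the model with quantizers; fine-tune it with the usual training loop.
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
# ... fine-tune compressed_model for a few epochs ...
compression_ctrl.export_model("tdnn_int8.onnx")    # hand off to Model Optimizer
```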
@AlexKoff88, @MaximProshin, please advise here.
Best regards, Roman
@fclearner, as far as I understand, you are not satisfied with the inference performance after applying both Default and AccuracyAware quantization. Is that right?
I found that the model itself is relatively small. This can be the reason why you don't see any speedup after quantization. I wonder what HW you are using for benchmarking. My take is that you will see performance benefits when running many instances of the model in parallel (multi-stream execution) or on a low-power CPU. But there can also be issues in the runtime that prevent running this model efficiently.
Can you maybe run the OpenVINO benchmark_app on your host with the "-pc" flag and post the output here?
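(Something along these lines should do it; a sketch where the model path is a placeholder, while -m and -pc are standard benchmark_app flags:)

```python
# Sketch: invoke benchmark_app with per-layer performance counters (-pc)
# and capture the report; "310.base.xml" is a placeholder path.
import subprocess

report = subprocess.run(
    ["benchmark_app", "-m", "310.base.xml", "-pc"],
    capture_output=True, text=True, check=True,
)
print(report.stdout)  # per-layer counters appear near the end of the report
```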
cc'ed @dmitry-gorokhov
@AlexKoff88, thanks for the advice. As you suggested, I tested the OpenVINO benchmark_app; the logs are listed below: 310_quant.log shows the result for the accuracy-aware quantized model and 310_base.log shows the baseline. It seems the quantized model does get better performance. ASR (automatic speech recognition) inference relies on both an acoustic model and a language model; my quantization experiment covers only the acoustic model, while the performance test (and its HW setup) covers the whole ASR inference procedure. I had not considered verifying the quantized acoustic model's performance on its own first, so I will run more tests to evaluate it further. 310_baseline.log 310_quant.log
Thanks for sharing. This data makes sense to me. As I mentioned, you probably will not see a significant performance boost on a powerful CPU in a latency-oriented scenario, since the model is already fast enough. But in the throughput scenario you can expect a decent speedup, as you can see.
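(For reference, a throughput-oriented, multi-stream run can be reproduced with the OpenVINO Python API. A minimal sketch; the model path, static input shape, and request count are assumptions:)

```python
# Minimal sketch of throughput-oriented (multi-stream) inference with the
# OpenVINO runtime; the model path and static input shape are assumptions.
import numpy as np
import openvino.runtime as ov

core = ov.Core()
model = core.read_model("310.base.xml")
compiled = core.compile_model(
    model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"}  # plugin picks the stream count
)

# AsyncInferQueue keeps several infer requests in flight across CPU streams.
infer_queue = ov.AsyncInferQueue(compiled)
dummy = np.random.rand(*compiled.input(0).shape).astype(np.float32)
for _ in range(64):  # an arbitrary number of requests for illustration
    infer_queue.start_async({0: dummy})
infer_queue.wait_all()
```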
Closing this; I hope the previous responses were sufficient to help you proceed. Feel free to reopen and ask additional questions related to this topic.
System information (version)
Detailed description
I tried to use the POT tool to quantize a Kaldi TDNN model. The TDNN model was converted to an OpenVINO IR v10 model before quantization, and I used accuracy-aware quantization. It successfully shrinks the model from 294 MB to 75 MB (memory size), and the accuracy holds up well (only about a 0.5% absolute decrease); however, the RTF gets worse: 0.759 -> 0.784.
Here is my model XML: 310.base.xml.zip
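(For context, RTF is the standard real-time factor: processing time divided by audio duration, so lower is better. The sketch below illustrates the reported values; the 100 s audio duration is an assumed example:)

```python
# Real-time factor: processing time divided by audio duration (lower is better).
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# With an assumed 100 s of audio, the reported values correspond to:
print(rtf(75.9, 100.0))  # 0.759 before quantization
print(rtf(78.4, 100.0))  # 0.784 after quantization (slightly slower)
```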