openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

No RTF improvement after POT quantization #17768

Closed fclearner closed 1 year ago

fclearner commented 1 year ago
System information (version)
Detailed description

I tried to use the POT tool to quantize a Kaldi TDNN model. The TDNN model was converted to an OpenVINO IR (v10) model before quantization, and I am using accuracy-aware quantization. It successfully shrinks the 294 MB model (on disk) to 75 MB, and the accuracy holds up well (roughly a 0.5% absolute decrease). However, the RTF (real-time factor) gets worse: 0.759 -> 0.784.

Here is my model XML: 310.base.xml.zip
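
For context, the flow described above can be set up with the POT Python API roughly as follows. This is a minimal sketch, assuming the openvino-dev 2022.x API; `TdnnDataLoader`, the dummy calibration data, the 440-dim input, and the frame-accuracy metric are hypothetical stand-ins for the actual Kaldi pipeline, not the setup used in this issue.

```python
# Minimal sketch of a POT accuracy-aware quantization pipeline
# (openvino-dev 2022.x API). Loader, metric, and shapes are hypothetical.
import numpy as np
from openvino.tools.pot import (DataLoader, IEEngine, Metric,
                                create_pipeline, load_model, save_model)

class TdnnDataLoader(DataLoader):
    """Hypothetical calibration loader yielding (annotation, data) pairs."""
    def __init__(self, features, labels):
        self.features, self.labels = features, labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, index):
        # POT feeds the data to the model and the annotation to the metric;
        # the exact return format depends on the POT version.
        return self.labels[index], self.features[index]

class FrameAccuracy(Metric):
    """Hypothetical per-frame accuracy metric for the acoustic model."""
    def __init__(self):
        super().__init__()
        self._matches = []

    @property
    def value(self):
        return {"accuracy": [self._matches[-1]]}

    @property
    def avg_value(self):
        return {"accuracy": [float(np.mean(self._matches))]}

    def update(self, output, target):
        predicted = np.argmax(output[0], axis=-1)
        self._matches.append(float(np.mean(predicted == target[0])))

    def reset(self):
        self._matches = []

    def get_attributes(self):
        return {"accuracy": {"direction": "higher-better", "type": "accuracy"}}

# Dummy calibration data just to keep the sketch self-contained.
features = [np.random.randn(1, 440).astype(np.float32) for _ in range(300)]
labels = [np.random.randint(0, 3000, size=(1,)) for _ in range(300)]

model = load_model({"model_name": "tdnn",
                    "model": "310.base.xml", "weights": "310.base.bin"})
engine = IEEngine(config={"device": "CPU"},
                  data_loader=TdnnDataLoader(features, labels),
                  metric=FrameAccuracy())
algorithms = [{"name": "AccuracyAwareQuantization",
               "params": {"target_device": "CPU", "preset": "performance",
                          "stat_subset_size": 300}}]
pipeline = create_pipeline(algorithms, engine)
compressed = pipeline.run(model)
save_model(compressed, save_path="./quantized", model_name="tdnn_int8")
```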

Wan-Intel commented 1 year ago

The AccuracyAwareQuantization algorithm aims at accurate quantization and keeps the model's accuracy within a pre-defined range. This may cause a performance degradation compared to the DefaultQuantization algorithm, because some layers can be reverted back to their original precision.
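
This trade-off is steered by the algorithm's parameters. A hedged illustration of what such a configuration might look like (the values below are illustrative, not tuned for this model):

```python
# Illustrative AccuracyAwareQuantization configuration: maximal_drop bounds the
# allowed accuracy degradation, and the algorithm reverts layers to the original
# precision (costing speed) until the model fits within that bound.
algorithms = [{
    "name": "AccuracyAwareQuantization",
    "params": {
        "target_device": "CPU",
        "preset": "performance",
        "stat_subset_size": 300,
        "maximal_drop": 0.005,  # max absolute accuracy drop tolerated
        "max_iter_num": 30,     # cap on layer-reversion iterations
    },
}]
```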

fclearner commented 1 year ago

> The AccuracyAwareQuantization algorithm aims at accurate quantization and keeps the model's accuracy within a pre-defined range. This may cause a performance degradation compared to the DefaultQuantization algorithm, because some layers can be reverted back to their original precision.

Thanks for the explanation. I tried default quantization first and got terrible accuracy results, so I'm using accuracy-aware quantization now. Comparing the XML after accuracy-aware quantization with the XML after default quantization, the tuning is mainly focused on the input part of the model, which I suspect is why the performance has not changed. Is there any other method that would give the model good performance with little accuracy loss after quantization? I found a Q&A that mentioned quantization-aware training.
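
For reference, quantization-aware training for OpenVINO is typically done with NNCF on the original training code, after which the fine-tuned model is exported and converted to IR. A minimal sketch, assuming a PyTorch reimplementation of the TDNN (`TdnnNet`, the 440-dim input, and the dummy `train_loader` are hypothetical, not part of this issue):

```python
# Hedged sketch of quantization-aware training with NNCF (PyTorch backend).
# The Kaldi model would first need a trainable PyTorch counterpart.
import torch
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

class TdnnNet(torch.nn.Module):
    """Hypothetical stand-in for a PyTorch TDNN reimplementation."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(440, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, 3000))

    def forward(self, x):
        return self.net(x)

# Dummy data loader just to keep the sketch self-contained.
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(32, 440),
                                   torch.randint(0, 3000, (32,))),
    batch_size=8)

model = TdnnNet()
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 440]},       # hypothetical input shape
    "compression": {"algorithm": "quantization"},  # insert fake-quant ops
})
nncf_config = register_default_init_args(nncf_config, train_loader)
ctrl, model = create_compressed_model(model, nncf_config)

# ...fine-tune `model` with the usual training loop; quantization is simulated
# during training so the accuracy can recover...

ctrl.export_model("tdnn_int8.onnx")  # then convert the ONNX file to IR
```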

rkazants commented 1 year ago

@AlexKoff88, @MaximProshin, please advise here.

Best regards, Roman

AlexKoff88 commented 1 year ago

@fclearner, as far as I understand, you are not satisfied with the inference performance after applying both Default and AccuracyAware quantization. Is that right?

I note that the model itself is relatively small. This can be the reason why you don't see any speedup after quantization. I wonder what HW you are using for benchmarking. My take is that you will see performance benefits when running many instances of the model in parallel (multi-stream execution) or on a low-power CPU. But there could also be issues in the runtime that prevent running this model efficiently.

Can you run the OpenVINO benchmark_app on your host with the "-pc" flag and post the output here?

cc'ed @dmitry-gorokhov
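
The multi-stream point above can also be checked from Python: the THROUGHPUT performance hint lets the CPU plugin pick a multi-stream configuration, which is where INT8 models usually show their gains. A minimal sketch, assuming the openvino 2022.x runtime API and the model file attached earlier:

```python
# Compile the model with a throughput hint so the CPU plugin picks a
# multi-stream configuration; INT8 speedups usually show up in this mode.
import openvino.runtime as ov

core = ov.Core()
model = core.read_model("310.base.xml")
compiled = core.compile_model(model, "CPU",
                              {"PERFORMANCE_HINT": "THROUGHPUT"})
print("selected streams:", compiled.get_property("NUM_STREAMS"))
```

The per-layer counters printed by benchmark_app with "-pc" then show which layers actually execute in INT8 and which were reverted to FP32 by the accuracy-aware algorithm.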

fclearner commented 1 year ago

> @fclearner, as far as I understand, you are not satisfied with the inference performance after applying both Default and AccuracyAware quantization. Is that right?
>
> I note that the model itself is relatively small. This can be the reason why you don't see any speedup after quantization. I wonder what HW you are using for benchmarking. My take is that you will see performance benefits when running many instances of the model in parallel (multi-stream execution) or on a low-power CPU. But there could also be issues in the runtime that prevent running this model efficiently.
>
> Can you run the OpenVINO benchmark_app on your host with the "-pc" flag and post the output here?
>
> cc'ed @dmitry-gorokhov

@AlexKoff88, thanks for the advice. As you suggested, I have tested the OpenVINO benchmark_app; the logs are listed below. 310_quant.log shows the result for the accuracy-aware quantized model, and 310_base.log shows the baseline result. It seems the quantized model does get better performance.

ASR (automatic speech recognition) inference relies on both an acoustic model and a language model. My quantization experiment covers only the acoustic model, while the RTF test, and the hardware it ran on, covers the whole ASR inference pipeline. I had not considered verifying the quantized acoustic model's performance in isolation first, and I will run more tests to further evaluate it.

310_baseline.log 310_quant.log

AlexKoff88 commented 1 year ago

Thanks for sharing. This data makes sense to me. As I mentioned, you probably will not see a significant performance boost on a powerful CPU in the latency-oriented scenario, since the model is already fast enough. But in the throughput scenario you can expect a decent speedup, as you can see.

avitial commented 1 year ago

Closing this; I hope the previous responses were sufficient to help you proceed. Feel free to reopen and ask additional questions related to this topic.