openvinotoolkit / openvino

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference
https://docs.openvino.ai
Apache License 2.0

[Bug] Quantized TFLite OpenVINO conversion results in high latency #18260

Closed mnl12 closed 1 year ago

mnl12 commented 1 year ago

Hello,

I first converted the MobileNetV3-Large TensorFlow model to full integer with post-training quantization, following https://www.tensorflow.org/lite/performance/post_training_quantization. Then I converted the resulting TFLite model directly with the new OpenVINO 2023.0 release. However, the latency is much higher than for the non-quantized model: benchmark_app -m model.xml on CPU reports an average of 40.61 ms, while the OpenVINO conversion of the non-quantized TF model takes around 18 ms. I was wondering if you have any suggestions on what may cause the problem. Thanks.

TensorFlow 2.9.1, OpenVINO 2023.0
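For reference, my quantization script follows the full-integer flow from that guide. A rough sketch (here `representative_images` is a stand-in for my actual calibration samples):

```python
import tensorflow as tf

def representative_dataset():
    # Yield a small number of real, preprocessed inputs as float32;
    # the converter uses them to calibrate the quantization ranges.
    for image in representative_images[:100]:
        yield [image[None, ...].astype("float32")]

converter = tf.lite.TFLiteConverter.from_saved_model("mobilenetv3_large_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full-integer quantization of all ops, including model inputs/outputs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_quant_int8.tflite", "wb") as f:
    f.write(converter.convert())
```

The resulting .tflite file is what I then converted with OpenVINO 2023.0, which can read .tflite models directly.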

mnl12 commented 1 year ago

Hello, I was wondering if you have any updates on the issue, or if you need me to send the quantized model. Thanks.

avitial commented 1 year ago

@mnl12 please share the base model and the quantized model in the original framework format so we can take a look. Also, have you tried the Basic Quantization flow to apply quantization to the model?
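For reference, the basic NNCF post-training flow looks roughly like this (a sketch; `calibration_images` is a placeholder for your own preprocessed data):

```python
import nncf
import openvino.runtime as ov

def transform_fn(data_item):
    # Map one dataset item to the model input; identity here because
    # calibration_images is assumed to already hold preprocessed arrays.
    return data_item

model = ov.Core().read_model("model.xml")
calibration_dataset = nncf.Dataset(calibration_images, transform_fn)
quantized_model = nncf.quantize(model, calibration_dataset)
ov.serialize(quantized_model, "model_int8.xml")
```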

mnl12 commented 1 year ago

Thanks for your reply. Regarding the basic quantization flow: our model cannot be converted by NNCF. We had already opened an issue at https://github.com/openvinotoolkit/nncf/issues/1570#issuecomment-1430862097, which concludes that our model cannot be converted with the current version of NNCF. I attached the quantized model and the original in TensorFlow format as you asked. For the quantized model, on my computer benchmark_app -m delg_quant_model_in8.xml -shape [1,512,512,3] reports an average of 29.35 ms, while the normal model averages 5.31 ms.

Attachments: model_6_13_1.zip, delg_quant_model_in8.tflite.zip (https://github.com/openvinotoolkit/openvino/files/12026461/delg_quant_model_in8.tflite.zip)

mnl12 commented 1 year ago

Hello,

I was wondering if you have any updates or potential solutions that I can implement. Thanks.

avitial commented 1 year ago

@mnl12 I suggest running benchmark_app with -pc to look at the performance counters and the execution time per layer.
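The same per-layer counters can also be read through the Python API; a rough sketch, assuming the [1,512,512,3] input shape used above (adjust the shape and dtype to your model):

```python
import numpy as np
import openvino.runtime as ov

core = ov.Core()
model = core.read_model("model.xml")
# PERF_COUNT enables the per-layer counters that benchmark_app -pc prints.
compiled = core.compile_model(model, "CPU", {"PERF_COUNT": "YES"})

request = compiled.create_infer_request()
request.infer([np.zeros((1, 512, 512, 3), dtype=np.float32)])  # dummy input

# Print the 20 slowest layers to spot where the quantized model loses time.
for info in sorted(request.profiling_info, key=lambda i: i.real_time, reverse=True)[:20]:
    print(info.real_time, info.node_type, info.node_name, info.exec_type)
```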

In my observations the optimized quantized model ends up with additional layers (of type Reorder and Pad) that are not present in the optimized non-quantized model and that add run time (cumulatively +10.75 ms). The Subgraph and Add layers also take longer to execute in the quantized model than in the non-quantized one (cumulatively +7.19 ms).

Note I was able to get additional FPS (92.13 vs 76.48) with the latency hint (-hint latency in benchmark_app), but the quantized model is still slower than the non-quantized model.
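The same hint can be applied when compiling the model through the Python API, for example:

```python
import openvino.runtime as ov

core = ov.Core()
model = core.read_model("model.xml")
# Equivalent of benchmark_app's -hint latency: tune for single-stream latency.
compiled = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "LATENCY"})
```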

avitial commented 1 year ago

Closing this; I hope the previous responses were sufficient to help you proceed. Feel free to reopen and provide additional information or ask any questions related to this topic.