microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] Why is the inference latency of an ONNX QDQ quantized model converted from a TFLite quantized model (or from a TensorFlow Quantization-Aware Training (QAT) model) the same as the normal ONNX float32 model? #14707

Open vonJJ opened 1 year ago

vonJJ commented 1 year ago

Describe the issue

I have a pre-trained CNN saved as a TensorFlow SavedModel. I converted it to a .onnx model as well as to a statically quantized .onnx model, and their inference latencies in the same environment are 7 ms and 3 ms respectively. I then applied Quantization-Aware Training (tensorflow-model-optimization) to the pre-trained model and converted the result to a quantized .tflite model. After that I converted the .tflite model to a .onnx model using tf2onnx with --tflite, and I believe it was converted to a quantized (QDQ) format by default. When I test that model, its inference latency is about 7 ms (the same as the normal ONNX float32 model), even though its size is the same as the statically quantized model (about a quarter of the float32 model).
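
As a side note, the statically quantized .onnx baseline mentioned above is typically produced with onnxruntime's quantization tool. A minimal sketch of that step, assuming a placeholder calibration reader fed with random data (the input name, shape and file names are illustrative, not taken from this report):

    import numpy as np
    from onnxruntime.quantization import CalibrationDataReader, QuantFormat, quantize_static

    class RandomCalibrationReader(CalibrationDataReader):
        """Feeds a few random batches as calibration data (stand-in for real samples)."""
        def __init__(self, input_name, shape, n_batches=8):
            self.data = iter([{input_name: np.random.rand(*shape).astype(np.float32)}
                              for _ in range(n_batches)])
        def get_next(self):
            return next(self.data, None)

    quantize_static(
        "model_float32.onnx",                              # float32 model exported from the SavedModel
        "model_int8_qdq.onnx",                             # statically quantized output in QDQ format
        RandomCalibrationReader("input", (1, 224, 224, 3)),
        quant_format=QuantFormat.QDQ,
    )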

According to the QDQ format model instructions in the onnxruntime docs:

ONNX quantization representation format — QDQ format models can be generated by:

  1. Models quantized by quantize_static or quantize_dynamic API, explained below, with quant_format=QuantFormat.QDQ.
  2. Quantization-Aware training (QAT) models converted from Tensorflow or exported from PyTorch.
  3. Quantized models converted from TFLite and other frameworks.

For the latter two cases, you don’t need to quantize the model with the quantization tool. ONNX Runtime can run them directly as a quantized model.

I think my model clearly matches the second or the third case, so why is its inference latency still the same as the float32 model's?
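
For what it's worth, one quick way to confirm that the tf2onnx output really is a QDQ model is to count its QuantizeLinear/DequantizeLinear nodes. A minimal sketch using the onnx Python package (the file name is the one produced in the reproduce steps below):

    from collections import Counter
    import onnx

    model = onnx.load("quantized.onnx")
    op_counts = Counter(node.op_type for node in model.graph.node)

    # A QDQ model should contain QuantizeLinear/DequantizeLinear pairs around the compute ops.
    print("QuantizeLinear:", op_counts.get("QuantizeLinear", 0))
    print("DequantizeLinear:", op_counts.get("DequantizeLinear", 0))
    print("Conv:", op_counts.get("Conv", 0))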

To reproduce

    converter = tf.lite.TFLiteConverter.from_keras_model(quantization_aware_training_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    quantized_tflite_model = converter.convert()
    quant_model_name = 'quantized.tflite'
    with open(quant_model_name, 'wb') as f:
        f.write(quantized_tflite_model)

Then convert the tflite model in the shell:

    python -m tf2onnx.convert --tflite quantized.tflite --output quantized.onnx
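
For reference, the latency numbers quoted above can be reproduced with a simple timing loop on the default CPU execution provider. A minimal sketch (the input shape is a placeholder, not read from the attached model):

    import time
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("quantized.onnx", providers=["CPUExecutionProvider"])
    inp = sess.get_inputs()[0]
    x = np.random.rand(1, 224, 224, 3).astype(np.float32)  # placeholder shape

    # Warm up, then average over repeated runs.
    for _ in range(10):
        sess.run(None, {inp.name: x})
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {inp.name: x})
    print("mean latency: %.2f ms" % ((time.perf_counter() - start) / runs * 1000))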

Urgency

It would be best if this could be resolved within two days.

Platform

Linux

OS Version

gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

7bf08b04c0fda775e5836fbc04bd0024fdd94bb4

ONNX Runtime API

Python 3.8

Architecture

X86_64

Execution Provider

Default CPU

Execution Provider Library Version

tensorflow 2.8.0

Model File

quantized_onnx.zip quantized_tflite.zip

Is this a quantized model?

Yes

chenfucn commented 1 year ago

Our converter and graph optimizer have some hiccups dealing with the QAT model: they fail to pick the most optimized path during inference. This is not something that can be fixed in a couple of days.
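
One way to see which path the optimizer actually picks is to dump the graph as ONNX Runtime will execute it and check whether the QDQ pairs were fused into QLinear* operators. A minimal sketch, offered only as a diagnostic idea:

    import onnx
    import onnxruntime as ort

    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    # Serialize the graph as ONNX Runtime will actually execute it.
    so.optimized_model_filepath = "quantized_optimized.onnx"
    ort.InferenceSession("quantized.onnx", so, providers=["CPUExecutionProvider"])

    ops = {n.op_type for n in onnx.load("quantized_optimized.onnx").graph.node}
    # If no QLinearConv-style ops appear and float Conv remains, the quantized path was not taken.
    print(sorted(ops))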

chenfucn commented 1 year ago

swapped.zip — we manually modified the model and attached the result. You can give it a try while we work on a proper fix. The tool we used for the manual modification is https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/onnx2py.py
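
For anyone attempting a similar manual edit, the plain onnx Python API is a generic alternative to the onnx2py tool linked above; the sketch below only shows the load/inspect/save scaffolding and does not reproduce the specific change made in swapped.zip:

    import onnx

    model = onnx.load("quantized.onnx")

    # Inspect the graph to find the nodes that need to be rearranged.
    for i, node in enumerate(model.graph.node):
        print(i, node.op_type, list(node.input), "->", list(node.output))

    # ... edit model.graph.node in place (reorder nodes, rewire inputs/outputs) ...

    onnx.checker.check_model(model)   # sanity-check the edited graph
    onnx.save(model, "quantized_edited.onnx")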

vonJJ commented 1 year ago

> swapped.zip — we manually modified the model and attached the result. You can give it a try while we work on a proper fix. The tool we used for the manual modification is https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/onnx2py.py

OK, thanks very much, I will give it a try.

hoangtv2000 commented 3 months ago

> Our converter and graph optimizer have some hiccups dealing with the QAT model: they fail to pick the most optimized path during inference. This is not something that can be fixed in a couple of days.

Hi chenfucn, do you have a tool to convert a quantized model from QDQ format to QOperator format? Thank you!

chenfucn commented 3 months ago

@hoangtv2000 This is an issue with the converter, as described in https://github.com/onnx/tensorflow-onnx/issues/2121. You can continue to follow up there. Thanks!