vonJJ opened this issue 1 year ago
Our converter and graph optimizer have some hiccups dealing with the QAT model: they fail to pick the most optimized path during inference. This is not something that can be fixed in a couple of days.
swapped.zip We manually modified the model and attached the result. You can give it a try while we look for a solution. The tool we use for manual model modification is https://github.com/microsoft/onnxconverter-common/blob/master/onnxconverter_common/onnx2py.py
Ok, thanks very much, I will give it a try.
Hi chenfucn, do you have any tool to convert a quantized model from QDQ format to QOperator format? Thank you!
@hoangtv2000 This is an issue with the converter, as in https://github.com/onnx/tensorflow-onnx/issues/2121 You can continue to follow up there. Thanks!
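(Side note, not an official QDQ-to-QOperator converter: when quantizing a float ONNX model from scratch with onnxruntime's quantization tool, the output representation can be chosen with the quant_format argument. A minimal sketch; the model paths, input name, and shape below are placeholders, and the random calibration reader stands in for real calibration data.)

import numpy as np
from onnxruntime.quantization import CalibrationDataReader, QuantFormat, quantize_static

class RandomCalibrationReader(CalibrationDataReader):
    # Placeholder calibration reader; replace the random batches with real samples.
    def __init__(self, input_name, shape, num_batches=8):
        self.batches = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)} for _ in range(num_batches)]
        )

    def get_next(self):
        return next(self.batches, None)

quantize_static(
    "model_fp32.onnx",                                   # hypothetical float model
    "model_qoperator.onnx",
    RandomCalibrationReader("input", (1, 224, 224, 3)),  # input name/shape are assumptions
    quant_format=QuantFormat.QOperator,                  # emit QLinearConv etc. instead of Q/DQ pairs
)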
Describe the issue
I have a pre-trained CNN model saved as a TensorFlow SavedModel. I converted it to .onnx form as well as to a statically quantized .onnx form; in the same environment their inference latencies are 7 ms and 3 ms respectively. Then I applied Quantization-Aware Training (tensorflow-model-optimization) to the pre-trained model and converted it to a quantized .tflite model. After that I converted the .tflite model to .onnx using tf2onnx --tflite, and I think it was converted to a quantized format by default. When I test this model, its inference latency is about 7 ms (the same as the normal float32 .onnx model), even though its size matches the statically quantized model (about a quarter of the float32 model).
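(For reference, the latency numbers above were presumably measured with something along these lines; the model path, provider, and input shape here are assumptions, not taken from the actual benchmark.)

import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("quantized.onnx", providers=["CPUExecutionProvider"])
name = sess.get_inputs()[0].name
x = np.random.rand(1, 224, 224, 3).astype(np.float32)  # assumed input shape

for _ in range(10):                                     # warm-up
    sess.run(None, {name: x})

runs = 100
start = time.perf_counter()
for _ in range(runs):
    sess.run(None, {name: x})
print(f"mean latency: {(time.perf_counter() - start) / runs * 1000:.2f} ms")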
According to QDQ format model instruction in the onnxruntime docs:
ONNX quantization representation format
There are two ways to represent quantized ONNX models:
Operator-oriented (QOperator). All the quantized operators have their own ONNX definitions, like QLinearConv, MatMulInteger, etc.
Tensor-oriented (QDQ; Quantize and DeQuantize). This format inserts DeQuantizeLinear(QuantizeLinear(tensor)) between the original operators to simulate the quantization and dequantization process. In Static Quantization, the QuantizeLinear and DeQuantizeLinear operators also carry the quantization parameters. In Dynamic Quantization, a ComputeQuantizationParameters function proto is inserted to calculate quantization parameters on the fly. Models generated in the following ways are in the QDQ format:
1. Models quantized by the quantize_static or quantize_dynamic API with quant_format=QuantFormat.QDQ
2. Quantization-Aware Training (QAT) models converted from TensorFlow or exported from PyTorch
3. Quantized models converted from TFLite and other frameworks
For the latter two cases, you don’t need to quantize the model with the quantization tool. ONNX Runtime can run them directly as a quantized model.
I think my model clearly matches the second or the third case, so why is its inference latency still the same as the float32 one?
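(One way to see whether ONNX Runtime actually executes the QDQ model with quantized kernels is to dump the graph it runs after optimization and look for fused QLinear* operators; if the Q/DQ pairs are still wrapped around float ops, inference falls back to float32 kernels, which would explain the unchanged latency. A minimal sketch, assuming the converted model is quantized.onnx.)

from collections import Counter
import onnx
import onnxruntime as ort

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.optimized_model_filepath = "quantized_optimized.onnx"  # serialize the graph ORT will actually run
ort.InferenceSession("quantized.onnx", so, providers=["CPUExecutionProvider"])

optimized = onnx.load("quantized_optimized.onnx")
print(Counter(node.op_type for node in optimized.graph.node))
# Many QLinearConv / QLinearMatMul nodes: the QDQ pattern was fused into quantized kernels.
# Mostly Conv plus QuantizeLinear/DequantizeLinear: the model still runs in float32.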
To reproduce
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model(quantization_aware_training_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()

quant_model_name = 'quantized.tflite'
with open(quant_model_name, 'wb') as f:
    f.write(quantized_tflite_model)
Then convert the .tflite model in a shell: python -m tf2onnx.convert --tflite quantized.tflite --output quantized.onnx
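(To confirm the tf2onnx output is really in QDQ format rather than a de-quantized float graph, the node types can be inspected directly; a small check, assuming the output file is quantized.onnx.)

import onnx

model = onnx.load("quantized.onnx")
op_types = [node.op_type for node in model.graph.node]
print("QuantizeLinear:", op_types.count("QuantizeLinear"),
      "DequantizeLinear:", op_types.count("DequantizeLinear"))
# A QDQ-format model has DequantizeLinear/QuantizeLinear pairs around the compute ops;
# zero of either means the conversion dropped the quantization.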
Urgency
It would be best if this could be resolved within two days.
Platform
Linux
OS Version
gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
7bf08b04c0fda775e5836fbc04bd0024fdd94bb4
ONNX Runtime API
Python 3.8
Architecture
X86_64
Execution Provider
Default CPU
Execution Provider Library Version
tensorflow 2.8.0
Model File
quantized_onnx.zip quantized_tflite.zip
Is this a quantized model?
Yes