onnx / tensorflow-onnx

Convert TensorFlow, Keras, Tensorflow.js and Tflite models to ONNX
Apache License 2.0

Quantized model extra node emitted between Q-DQ pair #2121

Open chenfucn opened 1 year ago

chenfucn commented 1 year ago

Describe the bug

When converting a quantized tflite model to onnx, extra nodes (e.g. Transpose, Reshape, etc.) are emitted between Q-DQ pairs. This prevents the ORT graph optimizer from effectively fusing operators and achieving good performance.

Original issue from https://github.com/microsoft/onnxruntime/issues/14707

e.g. tflite model: [screenshot]
converted onnx model: [screenshot]

The Transpose node should sit either before the QuantizeLinear node or after the DequantizeLinear node for the ORT graph optimizer to work.
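
For per-tensor quantization, moving the Transpose across the DequantizeLinear is semantically safe because DequantizeLinear is elementwise (y = (x - zero_point) * scale). A quick numpy check of that claim (a sketch with an arbitrary scale/zero-point, not code from either repo):

```python
import numpy as np

# Arbitrary int8 activation tensor plus per-tensor scale/zero-point.
q = np.random.randint(-128, 128, size=(1, 3, 4, 5), dtype=np.int8)
scale, zero_point = np.float32(0.02), np.int32(3)

def dequantize(x):
    # Per-tensor DequantizeLinear: y = (x - zero_point) * scale
    return (x.astype(np.int32) - zero_point).astype(np.float32) * scale

perm = (0, 2, 3, 1)  # e.g. NCHW -> NHWC
# Transposing before or after dequantization yields identical results.
assert np.array_equal(dequantize(q.transpose(perm)),
                      dequantize(q).transpose(perm))
```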

tflite model: https://github.com/microsoft/onnxruntime/files/10751803/quantized_tflite.zip

converted onnx model: https://github.com/microsoft/onnxruntime/files/10751800/quantized_onnx.zip
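
For anyone triaging, here is a small diagnostic sketch (assuming only the onnx Python package; "quantized.onnx" is a placeholder for the attached model) that lists nodes emitted between a QuantizeLinear and a DequantizeLinear:

```python
import onnx

model = onnx.load("quantized.onnx")  # placeholder path

# Map each tensor name to the nodes that consume it.
consumers = {}
for node in model.graph.node:
    for name in node.input:
        consumers.setdefault(name, []).append(node)

for node in model.graph.node:
    if node.op_type != "QuantizeLinear":
        continue
    for mid in consumers.get(node.output[0], []):
        if mid.op_type == "DequantizeLinear":
            continue  # clean Q -> DQ pair, nothing in between
        # 'mid' runs on quantized data; flag it if it feeds a DQ.
        feeds_dq = any(c.op_type == "DequantizeLinear"
                       for out in mid.output
                       for c in consumers.get(out, []))
        if feeds_dq:
            print(f"{mid.op_type} '{mid.name}' emitted between Q and DQ")
```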

fatcat-z commented 1 year ago

Actually, this is a feature designed and implemented 2 years ago.

tf2onnx has an optimizer that pushes DequantizeLinear down so that most ops end up between a QuantizeLinear/DequantizeLinear pair. I guess the motivation was to reduce memory usage during inference.

Did you observe a big performance gap between the original onnx file and the swapped onnx file mentioned in https://github.com/microsoft/onnxruntime/issues/14707?

If there is a big performance gap, we probably need to consider whether this optimizer should be removed.
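
For reference, a hedged sketch of the swap being discussed: moving a Transpose from between a Q-DQ pair to after the DequantizeLinear. This is only an illustration assuming per-tensor quantization; it is not tf2onnx's actual optimizer code, and the function name is made up:

```python
import onnx

def pull_transpose_after_dq(graph: onnx.GraphProto) -> bool:
    """Rewrite one Q -> Transpose -> DQ triple as Q -> DQ -> Transpose.

    Returns True if a rewrite happened; call repeatedly until it
    returns False to handle all occurrences.
    """
    producer = {out: n for n in graph.node for out in n.output}
    consumers = {}
    for n in graph.node:
        for name in n.input:
            consumers.setdefault(name, []).append(n)

    for i_dq, dq in enumerate(graph.node):
        if dq.op_type != "DequantizeLinear":
            continue
        tr = producer.get(dq.input[0])
        if tr is None or tr.op_type != "Transpose":
            continue
        q = producer.get(tr.input[0])
        if q is None or q.op_type != "QuantizeLinear":
            continue
        if len(consumers.get(tr.output[0], [])) != 1:
            continue  # the int8 Transpose output has other readers; skip
        q_out, t_out, dq_out = tr.input[0], tr.output[0], dq.output[0]
        # Rewire: DQ now consumes the Q output and Transpose consumes the
        # DQ output; downstream consumers of dq_out are untouched.
        dq.input[0], dq.output[0] = q_out, t_out
        tr.input[0], tr.output[0] = t_out, dq_out
        # Swap the two nodes so the list stays topologically sorted.
        i_tr = next(j for j, n in enumerate(graph.node) if n is tr)
        tmp = onnx.NodeProto()
        tmp.CopyFrom(graph.node[i_tr])
        graph.node[i_tr].CopyFrom(graph.node[i_dq])
        graph.node[i_dq].CopyFrom(tmp)
        return True
    return False
```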

chenfucn commented 1 year ago

Yes, there is a huge performance drop when the node separating the Q-DQ pair prevents operator fusion from working. For example:

https://github.com/microsoft/onnxruntime/issues/14707

a very simple model ran more than twice as slow.
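
For concreteness, a minimal timing sketch in the spirit of that comparison (assuming onnxruntime and numpy; the file names, input name, and input shape below are placeholders, not taken from the attached models):

```python
import time
import numpy as np
import onnxruntime as ort

def bench_ms(path, feed, runs=100):
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    for _ in range(10):                 # warm-up
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1e3  # ms per run

x = np.random.rand(1, 224, 224, 3).astype(np.float32)  # placeholder shape
for path in ("quantized_onnx.onnx", "quantized_onnx_swapped.onnx"):
    print(path, f"{bench_ms(path, {'input': x}):.3f} ms/run")  # 'input' is a guess
```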

chenfucn commented 3 months ago

Hi folks, any update? @hoangtv2000