microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] DequantizeLinear, Pad and QuantizeLinear operations are not fused #21496

Open flytair opened 1 month ago

flytair commented 1 month ago

Describe the issue

The DequantizeLinear, Pad, and QuantizeLinear operations in a statically quantized model are not fused into a single operation, even with the optimization level ORT_ENABLE_EXTENDED. My understanding is that the Pad operator should be independent of the data type, so I don't understand why DequantizeLinear and QuantizeLinear are needed for dequantization and quantization before and after the Pad operation, as shown in the screenshot below.

[screenshot: DequantizeLinear -> Pad -> QuantizeLinear subgraph in the quantized model]

To reproduce

Statically quantize a model that contains a Pad operation, then load the quantized model with the optimization level ORT_ENABLE_EXTENDED.
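
For reference, a minimal sketch of those steps (hypothetical file names, input name, and shape; the calibration reader just feeds random data):

```python
# Minimal repro sketch: statically quantize a float model containing a Pad
# node, then load the result with ORT_ENABLE_EXTENDED graph optimizations.
# "model_fp32.onnx", the input name, and the shape are placeholders.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      quantize_static)

class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few random batches for calibration (placeholder data)."""
    def __init__(self, input_name, shape, count=8):
        self._batches = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)}
             for _ in range(count)])

    def get_next(self):
        return next(self._batches, None)

quantize_static(
    "model_fp32.onnx",
    "model_int8.onnx",
    RandomCalibrationReader("input", (1, 3, 224, 224)),
    quant_format=QuantFormat.QDQ,  # insert Q/DQ node pairs around quantized ops
)

so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
sess = ort.InferenceSession("model_int8.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])
# Inspecting the optimized graph still shows DequantizeLinear -> Pad -> QuantizeLinear.
```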

Urgency

No response

Platform

Windows

OS Version

windows11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16.3

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

enc_int8_static_extended_opt.zip

Is this a quantized model?

Yes

fdwr commented 1 month ago

Should {DQ, Pad, Q} be fused, or elided into simply {Pad}? (similarly, I could see many other operators like slice where the preceding and following DQ and Q are elidable)

flytair commented 1 month ago

> Should {DQ, Pad, Q} be fused, or elided into simply {Pad}? (similarly, I could see many other operators like slice where the preceding and following DQ and Q are elidable)

As I understand it, the Pad operator is independent of the data type, so the Q and DQ operators should not be necessary. Can anyone correct me if I'm wrong?

fdwr commented 1 month ago

> Should {DQ, Pad, Q} be fused, or elided into simply {Pad}? (similarly, I could see many other operators like slice where the preceding and following DQ and Q are elidable)

> As I understand it, the Pad operator is independent of the data type, so the Q and DQ operators should not be necessary. Can anyone correct me if I'm wrong?

I see your screenshot shows different scales and zero points for the entering DQ and exiting Q, meaning that would at least require the pad to be followed by a linear rescaling and adjustment, rather than complete elision. (Yufeng Li knows much more about this than I do)
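
For illustration, the rescaling mentioned above can be written out explicitly. The sketch below assumes per-tensor uint8 quantization, with (s_in, z_in) taken from the DequantizeLinear and (s_out, z_out) from the QuantizeLinear; the function name and variables are made up:

```python
# Sketch of the requantization that would replace DQ -> Pad -> Q, assuming
# per-tensor uint8 quantization. Pad does not change existing element values,
# so the pattern is equivalent to padding the quantized tensor and then
# remapping it from (s_in, z_in) to (s_out, z_out):
#     q_out = clip(round((q_in - z_in) * s_in / s_out) + z_out, 0, 255)
import numpy as np

def requantize(q_in, s_in, z_in, s_out, z_out):
    x = (q_in.astype(np.int32) - z_in) * s_in      # dequantize
    q_out = np.rint(x / s_out) + z_out             # requantize to new params
    return np.clip(q_out, 0, 255).astype(np.uint8)

# When s_in == s_out and z_in == z_out this is the identity, which is the
# only case where the Q/DQ pair could simply be elided and Pad run directly
# on the quantized data (the pad constant value would also need to be
# expressed in the quantized domain).
```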

flytair commented 1 month ago

> Should {DQ, Pad, Q} be fused, or elided into simply {Pad}? (similarly, I could see many other operators like slice where the preceding and following DQ and Q are elidable)

> As I understand it, the Pad operator is independent of the data type, so the Q and DQ operators should not be necessary. Can anyone correct me if I'm wrong?

> I see your screenshot shows different scales and zero points for the entering DQ and exiting Q, meaning that would at least require the pad to be followed by a linear rescaling and adjustment, rather than complete elision. (Yufeng Li knows much more about this than I do)

To remove the quantize/dequantize (Q/DQ) pair and add a linear rescaling after the Pad, does this require modifying the ONNX Runtime code, or is there an existing high-level interface that can be reused?
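
One possible direction, sketched here with the onnx Python API rather than any existing ONNX Runtime interface (whether such a pass already exists in ORT is exactly the open question), is a hand-rolled graph edit that collapses DQ -> Pad -> Q into a single Pad when the DQ and Q share the same scale and zero-point initializers. File names are placeholders, and the sketch ignores Pad's optional constant_value input and any other consumers of the intermediate tensors:

```python
# Hand-rolled sketch (not a built-in ORT optimizer pass): collapse
# DequantizeLinear -> Pad -> QuantizeLinear into a single Pad when the DQ
# and Q reference the same scale/zero-point initializers.
import onnx

def elide_dq_pad_q(in_path, out_path):
    model = onnx.load(in_path)
    graph = model.graph
    by_output = {o: n for n in graph.node for o in n.output}

    to_remove = []
    for q in graph.node:
        if q.op_type != "QuantizeLinear":
            continue
        pad = by_output.get(q.input[0])
        if pad is None or pad.op_type != "Pad":
            continue
        dq = by_output.get(pad.input[0])
        if dq is None or dq.op_type != "DequantizeLinear":
            continue
        # Only safe when both nodes use literally the same scale/zero-point
        # tensors; otherwise a rescaling step would be required instead.
        if list(dq.input[1:]) != list(q.input[1:]):
            continue
        # Rewire Pad to consume the quantized tensor and produce Q's output.
        pad.input[0] = dq.input[0]
        pad.output[0] = q.output[0]
        to_remove += [dq, q]

    for node in to_remove:
        graph.node.remove(node)
    onnx.save(model, out_path)

elide_dq_pad_q("model_int8.onnx", "model_int8_elided.onnx")
```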

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.