onnx / onnx

Open standard for machine learning interoperability
https://onnx.ai/
Apache License 2.0

QLinearAdd Op Request #5895

Open cjvolzka opened 9 months ago

cjvolzka commented 9 months ago

QLinearAdd Operator Request

Describe the operator

An Add operator for quantized data. It supports zero_point and scale input tensors for the Add inputs (A and B) and the output (C) tensor. This op exists in ONNX Runtime but not in the ONNX standard operators: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#com.microsoft.QLinearAdd
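For reference, here is a minimal NumPy sketch of what the op computes, following the formula in the ONNX Runtime contrib-op documentation (my own illustration, not the actual kernel; the tensor values are made up):

```python
import numpy as np

def qlinear_add_reference(a, a_scale, a_zero_point,
                          b, b_scale, b_zero_point,
                          c_scale, c_zero_point):
    """Reference semantics for a quantized Add on int8/uint8 tensors:
    C = saturate(round((a_scale*(A - a_zp) + b_scale*(B - b_zp)) / c_scale) + c_zp)
    """
    # Dequantize the inputs to float for this reference computation only;
    # a real kernel keeps the arithmetic in integers.
    a_fp = a_scale * (a.astype(np.float32) - a_zero_point)
    b_fp = b_scale * (b.astype(np.float32) - b_zero_point)
    c_fp = a_fp + b_fp
    # Requantize to the output scale/zero_point and saturate to the dtype range.
    info = np.iinfo(a.dtype)
    c = np.round(c_fp / c_scale) + c_zero_point
    return np.clip(c, info.min, info.max).astype(a.dtype)

# Example: two uint8 tensors, each with its own scale and zero point.
a = np.array([10, 200, 30], dtype=np.uint8)
b = np.array([5, 50, 255], dtype=np.uint8)
print(qlinear_add_reference(a, 0.02, 0, b, 0.05, 128, 0.1, 64))
```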

Several int8 .onnx models in the ONNX Model Zoo validated directory use this op. All but one of the models that have QLinearMatMul also have QLinearAdd.

onnx-mlir would like to support these models, but we only support official ONNX operators.

Can this operator be constructed using existing onnx operators?

Unsure.

Is this operator used by any model currently? Which one?

Several I found offhand in the ONNX Model Zoo: bvlcalexnet-12-int8, mnist-12-int8, vgg16-12-int8

Are you willing to contribute it? (Y/N)

N

Notes

ONNX already has QLinearMatMul and QLinearConv, which these models use, but appears to be missing QLinearAdd.

justinchuby commented 9 months ago

Can this be represented using the Dequantize-Add-Quantize pattern?
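To make the question concrete, here is a minimal sketch of that pattern expressed with standard ONNX ops via onnx.helper (the tensor names, shapes, and scale/zero_point values are illustrative only):

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# Dequantize both inputs, add in float, requantize the result.
nodes = [
    helper.make_node("DequantizeLinear", ["A", "a_scale", "a_zp"], ["A_fp"]),
    helper.make_node("DequantizeLinear", ["B", "b_scale", "b_zp"], ["B_fp"]),
    helper.make_node("Add", ["A_fp", "B_fp"], ["C_fp"]),
    helper.make_node("QuantizeLinear", ["C_fp", "c_scale", "c_zp"], ["C"]),
]

# Per-tensor scales and zero points baked in as initializers (made-up values).
initializers = [
    numpy_helper.from_array(np.array(0.02, dtype=np.float32), "a_scale"),
    numpy_helper.from_array(np.array(0, dtype=np.uint8), "a_zp"),
    numpy_helper.from_array(np.array(0.05, dtype=np.float32), "b_scale"),
    numpy_helper.from_array(np.array(128, dtype=np.uint8), "b_zp"),
    numpy_helper.from_array(np.array(0.1, dtype=np.float32), "c_scale"),
    numpy_helper.from_array(np.array(64, dtype=np.uint8), "c_zp"),
]

graph = helper.make_graph(
    nodes,
    "dq_add_q",
    inputs=[
        helper.make_tensor_value_info("A", TensorProto.UINT8, [3]),
        helper.make_tensor_value_info("B", TensorProto.UINT8, [3]),
    ],
    outputs=[helper.make_tensor_value_info("C", TensorProto.UINT8, [3])],
    initializer=initializers,
)
model = helper.make_model(graph)
onnx.checker.check_model(model)
```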

cjvolzka commented 9 months ago

No. Dequantizing would turn them back into floats, so you'd lose both the memory savings and the performance improvement of integer math on the operation. You'd also have to recalculate scales to requantize, which would add time. Also, the QLinear* ops are for models that were quantized at training time, so dequantizing and requantizing would incur accuracy hits that would defeat the point of the "quantization aware" training of the original model.

Also, from what I've seen, when you have QLinear ops you start with a QuantizeLinear, do a series of QLinear ops on the values, and then there's a DequantizeLinear at the end. Everything between the QuantizeLinear and the DequantizeLinear should stay quantized and use scales and offsets set at training time.

gramalingam commented 9 months ago

Just as a background explanation: there has been a shift towards using the pattern Justin describes. At the model level, an op on quantized tensor(s) is first expressed as "Dequantize => op => Quantize". Then, a backend can rewrite this pattern into a "CustomQuantizedOp" if it has support for doing so.

The reason was to avoid introducing a QLinearX op for many different X ops (like Add, Mul, Sub, Div, Relu, etc.), which would be very disruptive. However, if the industry converges on ops that are worth explicitly supporting in quantized form, they may be worth adding at some point. I am not sure we are there yet, but opinions are welcome.
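As an illustration of that backend-side rewrite, here is a rough sketch of fusing DequantizeLinear -> Add -> QuantizeLinear into ONNX Runtime's com.microsoft QLinearAdd (my own sketch, not an actual onnxruntime pass; a real rewriter would also check that the intermediate float tensors have no other consumers and keep the node list topologically sorted):

```python
import onnx
from onnx import helper

def fuse_dq_add_q(model: onnx.ModelProto) -> None:
    """Fuse DequantizeLinear -> Add -> QuantizeLinear chains into QLinearAdd nodes."""
    nodes = model.graph.node
    # Map each tensor name to the node that produces it.
    by_output = {out: n for n in nodes for out in n.output}

    for q in list(nodes):
        if q.op_type != "QuantizeLinear" or len(q.input) != 3:
            continue
        add = by_output.get(q.input[0])
        if add is None or add.op_type != "Add":
            continue
        dq_a = by_output.get(add.input[0])
        dq_b = by_output.get(add.input[1])
        if not (dq_a is not None and dq_b is not None
                and dq_a.op_type == dq_b.op_type == "DequantizeLinear"
                and len(dq_a.input) == len(dq_b.input) == 3):
            continue

        # QLinearAdd inputs, in the order documented for the contrib op:
        # A, A_scale, A_zero_point, B, B_scale, B_zero_point, C_scale, C_zero_point.
        fused = helper.make_node(
            "QLinearAdd",
            inputs=list(dq_a.input) + list(dq_b.input) + list(q.input[1:]),
            outputs=list(q.output),
            domain="com.microsoft",
        )
        for n in (dq_a, dq_b, add, q):
            nodes.remove(n)
        # Appended at the end for simplicity; a real pass would re-insert
        # the fused node in topological order.
        nodes.append(fused)
```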