huggingface / optimum-quanto

A pytorch quantization backend for optimum
Apache License 2.0
819 stars 61 forks

QLinear quantised scale tensor #137

Closed mdatres closed 7 months ago

mdatres commented 7 months ago

Dear all,

I have noticed that the quantised weights of QLinear modules are QTensors with a scale parameter of dimension out_features. Should the scale not be a scalar value for linear modules (per-channel quantization only makes sense for convolutions and channel-like activation outputs)? Otherwise, how is the quantised operation performed?

Thanks in advance! Max

shuokay commented 7 months ago

If you are familiar with convolution, you can think of a linear layer as a convolution with a (1, 1) kernel.
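The equivalence is easy to check numerically. A minimal numpy sketch (shapes and variable names are illustrative, not quanto's API): applying a weight matrix as a 1x1 convolution over a feature map gives the same result as applying it as a linear layer at every spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))  # (batch, in_features, H, W)
w = rng.standard_normal((16, 8))       # (out_features, in_features)

# A 1x1 convolution is just a matmul over the channel axis at every pixel
conv1x1 = np.einsum('oi,bihw->bohw', w, x)

# The same weights applied as a linear layer to each (H, W) position
linear = (x.transpose(0, 2, 3, 1) @ w.T).transpose(0, 3, 1, 2)

assert np.allclose(conv1x1, linear)
```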

shuokay commented 7 months ago

output_int8 = s_w * s_x / s_o * Linear(W_int8, X_int8). s_x and s_o should be per-tensor; it doesn't matter whether s_w is per-channel or per-tensor, you just broadcast s_w * s_x / s_o.
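This rescaling can be sketched end to end in numpy. The sketch below is a simplified simulation under stated assumptions (symmetric int8 quantization, output requantization omitted, i.e. s_o = 1); it is not quanto's actual kernel, but it shows how a per-channel s_w of shape (out_features,) broadcasts cleanly over the integer matmul result.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations, (batch, in_features)
w = rng.standard_normal((16, 8))   # weights, (out_features, in_features)

# Per-tensor scale for activations, per-channel scale for weights
s_x = np.abs(x).max() / 127.0
s_w = np.abs(w).max(axis=1) / 127.0          # shape (16,), one per output feature

x_int8 = np.clip(np.round(x / s_x), -127, 127)
w_int8 = np.clip(np.round(w / s_w[:, None]), -127, 127)

# Integer matmul, then rescale: s_w broadcasts over the out_features axis
acc = x_int8 @ w_int8.T                      # shape (4, 16)
y = acc * (s_x * s_w)                        # (4, 16) * (16,) broadcasts

# Close to the float result, up to quantization error
assert np.max(np.abs(y - x @ w.T)) < 0.2
```

The key point is the last broadcast: because each column of `acc` corresponds to one output feature, multiplying by a (16,)-shaped scale vector applies the right per-channel scale without any reshaping.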

dacorvo commented 7 months ago

@mdatres, in linear layers, the weights contributing to each output feature are applied independently to each input, just like in convolutions. Hence the per-axis scale. The only constraint is on the inputs, which need to be quantized per-tensor because channels are multiplied and added together.

mdatres commented 7 months ago

Thank you all for the clear answers!