Closed: mdatres closed this issue 7 months ago
If you are familiar with convolution, you can think of a linear layer as a convolution with a (1, 1) kernel:
output_int8 = s_w * s_x / s_o * Linear(W_int8, X_int8)
Here s_x and s_o should be per-tensor scales, while s_w can be per-channel or per-tensor; you just broadcast s_w * s_x / s_o over the output features.
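To make the broadcast concrete, here is a minimal PyTorch sketch of that computation with made-up shapes and values (the names W_int8, X_int8, s_w, s_x, s_o mirror the formula above; this is only an illustration, not the library's actual kernel):

```python
import torch

out_features, in_features, batch = 4, 8, 2

# Hypothetical quantized operands and scales (illustrative values only).
W_int8 = torch.randint(-127, 128, (out_features, in_features), dtype=torch.int8)
X_int8 = torch.randint(-127, 128, (batch, in_features), dtype=torch.int8)
s_w = torch.rand(out_features) * 0.01   # per-channel weight scale, one per output feature
s_x = torch.tensor(0.02)                # per-tensor input scale
s_o = torch.tensor(0.05)                # per-tensor output scale

# The integer matmul (done here in float for simplicity; a real kernel
# accumulates in int32) yields a (batch, out_features) result, so the
# per-channel factor s_w * s_x / s_o broadcasts along the last dimension.
acc = X_int8.float() @ W_int8.float().t()
output_int8 = ((s_w * s_x / s_o) * acc).round().clamp(-128, 127).to(torch.int8)
```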
@mdatres, in linear layers the weights contributing to each output feature are applied independently to each input, just as in convolutions. Hence the per-axis scale. The only constraint is on the inputs, which need to be quantized per-tensor because the input channels are multiplied and added together.
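For illustration, a small sketch of how such a per-output-feature weight scale could be computed (assuming simple absmax scaling; not necessarily how QLinear derives its scales):

```python
import torch

W_fp32 = torch.randn(4, 8)              # (out_features, in_features)

# One scale per output feature: each row of the weight matrix gets its
# own scale, mapping its max magnitude to the int8 range.
s_w = W_fp32.abs().amax(dim=1) / 127.0  # shape: (out_features,)
W_int8 = (W_fp32 / s_w.unsqueeze(1)).round().clamp(-127, 127).to(torch.int8)

# Dequantizing row-wise recovers an approximation of the original weights.
W_deq = W_int8.float() * s_w.unsqueeze(1)
```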
Thank you all for the clear answers!
Dear all,
I have noticed that the quantised weights of the QLinear module are QTensors with a scale parameter of dimension out_features. Should it not be a scalar value in the case of linear modules (per-channel quantization makes sense only for convolutions and channel-like activation outputs)? Otherwise, how is the quantised operation performed?
Thanks in advance! Max