Closed: temporydan closed this issue 5 months ago
@temporydan, if we have low-bit weights (W) and low-bit activations (X) together with a floating-point bias (FP_bias), we can expand W (X + FP_bias) as W X + W FP_bias and precompute the W FP_bias term before inference. This still lets us accelerate the low-bit computation of W X when deploying the quantized model. This practice is common in Quantization-Aware Training (QAT); for more details, see the paper at https://arxiv.org/pdf/2003.03488.pdf. In the future, please make sure to do your research before leaving comments.
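A minimal sketch of the folding described above for the linear case. The tensor names, shapes, and the integer stand-ins for the low-bit values are illustrative assumptions, not the repo's actual qlinear.py code:

```python
import torch

torch.manual_seed(0)
W = torch.randint(-8, 8, (64, 128)).float()   # stand-in for low-bit integer weights
Xq = torch.randint(-8, 8, (32, 128)).float()  # stand-in for low-bit integer activations
fp_bias = torch.randn(128) * 0.05             # floating-point bias added to the activation

# Training-time view: the FP bias is added to the quantized activation before the matmul.
out_train = (Xq + fp_bias) @ W.t()

# Deployment view: W @ fp_bias is folded into a precomputed FP offset, so the
# low-bit GEMM Xq @ W.t() stays untouched and can still be accelerated.
precomputed_offset = fp_bias @ W.t()          # computed once, offline
out_deploy = Xq @ W.t() + precomputed_offset

print(torch.allclose(out_train, out_deploy, atol=1e-4))  # True, up to FP rounding
```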
Thank you so much for your comments. However, can you please show us how this bias can accelerate the Conv op? Conv(x + bias) != Conv(x) + Conv(bias). In practice, the added bias is meaningless for the Conv op. Or you may say we can use im2col to hijack the Conv op. Have you tried? GEMM + im2col is much slower than the Conv op. Besides, you need to store a float16 copy of the unrolled bias for the input x. Using these tricks to improve accuracy is meaningless.
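One way to illustrate the padding issue behind this objection (a hedged sketch under my own assumptions, not the repo's code): with zero padding, a spatially constant per-channel bias contributes a non-constant amount near the borders, so its effect cannot simply be folded into the conv's own per-channel bias term:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
w = torch.randn(8, 4, 3, 3)                 # hypothetical 3x3 conv weight
bias_map = torch.full((1, 4, 16, 16), 0.3)  # spatially constant per-channel FP bias

# Contribution of the bias alone through a zero-padded conv.
contrib = F.conv2d(bias_map, w, padding=1)

center = contrib[0, 0, 8, 8].item()
corner = contrib[0, 0, 0, 0].item()
print(center, corner)  # the two values differ: zero padding truncates the kernel's
                       # support at the border, so the bias contribution is not a
                       # single per-channel constant
```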
In qlinear.py there are a lot of code snippets like this:

```python
input = self.input_quant_fn(input)  # activation quantization
input = self.move_aft(input)        # add a FP bias
out = nn.functional.linear(input, weight)
```

or:

```python
input = self.input_quant_fn(input)  # activation quantization
input = self.move_aft(input)        # add a FP bias
out = nn.functional.conv2d(input, weight, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)
```
My point is: after you quantize the input to n bits, an FP bias is added to it. Then what is the point of the input quantization? How can this still be called activation quantization? Seriously? ICML?
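To make this objection concrete, here is a small hedged sketch; `fake_quant` is a generic symmetric quantizer I am assuming for illustration, not the repo's `input_quant_fn`. After the FP offset is added, the activation values no longer lie on the n-bit grid:

```python
import torch

def fake_quant(x, n_bits=4):
    # Generic symmetric uniform quantizer (illustrative stand-in, not input_quant_fn).
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.round(x / scale).clamp(-qmax, qmax) * scale, scale

torch.manual_seed(0)
x = torch.randn(1, 16)
xq, scale = fake_quant(x)
print(torch.allclose(torch.round(xq / scale), xq / scale))
# True: the quantized activations sit on the n-bit grid

fp_bias = torch.randn(16) * 0.01  # small FP offset, like the bias added by move_aft
x_shifted = xq + fp_bias
print(torch.allclose(torch.round(x_shifted / scale), x_shifted / scale))
# False: after the FP offset, the values are effectively full precision again
```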