nbasyl / OFQ

The official implementation of the ICML 2023 paper OFQ-ViT
MIT License

The activation was never quantized in the code OR WaS iT? #5

Closed · temporydan closed this 5 months ago

temporydan commented 7 months ago

In `qlinear.py` there are many code snippets like this:

```python
input = self.input_quant_fn(input)  # activation quantization
input = self.move_aft(input)        # add a FP bias
out = nn.functional.linear(input, weight)
```

or:

```python
input = self.input_quant_fn(input)  # activation quantization
input = self.move_aft(input)        # add a FP bias
out = nn.functional.conv2d(input, weight, self.bias, self.stride,
                           self.padding, self.dilation, self.groups)
```

My point is: after the input is quantized to n bits, a floating-point bias is added to it. Then what is the point of the input quantization? How can this still be called activation quantization? Seriously? ICML?

nbasyl commented 6 months ago

@temporydan, if we have low-bit weights (W) and low-bit activations (X) along with a floating-point bias (FP_bias), the product of W and FP_bias can be pre-computed before inference, since W * (X + FP_bias) = W * X + W * FP_bias. This still allows us to accelerate the computation of W * X when deploying the quantized model. This practice is commonly used in Quantization-Aware Training (QAT). For more details, you can refer to the paper at https://arxiv.org/pdf/2003.03488.pdf. In the future, please make sure to do your research before leaving comments.
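A minimal sketch of this bias-folding argument for the linear case (tensor names such as `W_q`, `x_q`, and `move_bias` are illustrative assumptions, not the repo's actual identifiers): because the linear layer distributes over the addition, the FP offset can be folded into an ordinary output bias at export time, leaving the matmul itself on low-bit operands.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes; W_q and x_q stand for already-quantized (but still
# float-typed) weight and activation tensors, move_bias for the FP offset.
out_features, in_features, batch = 8, 16, 4
W_q = torch.randint(-8, 8, (out_features, in_features)).float()  # fake low-bit weight
x_q = torch.randint(0, 16, (batch, in_features)).float()         # fake low-bit activation
move_bias = torch.randn(in_features)                             # FP offset added after quantization

# Training-time view: the offset is added to the quantized activation.
y_train = F.linear(x_q + move_bias, W_q)

# Deployment view: fold W_q @ move_bias into an ordinary output bias,
# so the matmul itself still runs on purely low-bit operands.
folded_bias = F.linear(move_bias, W_q)        # shape: (out_features,)
y_deploy = F.linear(x_q, W_q, folded_bias)

print(torch.allclose(y_train, y_deploy, atol=1e-5))  # True
```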

temporydan commented 6 months ago

Thank you so much for your comments. However, can you show us how this bias can accelerate the Conv op? Conv(x + bias) != Conv(x) + Conv(bias). In practice, this added bias is meaningless for the Conv op. Or you may say we can use im2col to hijack the Conv op — have you tried? GEMM + im2col is much slower than the Conv op. Besides, you need to store a float16 copy of the unrolled bias for the input x. Using these tricks to improve the accuracy is meaningless.
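For reference, a minimal sketch of the im2col route mentioned above (all shapes and tensor names here are illustrative assumptions, not code from the repo): the activation and the broadcast FP offset are unfolded and the convolution becomes a single GEMM, with `cols_b` playing the role of the "unrolled bias" copy discussed in this comment.

```python
import torch
import torch.nn.functional as F

# Illustrative im2col-based convolution; shapes and names are assumptions.
N, C_in, H, W_in = 2, 3, 8, 8
C_out, k = 4, 3
x_q = torch.randint(0, 16, (N, C_in, H, W_in)).float()       # fake low-bit activation
weight_q = torch.randint(-8, 8, (C_out, C_in, k, k)).float() # fake low-bit weight
move_bias = torch.randn(C_in, 1, 1)                          # per-channel FP offset

# Reference: offset added to the activation, then a regular conv.
y_ref = F.conv2d(x_q + move_bias, weight_q, padding=1)

# im2col route: unfold both the activation and the (broadcast) offset,
# then run one GEMM. cols_b depends only on the spatial geometry, so it
# is a constant that could be stored/precomputed once.
cols_x = F.unfold(x_q, kernel_size=k, padding=1)                                  # (N, C_in*k*k, H*W)
cols_b = F.unfold(move_bias.expand(1, C_in, H, W_in), kernel_size=k, padding=1)   # (1, C_in*k*k, H*W)
w_mat = weight_q.reshape(C_out, -1)                                               # (C_out, C_in*k*k)

y_cols = w_mat @ (cols_x + cols_b)            # GEMM on unfolded data
y_im2col = y_cols.reshape(N, C_out, H, W_in)

print(torch.allclose(y_ref, y_im2col, atol=1e-4))  # True
```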