megvii-research / FQ-ViT

[IJCAI 2022] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
Apache License 2.0

What is the purpose of clamping zero point in the range of qmin and qmax? #3

Closed. airacid closed this issue 2 years ago.

airacid commented 2 years ago

Hi, thanks for the wonderful work in your paper and code. I was looking into your code and couldn't understand why you need to clamp the zero point to the range of qmin and qmax. I lack knowledge in this field and hope you can explain it to me, please.

https://github.com/linyang-zhh/FQ-ViT/blob/16122ee7ea33e80aed3edd29cfebb3ab2ce2cb69/models/ptq/observer/minmax.py#L49

linyang-zhh commented 2 years ago

@airacid Hi, thanks for your recognition of our work. zero_point needs to be stored in the corresponding data type, such as uint8, so we must ensure that it falls within that type's range to avoid overflow.
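
For concreteness, a minimal sketch of asymmetric min-max calibration (a hypothetical helper, not the exact FQ-ViT observer code), assuming uint8 storage with qmin=0 and qmax=255, showing where the clamp matters:

```python
import torch

def minmax_qparams(x, qmin=0, qmax=255):
    # Per-tensor asymmetric quantization parameters from the observed min/max.
    min_val, max_val = x.min(), x.max()
    scale = (max_val - min_val) / float(qmax - qmin)
    zero_point = qmin - torch.round(min_val / scale)
    # If the tensor is, say, entirely positive, the unclamped zero_point is
    # negative and would overflow once cast to the integer storage type
    # (e.g. uint8), so it is clamped into [qmin, qmax].
    zero_point = torch.clamp(zero_point, qmin, qmax)
    return scale, zero_point
```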

airacid commented 2 years ago

Thank you for the quick reply. I've seen that some works, like https://github.com/skmhrk1209/QuanTorch/blob/804269b8261560130039550d521efabaa1a87f48/quantizers.py, store their zero_point and scale as floats. So I wonder why the implementations differ? I suppose their work is not fully quantized to integer types, but yours is? Sorry for all the questions.

linyang-zhh commented 2 years ago

The work you cited targets CNNs. In that case, the zero_point of a feature map can be fused into the bias of the Conv layer, and the new bias can then be rounded to int. However, this is not possible in some modules of the transformer, such as q@k in the self-attention module and the calculation of the mean and variance in our IntLayerNorm. So in our work, we store zero_point as uint8.
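
As a rough illustration of that fusion (a hypothetical, simplified linear-layer sketch, not the repository's code): with a per-tensor quantized input x = s_x * (q_x - z_x), the zero-point term is input-independent and can be precomputed into the bias, whereas in q@k both operands are runtime-quantized activations, so no static bias can absorb their zero points.

```python
import torch

def fold_zero_point_into_bias(weight, bias, scale_x, z_x):
    # W @ (s_x * (q_x - z_x)) + b
    #   == s_x * (W @ q_x) + (b - s_x * z_x * W.sum(dim=1))
    # The second term does not depend on the input, so it can be precomputed
    # once (and rounded into an integer bias) for Conv/Linear layers.
    return bias - scale_x * z_x * weight.sum(dim=1)

# In self-attention, (q - z_q) @ (k - z_k).T expands into cross terms that
# depend on the runtime values of q and k, so the zero points cannot be
# folded into a constant; the same holds for the mean/variance statistics
# computed inside IntLayerNorm.
```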