megvii-research / FQ-ViT

[IJCAI 2022] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer
Apache License 2.0

A bug of LayerNorm #31

Closed Ther-nullptr closed 1 year ago

Ther-nullptr commented 1 year ago

In the IntLayerNorm code, the authors may have overlooked the case $\sigma=0$: the LayerNorm function $y = \frac{x-\mathbb{E}(x)}{\sigma}$ can then produce NaN. So I think IntLayerNorm should handle the $\sigma=0$ case properly.
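
For reference, a minimal sketch (not the FQ-ViT code itself) that reproduces the failure mode with a standard `nn.LayerNorm`; `eps=0` here just stands in for the integer-only division, which has no epsilon to guard against $\sigma=0$:

```python
import torch
import torch.nn as nn

# Hypothetical token whose features are all identical, so sigma = 0.
x = torch.full((1, 1, 768), 3.0)

# eps=0 mimics a division by sigma with no epsilon term (illustrative only).
ln = nn.LayerNorm(768, eps=0.0)
print(ln(x))  # all NaN: (x - mean) / sqrt(var) is 0 / 0 for this token
```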

linyang-zhh commented 1 year ago

Yes, our IntLayerNorm ignores the case $\sigma=0$, as in Eq. (27) of our paper. We think that case can be ignored because it means all values of one feature map are identical. Furthermore, if all values are identical, the sample carries no information for recognition.

Despite all this, if you're worried about that, we recommend checking the feature map before LayerNorm to ensure $\sum_{i=1,\dots,C} \left| C\widehat{X}_{Q_i} - M_1 \right| > 0$, where $M_1$ is the sum of $\widehat{X}_Q$, as in Eq. (24).
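
For illustration, a minimal sketch of that check, assuming the quantized feature map has shape `(B, N, C)`; the function name and tensor layout are just for this example, not from the FQ-ViT codebase:

```python
import torch

def has_nonzero_sigma(x_q: torch.Tensor) -> torch.Tensor:
    """Per-token check of the condition sum_i |C * x_i - M1| > 0.

    x_q: quantized feature map of shape (B, N, C).
    Returns a (B, N) boolean tensor that is False exactly for tokens whose
    channel values are all identical, i.e. tokens with sigma = 0.
    """
    C = x_q.shape[-1]
    m1 = x_q.sum(dim=-1, keepdim=True)             # M1: per-token channel sum
    return (C * x_q - m1).abs().sum(dim=-1) > 0    # zero iff all channels are equal
```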

Ther-nullptr commented 1 year ago

I tried to use your method to quantize A-ViT (https://a-vit.github.io/) in my lab project. Since A-ViT masks out a portion of the tokens, all features of a masked token become 0, which inevitably produces NaN in IntLayerNorm. I believe quite a few ViT variants adopt this strategy, so I very much hope you can propose a solution that makes this quantization method more scalable.

linyang-zhh commented 1 year ago

To my understanding, if the features of one token are all 0, its values after LayerNorm should also be kept at 0. So perhaps you can save the mask and skip the corresponding tokens in LayerNorm, or set those tokens' features to 0 after LayerNorm.
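
For example, a minimal sketch of the second option, assuming a boolean token mask of shape `(B, N)` is available from A-ViT; the helper name is hypothetical and the LayerNorm is an ordinary module standing in for IntLayerNorm:

```python
import torch
import torch.nn as nn

def masked_layernorm(x: torch.Tensor, ln: nn.Module, token_mask: torch.Tensor) -> torch.Tensor:
    """Apply LayerNorm, then zero out masked tokens so NaNs never propagate.

    x: features of shape (B, N, C)
    ln: a LayerNorm-like module over the channel dimension
    token_mask: (B, N) bool, True for active tokens, False for masked (all-zero) ones
    """
    y = ln(x)
    # Replace the (possibly NaN) outputs of masked tokens with zeros.
    return torch.where(token_mask.unsqueeze(-1), y, torch.zeros_like(y))
```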

Ther-nullptr commented 1 year ago

Understood, thank you!