Closed — Ther-nullptr closed this issue 1 year ago
Yes, our IntLayerNorm ignores the case $\sigma = 0$, as in Eq. (27) of our paper. We think this case can be ignored, since it means all values of one FeatureMap are the same; such a sample carries no information for recognition.
Still, if you are worried about it, we recommend checking the FeatureMap before LayerNorm to ensure $\sum_{i=1,\dots,C}\left|C\widehat{X}_{Q_i} - M_1\right| > 0$, where $M_1$ is the sum of $\widehat{X}_Q$ as in Eq. (24).
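For reference, the suggested check can be sketched in a few lines. This is a minimal illustration (function and variable names are my own, not from the paper's code): the sum $\sum_i |C\widehat{X}_{Q_i} - M_1|$ is zero exactly when every element of a feature vector equals its mean, i.e. when $\sigma = 0$.

```python
import numpy as np

def has_zero_variance_rows(x_q, atol=0):
    """Flag feature vectors (rows) that are constant, i.e. sigma == 0.

    x_q: (num_tokens, C) quantized feature map (the X_hat_Q of the paper).
    Mirrors the check sum_i |C * x_i - M1| > 0 with M1 = sum_i x_i (Eq. 24):
    the sum is zero exactly when all elements of the row are equal.
    """
    c = x_q.shape[-1]
    m1 = x_q.sum(axis=-1, keepdims=True)     # M1 = sum of X_hat_Q
    dev = np.abs(c * x_q - m1).sum(axis=-1)  # sum_i |C * x_i - M1|
    return dev <= atol                       # True where the row is constant


x = np.array([[1, 1, 1],   # constant row -> would make IntLayerNorm NaN
              [1, 2, 3]])  # ordinary row
print(has_zero_variance_rows(x))  # [ True False]
```

Rows flagged `True` are exactly the ones that would produce NaN in the subsequent normalization.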
I tried to use your method to quantize A-ViT (https://a-vit.github.io/) in my lab project. Since A-ViT masks out a portion of the tokens, all feature maps of a masked token become 0, which inevitably results in NaN in IntLayerNorm. I believe quite a few ViT variants use this strategy, so I very much hope you can propose a solution that makes this quantization method more scalable.
To my understanding, if the features of a token are all 0, its values after LayerNorm should also stay 0. So you could save the mask and skip the corresponding tokens in LayerNorm, or set those tokens' features back to 0 after LayerNorm.
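The suggestion above can be sketched as a float reference (not the integer kernel itself; the function and argument names here are illustrative): normalize only the unmasked tokens and force masked tokens to remain all-zero.

```python
import numpy as np

def layernorm_with_token_mask(x, mask, eps=1e-5):
    """Apply LayerNorm only to unmasked tokens; masked tokens stay all-zero.

    x:    (num_tokens, C) float features
    mask: (num_tokens,) boolean, True where the token is kept (unmasked)
    """
    out = np.zeros_like(x, dtype=np.float64)
    kept = x[mask]
    mu = kept.mean(axis=-1, keepdims=True)
    var = kept.var(axis=-1, keepdims=True)
    out[mask] = (kept - mu) / np.sqrt(var + eps)  # normalize kept tokens only
    return out


x = np.array([[0., 0., 0.],   # masked-out token (all zeros, as in A-ViT)
              [1., 2., 3.]])  # ordinary token
mask = np.array([False, True])
y = layernorm_with_token_mask(x, mask)  # no NaN; row 0 stays all-zero
```

Equivalently, one could normalize everything and then multiply the output by the mask, which matches the second option above.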
Understood, thank you!
In the code of IntLayerNorm, the author may have overlooked the case $\sigma = 0$, where the LayerNorm function $y = \frac{x - \mathbb{E}(x)}{\sigma}$ generates NaN. So I think the IntLayerNorm function must handle the $\sigma = 0$ case properly.
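One common guard (a hedged sketch, not the repository's actual kernel) is the standard trick from floating-point LayerNorm: add a small epsilon under the square root so that a constant row no longer divides by zero.

```python
import numpy as np

def safe_layernorm(x, eps=1e-5):
    """LayerNorm y = (x - E[x]) / sigma with an epsilon guard, so that a
    constant feature vector (sigma == 0) yields zeros instead of NaN.
    Names here are illustrative, not from the IntLayerNorm code.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)  # never zero
    return (x - mu) / sigma


x = np.zeros((1, 4))          # all-zero token, sigma == 0
y = safe_layernorm(x)         # finite output, all zeros
```

Note that in a pure-integer pipeline the epsilon must be folded into the integer square-root step instead, so this float version only illustrates the intended behavior.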