sIncerass / powernorm

[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845

GNU General Public License v3.0

119 stars, 17 forks

Why use group scaling? #8

Closed: htwang14 closed this issue 3 years ago

htwang14 commented 3 years ago

Hi Sheng,

I see there is a GroupScaling1D operation at the very beginning of the PN layer. If I understand correctly, GroupScaling1D scales the input features across the channel dimension. I'm confused about why this is necessary, since the operation does not seem to be mentioned in the paper. Is it a standard way to preprocess features in NLP tasks?

Thanks in advance!

Haotao

sIncerass commented 3 years ago

Hi Haotao,

Yeah, we include GroupScaling and mention it in the implementation details section (Appendix). We found that for LM and MT tasks it helps stabilize training, so we added it there. Feel free to change it or raise any questions.

Best, Sheng

htwang14 commented 3 years ago

Thank you!
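
For readers following the thread, here is a rough sketch of what scaling features across the channel dimension in groups could look like. It is not taken from the repository. It assumes that group scaling splits the channels of one feature vector into equal groups and divides each group by the square root of its mean squared value (its second moment). The function name `group_scale`, the default `num_groups=4`, and the default `eps=1e-5` are hypothetical choices for illustration; check the actual `GroupScaling1D` source for the real behavior.

```python
import math


def group_scale(features, num_groups=4, eps=1e-5):
    """Scale one feature vector group-wise by its root mean square.

    Hypothetical helper for illustration only; not the repository's code.
    """
    channels = len(features)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    group_size = channels // num_groups

    scaled = []
    for start in range(0, channels, group_size):
        group = features[start:start + group_size]
        # Second moment: mean of squares over the channels in this group.
        second_moment = sum(x * x for x in group) / group_size
        # eps keeps the division well-defined when a group is all zeros.
        denom = math.sqrt(second_moment + eps)
        scaled.extend(x / denom for x in group)
    return scaled


# Two groups of two channels each; eps=0.0 only to keep the arithmetic exact.
example = group_scale([3.0, 4.0, 0.5, 0.5], num_groups=2, eps=0.0)
print(example)
```

Under these assumptions, every group ends up with a mean squared value of about 1 regardless of its original magnitude, which is one plausible reason such a step could help stabilize training as described above.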