sIncerass / powernorm

[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845
GNU General Public License v3.0

Does PowerNorm still work for NMT task after removing the GroupScaling layer? #9

Closed CheerM closed 3 years ago

CheerM commented 3 years ago

Hi, PN is an interesting work and the performance reported in the manuscript is exciting.

However, I'm wondering whether PN still works after removing GroupScaling. As described in the manuscript, GroupScaling seems like a trick to improve performance, yet it is effectively a variant of LayerNorm and may play a key role in the architecture.

Would you mind sharing an ablation study that removes GroupScaling from PN?
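For reference, the GroupScaling being discussed can be sketched roughly as follows. This is a hypothetical NumPy reimplementation based on the paper's description, not the authors' code: each group of channels is divided by the root of its second moment, with no mean subtraction and no affine parameters (which is what makes it resemble a LayerNorm/RMSNorm variant). The function name, argument names, and defaults are illustrative assumptions.

```python
import numpy as np

def group_scaling(x, num_groups=4, eps=1e-5):
    """Hypothetical sketch of GroupScaling (not the repo's exact code):
    scale each channel group by the root of its second moment.
    Unlike LayerNorm, there is no mean subtraction and no learned affine."""
    T, B, C = x.shape                     # (seq_len, batch, embed_dim)
    g = x.reshape(T, B, num_groups, C // num_groups)
    moment2 = (g ** 2).mean(axis=-1, keepdims=True)   # per-group second moment
    return (g / np.sqrt(moment2 + eps)).reshape(T, B, C)
```

After this scaling, the second moment within each group is approximately 1, analogous to the unit-variance property LayerNorm enforces per token.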

sIncerass commented 3 years ago

Hi there, yes, we include GroupScaling to stabilize training. We do have IWSLT results without GroupScaling: they reach around 35.0, comparable to LayerNorm.

knsong commented 2 years ago

> Hi there, yeah, we have the GroupScaling there for stabilizing the training. We have IWSLT results without GroupScaling, which perform around 35.0 and comparable to LayerNorm.

Hi, are there any details available about the ablation study that removes GroupScaling from PN?

sIncerass commented 2 years ago

Hi @knsong, thanks for asking! We do not have that ablation in the paper, but we have preliminary experiments, as noted in the previous response.

knsong commented 2 years ago

We replaced LayerNorm with PowerNorm (without group scaling, for efficiency) in a network for an image-denoising task, but we get NaN after ~1000 iterations. Do you have any idea what might cause that? @sIncerass
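One way to localize NaNs like this is to watch PowerNorm's core scaling statistic directly. A minimal sketch of that statistic, assuming the paper's description (division by the root of a running quadratic mean, here called `psi2`; no GroupScaling, no affine): if the running quadratic mean collapses toward zero or the activations blow up, this division is the first place non-finite values tend to appear. The function, variable names, and momentum value below are illustrative assumptions, not the repo's implementation.

```python
import numpy as np

def powernorm_scale(x, running_psi2, alpha=0.9, eps=1e-5, training=True):
    """Hedged sketch of PowerNorm's core scaling: divide by the root of a
    running quadratic mean psi^2 (no GroupScaling, no affine).
    Returns the scaled output and the updated running statistic."""
    if training:
        batch_psi2 = (x ** 2).mean()                      # batch quadratic mean
        running_psi2 = alpha * running_psi2 + (1 - alpha) * batch_psi2
    y = x / np.sqrt(running_psi2 + eps)
    return y, running_psi2
```

Checking `np.isfinite(y).all()` (or the PyTorch equivalent via a forward hook) after each normalization layer during training can show which layer, and which iteration, first produces non-finite values, which helps distinguish an exploding input from a collapsing running statistic.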