Closed CheerM closed 3 years ago
Hi there, yeah, we have GroupScaling there to stabilize training. We have IWSLT results without GroupScaling, which perform around 35.0, comparable to LayerNorm.
Hi, are there any details about an ablation study that removes GroupScaling from PN?
Hi @knsong, thanks for asking! We do not have that ablation in our paper, but we have preliminary experiments, as in the previous response.
We replaced LayerNorm with PowerNorm (without GroupScaling, for efficiency) in a network for an image denoising task, but got NaN after ~1000 iterations. Do you have any idea why? @sIncerass
Hi, PN is an interesting work, and the performance reported in the manuscript is exciting.
However, I'm wondering whether PN still works after removing GroupScaling. As described in the manuscript, GroupScaling seems like a trick to improve performance, while it is actually a variant of LayerNorm and probably plays a key role in the architecture.
Would you mind showing an ablation study that removes GroupScaling from PN?
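For reference, here is a minimal sketch of what I understand GroupScaling to do: divide each feature by the root-mean-square of its channel group, with no mean subtraction (which is what distinguishes it from LayerNorm). The function name, shapes, and group count below are my own assumptions for illustration, not the repo's actual API:

```python
import numpy as np

def group_scaling(x, num_groups=4, eps=1e-5):
    """Hypothetical sketch of GroupScaling.

    Scales each feature by the RMS (root of the second moment) of its
    channel group. Unlike LayerNorm, no mean is subtracted and no
    learnable affine parameters are applied here.
    x: array of shape (batch, channels); channels must divide evenly
    into num_groups.
    """
    b, c = x.shape
    assert c % num_groups == 0, "channels must be divisible by num_groups"
    g = x.reshape(b, num_groups, c // num_groups)
    # second moment (mean of squares) per group -- no centering
    moment2 = (g ** 2).mean(axis=2, keepdims=True)
    return (g / np.sqrt(moment2 + eps)).reshape(b, c)
```

If this reading is right, each group of the output has unit second moment, so it acts as an RMS-style variant of LayerNorm, which may be why removing it destabilizes training in some settings.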