sIncerass / powernorm

[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845
GNU General Public License v3.0

Gradient overflow (NaN problem) #2

Closed: CODEJIN closed this issue 3 years ago

CODEJIN commented 4 years ago

Hi,

Thank you for your code! I have a question. I am trying to apply power normalization (PN) to Tacotron 2. However, after I changed batch norm (BN) to PN, an overflow occurred after several thousand training steps. When I checked, the ema_gz buffer in the MaskPowerNorm class grew smaller and smaller during training and finally became NaN. Do you have any opinion or suggestion?
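
For reference, a minimal sketch of the check that surfaced this, assuming ema_gz is registered as a tensor buffer on the module (the helper name is hypothetical):

```python
import torch

def check_powernorm_buffers(model, step):
    # Walk all submodules and flag any ema_gz buffer that has gone
    # non-finite (NaN or inf), which is what happens here after a few
    # thousand training steps.
    for name, module in model.named_modules():
        ema_gz = getattr(module, "ema_gz", None)
        if isinstance(ema_gz, torch.Tensor) and not torch.isfinite(ema_gz).all():
            raise RuntimeError(f"step {step}: non-finite ema_gz in {name}")
```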

Thanks,

Heejo

sIncerass commented 4 years ago

Hi, thanks for your interest, and sorry for the late reply. I would suggest tuning the \alpha_bkw parameter at https://github.com/amirgholami/powernorm/blob/2f23ae75c4f29904175bfd2c6b8248399ff99440/fairseq/modules/norms/mask_powernorm.py#L103. The larger it is, the less variance it introduces in the later phase of training.
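
For example, a sketch of that tuning; the constructor keywords (alpha_fwd / alpha_bkw) and the expected (T, B, C) input layout are assumptions based on mask_powernorm.py, so check your checkout for the exact signature:

```python
import torch
from fairseq.modules.norms.mask_powernorm import MaskPowerNorm

# Moving alpha_bkw closer to 1.0 smooths the backward running statistics
# more heavily, so less variance is injected late in training.
pn = MaskPowerNorm(512, alpha_fwd=0.9, alpha_bkw=0.99)

# The module operates on transformer-style (T, B, C) tensors, so
# Tacotron-style (B, C, T) conv activations may need a transpose first.
x = torch.randn(100, 8, 512)
y = pn(x)
```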

CODEJIN commented 4 years ago

Hi, thank you for your reply. However, the link you sent does not work for me; I get a "Page not found" error.