kimiyoung / transformer-xl

Apache License 2.0

Sensitivity to initial weights causing NaNs? #24

Open arvieFrydenlund opened 5 years ago

arvieFrydenlund commented 5 years ago

Hi, I'm getting NaN values in the first forward pass of the model (in the first layer), generally from the first AC calculation. Could this be caused by the model's initial weights? If so, do you have any advice? I have made some changes to the model, so knowing whether this is a known issue would help me determine whether I have introduced a bug. Thanks.

kimiyoung commented 5 years ago

This seldom happens, and with the given hyper-parameters it should not happen at all. However, when div_val > 1, which reduces the word embedding dimensionality by a factor of div_val for infrequent words, it can happen with low probability in my experience. If it happens to you, try setting div_val = 1 or using smaller initial weights by decreasing init_range or init_std. Hope this helps.
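To see why smaller initial weights can help, here is a rough, standard-library-only sketch of how the size of a content-based attention score (a simplified stand-in for the AC term, without the bias terms) depends on the initialization scale. This is not the repository's code: `mean_abs_score`, `d_model`, `d_head`, and `n_trials` are hypothetical names chosen for illustration, and the idea that large scores are what produce the NaNs here is an assumption, not something confirmed in this thread.

```python
import random


def mean_abs_score(init_std, d_model=64, d_head=16, n_trials=100, seed=0):
    """Average |q . k| when the query and key projections are drawn from
    N(0, init_std^2) and the inputs from N(0, 1). Illustrative only."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Two random input vectors standing in for hidden states.
        x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
        y = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
        # Randomly initialized projection matrices (d_head x d_model).
        w_q = [[rng.gauss(0.0, init_std) for _ in range(d_model)]
               for _ in range(d_head)]
        w_k = [[rng.gauss(0.0, init_std) for _ in range(d_model)]
               for _ in range(d_head)]
        # Project the inputs to a query and a key.
        q = [sum(w * xi for w, xi in zip(row, x)) for row in w_q]
        k = [sum(w * yi for w, yi in zip(row, y)) for row in w_k]
        # Unnormalized attention score for this (query, key) pair.
        total += abs(sum(qi * ki for qi, ki in zip(q, k)))
    return total / n_trials


if __name__ == "__main__":
    for std in (0.01, 0.02, 0.04):
        print(f"init_std={std}: mean |score| = {mean_abs_score(std):.6f}")
```

Because both projections scale linearly with init_std, the score scales with its square, so halving init_std cuts the typical score magnitude to about a quarter. That is one plausible reason decreasing init_std or init_range leaves more numerical headroom in the first forward pass.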