Closed (sliedes closed this issue 7 years ago)
Hi @sliedes , Thanks for the report. I am also facing the same problem. Once I fix it, I will update the info here. Any further inspection of it would be appreciated!
While I understand neither the code nor the theory fully yet, I think the problem is in ScaledDotProductAttention.forward(). It sets the masked values to -Inf before passing them to nn.Softmax, and I think nn.Softmax does not deal well with -Inf. Indeed, while I previously got a NaN loss most of the time already during the first epoch, after modifying ScaledDotProductAttention.forward() to set the masked values to -100.0 instead of -Inf, I have now trained for five epochs without seeing NaNs.
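For illustration, here is a minimal NumPy sketch (not the repo's actual PyTorch code) of why a fully masked score row turns into NaN under softmax, and why a large-but-finite fill value like -100.0 avoids it:

```python
import numpy as np

# A row of attention scores where every key position has been masked
# with -Inf (e.g. all key positions are padding): exp(-inf) = 0, so
# the softmax denominator is 0 and the whole row becomes 0/0 = NaN.
scores = np.array([-np.inf, -np.inf, -np.inf])
probs = np.exp(scores) / np.exp(scores).sum()
assert np.isnan(probs).all()

# Filling with a large finite negative value instead keeps the row
# finite (a uniform distribution over the masked positions), so the
# loss and gradients stay well-defined.
scores_finite = np.full(3, -100.0)
probs_finite = np.exp(scores_finite) / np.exp(scores_finite).sum()
assert np.isfinite(probs_finite).all()
```

Note this only matters for rows where *every* position is masked; if at least one score is finite, softmax over a row containing -Inf is well-behaved.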
Hi @sliedes , Sorry for the late update! You are right that the bug is around the softmax call, but it is not caused by the -Inf values. I misplaced the k/q pair in the attention mask calculation routine. (see 94aae68) Please pull the newest commit to get the fix. Thank you!
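For readers hitting the same issue: a sketch of what a swapped k/q pair in the mask routine can look like (the function name and shapes below are assumptions for illustration, not the repo's actual code). A padding mask for scaled dot-product attention should be built from the *key* sequence and broadcast over queries; building it from the query sequence instead masks the wrong axis of the score matrix:

```python
import numpy as np

def get_attn_pad_mask(seq_q, seq_k, pad_id=0):
    # Hypothetical helper: True marks positions to mask out.
    # The mask must flag padded *key* positions (columns of the
    # q-by-k score matrix), repeated for every query row. Passing
    # the arguments in the wrong order would instead mask query
    # rows, which is the kind of bug described in this thread.
    len_q = len(seq_q)
    key_is_pad = np.array(seq_k) == pad_id        # shape: (len_k,)
    return np.tile(key_is_pad, (len_q, 1))        # shape: (len_q, len_k)

# Last key token is padding, so the last column is fully masked.
mask = get_attn_pad_mask([1, 2, 3], [4, 5, 0])
assert mask.shape == (3, 3)
assert mask[:, 2].all() and not mask[:, :2].any()
```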
I think this fix eliminates the NaN error, and I am sorry for the confusion so far. Let me close this issue now. However, if another fatal NaN error emerges, feel free to open a new issue.
Hi, it seems that the problem still exists even with the fix above.
Both training and validation loss are NaN (using commit e21800a6):