openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/

Loss becomes NAN after a while #139

Open · OleguerCanal opened this issue 2 years ago

OleguerCanal commented 2 years ago

Environment info

Information

Model I am using (ListenAttendSpell, Transformer, Conformer ...): conformer_lstm

The problem arises when using:

To reproduce

Steps to reproduce the behavior: I can't seem to reproduce it on the example dataset.

Expected behavior

The model is training on a very large dataset. A priori, everything seems to be behaving correctly: loss, WER, and CER are going down as expected. However, all of a sudden, the loss randomly goes to NaN, from which it is impossible to recover. Do you guys have any ideas or suggestions?

I added a snippet that zeros the gradients when the loss is NaN, so that the model is not updated on those batches. However, the NaNs start to appear more frequently as training progresses, effectively rendering most of the training steps useless.
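Roughly, the guard looks something like this (a simplified sketch, not the exact code; it assumes a standard `LightningModule` with automatic optimization, and `GuardedModule` is just an illustrative name):

```python
import torch
from pytorch_lightning import LightningModule


class GuardedModule(LightningModule):  # hypothetical name, for illustration only
    def on_after_backward(self):
        # If the backward pass produced any non-finite gradients,
        # zero them all so the optimizer step is effectively a no-op
        # for this batch instead of corrupting the weights.
        has_bad_grads = any(
            p.grad is not None and not torch.isfinite(p.grad).all()
            for p in self.parameters()
        )
        if has_bad_grads:
            self.zero_grad()
```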

sooftware commented 2 years ago

Can you attach the log? cc. @upskyy

OleguerCanal commented 2 years ago

Hey, after some experimenting I managed to avoid it.

I'm not 100% sure, but I believe the issue was training with 16-bit precision. Do you think this might be the case?
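In case it helps anyone else: switching back to full precision is just a Trainer setting. The exact Hydra override key for openspeech's configs may differ, so this is the plain PyTorch-Lightning form:

```python
from pytorch_lightning import Trainer

# Full 32-bit precision (the default). 16-bit mixed precision is what
# appeared to trigger the NaN losses in my runs.
trainer = Trainer(precision=32)

# Mixed precision, for comparison:
# trainer = Trainer(precision=16)
```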

JaeungHyun commented 2 years ago

In my case, it happened with 32-bit precision.