openspeech-team / openspeech

Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.
https://openspeech-team.github.io/openspeech/

Loss becomes NAN after a while #139

Open · OleguerCanal opened this issue 2 years ago

OleguerCanal commented 2 years ago

Environment info

Information

Model I am using (ListenAttendSpell, Transformer, Conformer ...): conformer_lstm

The problem arises when using:

To reproduce

Steps to reproduce the behavior: I can't seem to reproduce it on the example dataset.

Expected behavior

The model is training on a very large dataset. A priori, everything seems to be behaving correctly: loss, WER, and CER are going down as expected. However, all of a sudden, the loss randomly goes to NaN, from which it is impossible to recover. Do you guys have any ideas or suggestions?

I added a snippet that zeros the gradients when the loss is NaN, so that the model is not updated on those batches. However, the NaNs start to appear more frequently as training progresses, effectively rendering most of the training steps useless.
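Roughly, the guard looks something like this (a simplified sketch, not the exact code; it assumes a standard `LightningModule` with automatic optimization, and `GuardedModule` is just an illustrative name):

```python
import torch
from pytorch_lightning import LightningModule


class GuardedModule(LightningModule):  # hypothetical name, for illustration only
    def on_after_backward(self):
        # If the backward pass produced any non-finite gradients,
        # zero them all so the optimizer step is effectively a no-op
        # for this batch instead of corrupting the weights.
        has_bad_grads = any(
            p.grad is not None and not torch.isfinite(p.grad).all()
            for p in self.parameters()
        )
        if has_bad_grads:
            self.zero_grad()
```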

sooftware commented 2 years ago

Can you attach the log? cc. @upskyy

OleguerCanal commented 2 years ago

Hey, after some experimenting I managed to avoid it.

I'm not 100% sure, but I believe the issue was training with 16-bit precision. Do you think this might be the case?
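In case it helps anyone else: switching back to full precision is just a Trainer setting. The exact Hydra override key for openspeech's configs may differ, so this is the plain PyTorch-Lightning form:

```python
from pytorch_lightning import Trainer

# Full 32-bit precision (the default). 16-bit mixed precision is what
# appeared to trigger the NaN losses in my runs.
trainer = Trainer(precision=32)

# Mixed precision, for comparison:
# trainer = Trainer(precision=16)
```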

JaeungHyun commented 2 years ago

In my case, it happened with 32-bit precision.