open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[Help]: Loss NaN occurred while training VALL-E during the second stage (the NAR decoder) #211

Closed. Ming-er closed this issue 2 weeks ago.

Ming-er commented 1 month ago

Sorry to bother you. I've been working on training the VALL-E model from scratch on LibriTTS, and I followed the scripts to prepare the training data. Due to hardware limitations (1 x RTX 4090), I used fp16 and set the batch_size to 1 to train the AR decoder for 20 epochs, leaving the other configs unchanged. The training/validation loss converged to about 2.79/3.06, respectively, after the last epoch. However, when I used the trained AR decoder from the last epoch to train the NAR decoder, the loss started out normally, fluctuating between 15 and 5, but became NaN during the second half of epoch 0. Can you help me find the problem? I am looking forward to your reply.
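(For anyone debugging a similar failure, below is a small hedged helper, not part of Amphion, that could be called after `loss.backward()` in a training loop like the one described above to report the step at which the loss or any gradient first goes non-finite. The name `check_finite` and its signature are made up for illustration.)

```python
# Hedged debugging sketch: report non-finite losses / gradients during training.
# This is illustration only, not Amphion trainer code.
import torch


def check_finite(loss: torch.Tensor, model: torch.nn.Module, step: int) -> bool:
    """Return True if the loss and all parameter gradients are finite; print offenders otherwise."""
    ok = True
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss {loss.item()}")
        ok = False
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"step {step}: non-finite gradient in {name}")
            ok = False
    return ok
```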

jiaqili3 commented 1 month ago

Hi, thanks for using our repo! A NaN loss usually indicates that the learning rate is too large, or that your data contains out-of-distribution samples that break the normal optimization. I would suggest decreasing the learning rate; clipping the gradient norm (with torch.nn.utils.clip_grad_norm_) would also help, and you can search for how to use it.
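(As a rough illustration of both suggestions, here is a minimal sketch of a generic fp16 training step with a reduced learning rate and gradient-norm clipping. It is not Amphion's trainer; `model`, `train_loader`, and `compute_loss` are hypothetical placeholders, and the LR and max_norm values are arbitrary examples.)

```python
# Hedged sketch: lower learning rate + gradient-norm clipping in an fp16 step.
# Not Amphion code; `model`, `train_loader`, and `compute_loss` are placeholders.
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # example: a reduced learning rate
scaler = torch.cuda.amp.GradScaler()                        # dynamic loss scaling for fp16

for batch in train_loader:
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = compute_loss(model, batch)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale first so clipping sees the true gradient magnitudes
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
```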

We have a new version of the VALL-E training code coming soon, which should provide better performance and faster convergence. Stay tuned, and I hope you'll like it!

jiaqili3 commented 2 weeks ago

Hi @Ming-er, our new VALL-E release is out, and in our experiments we haven't encountered the NaN loss issue. Gradient clipping (clip_grad_norm) has also been added to its trainer. If you have further questions, you're welcome to reopen the issue. Thanks!