k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

inf pruned_loss in tot_loss #1011

Closed: nullscc closed this issue 7 months ago

nullscc commented 1 year ago

Hi, I've trained a stateless7_streaming model on GigaSpeech and fine-tuned it on about 2700 hours of my own data, with the finetune code modified according to #944. I use on-the-fly features, by the way.

But I found a lot of inf pruned_loss values in tot_loss, as shown below:

2023-04-18 21:26:04,349 INFO [finetune.py:966] (3/4) Epoch 0, batch 500, loss[loss=1.196, simple_loss=1.08, pruned_loss=0.77, over 11669.00 frames. ], tot_loss[loss=inf, simple_loss=2.379, pruned_loss=inf, over 2150353.73 frames. ], batch size: 361, lr: 5.00e-05, grad_scale: 0.25

I've also attached my log file here (submitted via Slurm, with some sensitive text removed): finetune_sgeng.log

I have fine-tuned for 5 epochs so far, but the model has become worse compared to the original checkpoint I fine-tuned from.

There may be a problem with my data, but could anyone give some advice on how to debug and fix this? Thanks.

marcoyang1998 commented 1 year ago

> But I found a lot of inf pruned_loss values in tot_loss, as shown below:

Does this still happen if you train a model from scratch using your own data?

The validation loss is going down according to your log; it would be more helpful if you could provide a full log. Also, have you tried decoding with earlier epochs/checkpoints? The model can overfit very quickly.

Also, I noticed that you are fine-tuning from epoch-40.pt. If possible, could you fine-tune from an averaged checkpoint instead? This might not be crucial, but it may make fine-tuning a bit more robust.
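
For reference, here is a rough sketch of how one could produce an averaged checkpoint to fine-tune from, assuming icefall's average_checkpoints helper in icefall/checkpoint.py (check the exact API in your icefall version; the paths and epoch numbers are only illustrative):

```python
import torch
from icefall.checkpoint import average_checkpoints

# Hypothetical experiment directory and epoch range; adjust to your setup.
exp_dir = "pruned_transducer_stateless7_streaming/exp"
filenames = [f"{exp_dir}/epoch-{e}.pt" for e in range(36, 41)]  # epochs 36-40

# Average the model weights of the last few epochs and save the result in the
# usual {"model": state_dict} layout so the finetune script can load it.
avg_state_dict = average_checkpoints(filenames, device=torch.device("cpu"))
torch.save({"model": avg_state_dict}, f"{exp_dir}/epoch-40-avg-5.pt")
```

You would then point the finetune script at epoch-40-avg-5.pt instead of epoch-40.pt.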

danpovey commented 1 year ago

The overall loss can become inf due to just one bad example, e.g. because fp16 training caused something to go out of numerical range. I believe that in more recent versions of icefall, the code that tracks the loss was changed to get rid of this inf in the running-total loss. Make sure to use a smaller learning rate than the one the original run started with. Also, you'll want the learning rate to decrease more slowly than normal, which can be achieved by setting --lr-batches to an extremely large number (then it will only decrease by epoch).
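
To illustrate why a very large --lr-batches has that effect, here is a small sketch of the Eden-style schedule used by the stateless7 recipes (treat the exact formula as an assumption and check optim.py in your icefall version; the numbers are only illustrative):

```python
# Approximation of the Eden schedule: lr = base_lr * batch_factor * epoch_factor.
def eden_lr(base_lr, batch, epoch, lr_batches=5000.0, lr_epochs=3.5):
    batch_factor = ((batch**2 + lr_batches**2) / lr_batches**2) ** -0.25
    epoch_factor = ((epoch**2 + lr_epochs**2) / lr_epochs**2) ** -0.25
    return base_lr * batch_factor * epoch_factor

# With the default lr_batches, the LR has already decayed noticeably within
# the first epochs of fine-tuning:
print(eden_lr(5e-5, batch=50_000, epoch=1))
# With an extremely large lr_batches, batch_factor stays ~1, so the LR only
# decreases with the epoch count:
print(eden_lr(5e-5, batch=50_000, epoch=1, lr_batches=1e9))
```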

yaozengwei commented 1 year ago

You could drop samples whose loss values are infinite during training. Please refer to https://github.com/k2-fsa/icefall/blob/45c13e90e42d0f6ff190d69acb18f4e868bfa954/egs/librispeech/ASR/pruned_transducer_stateless4/train.py#L634-L664
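
The idea, in a minimal self-contained sketch (not the exact code at the link above), is to mask out utterances whose loss is not finite before reducing, so a single bad sample cannot poison the whole batch:

```python
import torch

def finite_loss_sum(per_utt_loss: torch.Tensor) -> torch.Tensor:
    """Sum per-utterance losses, skipping non-finite entries."""
    finite = torch.isfinite(per_utt_loss)
    if not finite.all():
        # You could also log which utterances were dropped here.
        per_utt_loss = per_utt_loss[finite]
    return per_utt_loss.sum()

# Example: one utterance produced an inf loss; it is excluded from the total.
losses = torch.tensor([1.2, float("inf"), 0.9])
print(finite_loss_sum(losses))  # tensor(2.1000)
```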