Closed: nullscc closed this issue 7 months ago.
Does this still happen if you train a model from scratch using your own data?
The validation loss is going down in your log; it would be more helpful if you could provide the full log. Also, have you tried decoding with earlier epochs/checkpoints? The model can overfit very quickly.
Also, I noticed that you are fine-tuning from epoch-40.pt. If possible, could you fine-tune from an averaged checkpoint instead? This might not be crucial, but it may make fine-tuning a bit more robust.
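If it helps, here is a minimal sketch of building such an averaged starting point with icefall's `average_checkpoints` helper; the experiment directory, output filename, and the choice of epochs to average are made up for illustration:

```python
import torch
from icefall.checkpoint import average_checkpoints

# Hypothetical paths: average the last few epochs of the original run
# instead of starting the fine-tune from epoch-40.pt alone.
exp_dir = "pruned_transducer_stateless7_streaming/exp"
epochs_to_average = [36, 37, 38, 39, 40]  # illustrative choice, tune as needed
filenames = [f"{exp_dir}/epoch-{i}.pt" for i in epochs_to_average]

# average_checkpoints returns an averaged model state_dict
avg_state_dict = average_checkpoints(filenames, device=torch.device("cpu"))

# Save it in the layout the recipes expect (a "model" key), then point the
# fine-tuning script at this file instead of epoch-40.pt.
torch.save({"model": avg_state_dict}, f"{exp_dir}/epoch-40-avg-5.pt")
```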
The overall loss can become inf due to just one bad example, e.g. because fp16 training caused something to go out of numerical range. I believe in more recent versions of icefall, the code that tracks the loss was changed to get rid of this inf in the running-total loss. Make sure to use a smaller learning rate than the original run started with. Also you'll want the learning rate to decrease more slowly than normal, which can be achieved by setting --lr-batches to an extremely large number (then it will only decrease by epochs).
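For intuition on why a very large `--lr-batches` stops the per-batch decay, here is a rough sketch of the Eden-style schedule used in recent icefall recipes (the exact formula and default constants may differ across versions; the defaults below are only illustrative):

```python
def eden_lr(base_lr: float, batch: int, epoch: int,
            lr_batches: float = 5000.0, lr_epochs: float = 6.0) -> float:
    """Roughly the Eden schedule: the lr decays with both the batch count
    and the epoch count.

    If lr_batches is set to an extremely large number, the batch factor stays
    ~1.0 for any realistic batch index, so the lr only decreases with epochs.
    """
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    return base_lr * batch_factor * epoch_factor


# Example: with lr_batches=1e9 the batch factor is ~1.0 even after 100k batches,
# so only the epoch term reduces the learning rate.
print(eden_lr(0.003, batch=100_000, epoch=2, lr_batches=1e9))
```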
You could get rid of samples whose loss values are infinite in training. Please refer to https://github.com/k2-fsa/icefall/blob/45c13e90e42d0f6ff190d69acb18f4e868bfa954/egs/librispeech/ASR/pruned_transducer_stateless4/train.py#L634-L664
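The linked code keeps only the finite per-utterance losses before summing the batch loss; a condensed sketch of that idea (simplified, not the exact upstream implementation) could look like:

```python
import torch


def reduce_finite_losses(simple_loss: torch.Tensor,
                         pruned_loss: torch.Tensor):
    """Drop utterances whose loss is inf/nan before summing the batch loss.

    simple_loss / pruned_loss are per-utterance losses of shape (batch_size,).
    This mirrors the filtering idea in the linked train.py, but is only a
    simplified sketch.
    """
    simple_finite = torch.isfinite(simple_loss)
    pruned_finite = torch.isfinite(pruned_loss)
    if not torch.all(simple_finite & pruned_finite):
        if not simple_finite.any() or not pruned_finite.any():
            # Every utterance in the batch produced inf/nan: likely a data or
            # numerical-range problem worth inspecting, so fail loudly.
            raise ValueError("All losses in this batch are inf or nan")
        simple_loss = simple_loss[simple_finite]
        pruned_loss = pruned_loss[pruned_finite]
    return simple_loss.sum(), pruned_loss.sum()
```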
Hi, I've trained a stateless7_streaming model on Gigaspeech and fine-tuned it on about 2700 hours of my own data; the fine-tuning code was modified according to #944. I use on-the-fly feats, by the way.
But I found a lot of `inf` values for `pruned_loss` in `tot_loss`, as shown below. I'm also attaching my log file (submitted using slurm, with some sensitive text deleted): finetune_sgeng.log
I have fine-tuned for 5 epochs so far, but found that the model became worse compared to the original checkpoint I fine-tuned from.
There may be some problem with my data, but could anyone give some advice on how to debug and fix this? Thanks.