But after about 11,000 batches, I got a ctc_loss=0, att_loss=nan, loss=nan problem.
Could you print out the batch that causes ctc_loss=0 and attn_loss=nan?
From the attached log (see above), it seems it was training normally until a certain point, then always ctc_loss=0, att_loss=nan. I suspect the model is always outputting inf, which gets filtered in the CTC loss computation but not in the other loss computations. This seems to have happened suddenly.
2022-09-19 23:40:57,640 INFO [train.py:556] Epoch 0, batch 11300, loss[ctc_loss=0.3759, att_loss=0.4902, loss=0.4559, over 7457.00 frames. ], tot_loss[ctc_loss=0.405, att_loss=0.4955, loss=0.4683, over 1494451.87 frames. ], batch size: 28
2022-09-19 23:41:47,152 INFO [train.py:556] Epoch 0, batch 11350, loss[ctc_loss=0.356, att_loss=0.4576, loss=0.4271, over 6522.00 frames. ], tot_loss[ctc_loss=0.4023, att_loss=0.4919, loss=0.465, over 1484310.24 frames. ], batch size: 24
2022-09-19 23:42:35,838 INFO [train.py:556] Epoch 0, batch 11400, loss[ctc_loss=0, att_loss=nan, loss=nan, over 12124.00 frames. ], tot_loss[ctc_loss=0.3367, att_loss=nan, loss=nan, over 1562841.32 frames. ], batch size: 108
2022-09-19 23:43:23,191 INFO [train.py:556] Epoch 0, batch 11450, loss[ctc_loss=0, att_loss=nan, loss=nan, over 6951.00 frames. ], tot_loss[ctc_loss=0.2653, att_loss=nan, loss=nan, over 1543849.74 frames. ], batch size: 25
2022-09-19 23:44:09,710 INFO [train.py:556] Epoch 0, batch 11500, loss[ctc_loss=0, att_loss=nan, loss=nan, over 6107.00 frames. ], tot_loss[ctc_loss=0.2004, att_loss=nan, loss=nan, over 1590892.37
One possibility would be to change the training code to check whether the sum of the encoder output is nan/inf (e.g. x = encoder_output.sum().item(); if x - x != 0:), and if it is, dump out the model and the optimizer state as a checkpoint and then exit. That would make it possible to look for inf's in the model and in the optimizer state by looping over the keys of the model dict.
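A minimal sketch of that check, assuming the training loop has access to the encoder output, model, and optimizer (the function name and output filename here are placeholders, not part of the actual recipe):

```python
import sys

import torch


def maybe_dump_state(encoder_out: torch.Tensor,
                     model: torch.nn.Module,
                     optimizer: torch.optim.Optimizer,
                     filename: str = "bad-state.pt") -> None:
    # x - x == 0 for any finite x, but is nan for nan or +/-inf,
    # so this single comparison catches both cases.
    x = encoder_out.sum().item()
    if x - x != 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
            filename,
        )
        sys.exit(f"encoder output is nan/inf; state saved to {filename}")
```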
But after about 11,000 batches, I got a ctc_loss=0, att_loss=nan, loss=nan problem.
Could you print out the batch that causes ctc_loss=0 and attn_loss=nan?
Sure. Thank you very much for your quick reply. I'm quite new to this field. How should I print out the batch? Check if ctc_loss=0 and then print it with logging.info(f"batch {batch_idx}, batch info: {batch}")?
Thank you! I also suspect it's a problem of inf from very short utterances. However, I checked the durations of the utterances and segments: no segment or utterance is shorter than 0.2 seconds, and each has text inside like "YES", "UH", etc. I'm not sure whether I should filter out these short utterances or segments, and at what duration threshold. Maybe less than 0.5 seconds?
I tried training with a subset of 140 hrs of data (all short utterances, 2-5 seconds long). The training goes through and gets a reasonable WER.
How should I print out the batch?
Please have a look at https://github.com/k2-fsa/icefall/blob/9ae2f3a3c5a3c2336ca236c984843c0e133ee307/egs/librispeech/ASR/pruned_transducer_stateless2/train.py#L785
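For reference, a minimal sketch of what such a helper could look like, using plain torch.save; the function name and filename below are made up for illustration, and the linked icefall code may differ in its details:

```python
import logging

import torch


def save_bad_batch(batch: dict, filename: str = "bad-batch.pt") -> None:
    # Log a short summary so the offending batch shows up in the training log ...
    supervisions = batch["supervisions"]
    logging.info(f"features shape: {batch['inputs'].shape}")
    logging.info(f"num utterances: {len(supervisions['text'])}")
    logging.info(f"texts: {supervisions['text']}")
    # ... and save the whole batch so it can be inspected later with torch.load().
    torch.save(batch, filename)
    logging.info(f"Saved the batch to {filename}")
```

One could call this right after detecting ctc_loss == 0 or a non-finite att_loss in the training loop.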
How should I print out the batch?
Please have a look at https://github.com/k2-fsa/icefall/blob/9ae2f3a3c5a3c2336ca236c984843c0e133ee307/egs/librispeech/ASR/pruned_transducer_stateless2/train.py#L785
Hi csukuangfj, I managed to print out the first batch that started to have ctc_loss=0. I attach the output in a file for your reference. The batch seems quite big; I guess that is because I set --max-duration 500 in the training command ./conformer_ctc/train.py. But if I reduce --max-duration to the default 120, ctc_loss=0 still happens at later batches; there are a lot more batches, and many of them get skipped. Thank you! [Uploading log_prepare_stage14_maxDur500_1GPU_02_debug_extract_batch.txt…]()
Sorry, I am not able to download it.
Sorry, I am not able to download it.
I uploaded it to Google Drive: https://drive.google.com/file/d/1W4vCzn75Jjnh9hbeujWmkPvSmM6X-K_v/view?usp=sharing Looking at the batch, I still have no idea.
Could you use … to save it to a .pt file?
Guys, I think this is not about the batch, I think this is an instability issue where there are NaN's or inf's appearing in the model. These will be filtered out by the CTC loss but not the other losses. I think the focus needs to be on the model parameters, not the batch.
I notice this code is printing out warnings:
batch_size = len(batch["supervisions"]["text"])
# Ref: https://github.com/k2-fsa/icefall/issues/320
if batch["inputs"].shape[0] != len(batch["supervisions"]["text"]):
    logging.info("In train_one_epoch, skipping batch, batch_idx = " + str(batch_idx))
    continue
else:
... this is odd. Do you have CutConcatenate enabled in asr_datamodule.py? [But it might not be related to the error.]
CutConcatenate
Hi Dan, I checked the GigaSpeech recipe: the --concatenate-cuts default is false and I don't see it set to true anywhere, so I guess there is no CutConcatenate here. The if/else part I added follows the suggestion in issue #320 https://github.com/k2-fsa/icefall/issues/320 because I had the same Invalid Input Size error at the beginning.
Could you use … to save it to a .pt file?
Thanks, I will update you when it's ready.
Training a conformer with CTC & pruned RNN-T may get nan or inf on an internal dataset that contains many samples shorter than 1 second. After setting zero_infinity=True in torch.nn.CTCLoss, training looks normal; still waiting for the final results.
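For reference, a small self-contained example of the flag mentioned above. This uses torch.nn.CTCLoss directly; as far as I know the conformer_ctc recipe computes its CTC loss with k2 rather than PyTorch, so this only illustrates the PyTorch option:

```python
import torch

# zero_infinity=True replaces infinite losses (and their gradients) with zero.
# Infinite CTC losses typically come from inputs that are too short to emit
# the whole target sequence.
ctc = torch.nn.CTCLoss(blank=0, reduction="sum", zero_infinity=True)

T, N, C, S = 50, 4, 30, 10
log_probs = torch.randn(T, N, C).log_softmax(2)           # (T, N, C)
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # (N, S)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss)
```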
Could you use … to save it to a .pt file?
Hi Fangjun, I uploaded the starting batch with ctc_loss=0 at https://drive.google.com/file/d/1kUJc12Jsi8OjVjeIOLINlArj9jjjKiuS/view?usp=sharing
./conformer_ctc/train.py doesn't have sp, hence I set it to None. I also saved epoch 0 but need some time to upload it, maybe tomorrow. Thanks.
Guys, I think this is not about the batch, I think this is an instability issue where there are NaN's or inf's appearing in the model. These will be filtered out by the CTC loss but not the other losses. I think the focus needs to be on the model parameters, not the batch.
Hi Dan,
I saved the model when the batch got ctc_loss=0 and uploaded it at https://drive.google.com/file/d/1dqfxjWzkR-7j1H0gFuF946AbEHIrT7oC/view?usp=sharing if you want to have a look (big file, about 1.2 GB). I don't know how to look inside the model for more info yet. Thank you!
Hi guys, I will redo the data part: I will remove utterances/segments that are shorter than 1 second (some still have good speech inside) and repeat the training to see whether the error goes away. I observe that some audio files, although about 1 second long, are empty inside; I guess it's an error during recording, and I'm not sure whether this causes the problem. Thank you very much for your help!
You can load it as a dict with torch.load() and look at the keys and values; e.g. x['model'].keys() will give a list of the parameter names.
It's not so easy for me to download the file from here. Please look for inf's/nan's yourself.
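A minimal sketch of that inspection; the checkpoint filename and the exact keys in the dict are assumptions and should be adjusted to whatever the saved file actually contains:

```python
import torch

ckpt = torch.load("bad-state.pt", map_location="cpu")
print(ckpt.keys())                    # e.g. dict_keys(['model', 'optimizer', ...])

state_dict = ckpt["model"]
print(list(state_dict.keys())[:10])   # first few parameter names

# Report every floating-point tensor that contains nan or +/-inf.
for name, tensor in state_dict.items():
    if tensor.is_floating_point() and not torch.isfinite(tensor).all():
        print(name, tuple(tensor.shape), "contains nan/inf")
```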
All of the parameters are nan. After you get one nan, they tend to propagate. One way to debug it would be to run the training in pdb and change the code to throw an exception if any part of the loss is nan. Then, from the debugger command line, see what is inf/nan and which parts of the model are inf/nan, which might help narrow it down. To make debugging easier, if it fails consistently at a certain position, the code could be changed to write a checkpoint at a fixed batch number just before then, which would make it easier to reproduce the problem. It's been a while since we looked at this recipe, as we have mostly been focusing on RNN-T systems ("pruned_transducer*").
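A sketch of what that exception check might look like inside the training loop; the variable names (ctc_loss, att_loss, loss, model, batch_idx) are assumed to match those in train.py, which already imports torch and logging:

```python
# Right after the losses are computed in the training loop.
if not (torch.isfinite(ctc_loss) and torch.isfinite(att_loss) and torch.isfinite(loss)):
    # Report which parameters already contain inf/nan before bailing out.
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            logging.info(f"{name} contains inf/nan")
    # Running under `python -m pdb ./conformer_ctc/train.py ...` lets you
    # inspect tensors and parameters at this point via post-mortem debugging.
    raise RuntimeError(
        f"Non-finite loss at batch {batch_idx}: "
        f"ctc_loss={ctc_loss}, att_loss={att_loss}, loss={loss}"
    )
```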
All of the parameters are nan. After you get one nan, they tend to propagate. One way to debug it would be to run the training in pdb and change the code to throw an exception if any part of the loss is nan. Then, from the debugger command line, see what is inf/nan and which parts of the model are inf/nan, which might help narrow it down. To make debugging easier, if it fails consistently at a certain position, the code could be changed to write a checkpoint at a fixed batch number just before then, which would make it easier to reproduce the problem. It's been a while since we looked at this recipe, as we have mostly been focusing on RNN-T systems ("pruned_transducer*").
Hi Daniel, thank you very much for your support! When I encountered the first nan, I exited and saved the model right away, but it seems the nan was already in it. I have filtered out the segments that are shorter than 1 second and the training went well (still having to filter out bad batches as above). I got quite a good WER on read speech (utterances usually about 2-7 seconds long). However, because we filtered out short segments (about 36,000 segments) which are still meaningful and belong to the conversational dataset, the WER on the conversational test set is worse than with the traditional Kaldi model.
I'm training another model with pruned_transducer_stateless_5, but I saw that inside the code it also filters out segments shorter than 1 second. Have a good weekend!
Could you try #604?
Pruned RNN-T shares the same constraint as CTC training, so you can copy the filtering code from #604.
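For anyone following along, such filtering is typically applied to the lhotse CutSet before building the DataLoader, roughly as in the sketch below. The thresholds are examples only, and the underlying constraint is that the number of frames left after subsampling must be at least the number of tokens in the transcript; see #604 for the actual code:

```python
def remove_short_and_long_utt(c):
    # Keep only cuts in a sane duration range; 1.0 s and 20.0 s are
    # illustrative thresholds, not the values used in #604.
    return 1.0 <= c.duration <= 20.0


train_cuts = train_cuts.filter(remove_short_and_long_utt)
```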
log_prepare_stage14_maxDur500_1GPU_01.txt train.py.txt
I modified the GigaSpeech recipe to train on my 1200 hrs of data (including short utterances and long utterances with segmentations). I also encountered the invalid input size error as in https://github.com/k2-fsa/icefall/issues/320 and applied AmirHussein96's solution to bypass bad batches. But after about 11,000 batches, I got a ctc_loss=0, att_loss=nan, loss=nan problem. Training command: ./conformer_ctc/train.py --max-duration 500 --num-workers 1 --world-size 1 --exp-dir conformer_ctc/exp_500 --lang-dir data/lang_bpe_500
I attach ./conformer_ctc/train.py and the log file for reference. Thank you very much for any help or suggestions.
python -m k2.version