k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Zipformer2 with CTC is hard to train #1352

Open joazoa opened 10 months ago

joazoa commented 10 months ago

I am playing a bit with the CTC option in Zipformer2, with the largest model from the documentation. It trained well on a first dataset, but when I try another dataset, the training stops.

I have tried reducing the LR, increasing the warmup period, disabling FP16, changing the max duration, removing all augmentations in lhotse, reducing the maximum file duration, removing SpecAugment and MUSAN, and changing the world size.

The same dataset works fine with the Zipformer2 transducer but does not work with Zipformer2 CTC-only. It also works fine with zipformer-ctc.

Do you have any suggestions on what I could try next?

2023-10-29 23:40:46,833 INFO [train.py:1034] (2/8) Epoch 1, batch 0, loss[loss=8.228, simple_loss=7.432, pruned_loss=7.424, ctc_loss=5.236, over 20568.00 frames. ], tot_loss[loss=8.228, simple_loss=7.432, pruned_loss=7.424, ctc_loss=5.236, over 20568.00 frames. ], batch size: 95, lr: 2.25e-02, grad_scale: 1.0 2023-10-29 23:40:46,833 INFO [train.py:1057] (2/8) Computing validation loss 2023-10-29 23:40:54,294 INFO [train.py:1066] (2/8) Epoch 1, validation: loss=inf, simple_loss=7.453, pruned_loss=7.416, ctc_loss=inf, over 901281.00 frames. 2023-10-29 23:40:54,295 INFO [train.py:1067] (2/8) Maximum memory allocated so far is 22148MB 2023-10-29 23:40:59,516 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten2, num_groups=1, num_channels=256, metric=7.26 vs. limit=5.0 2023-10-29 23:41:11,380 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.bypass.scale_min, batch_count=0.0, ans=0.9 2023-10-29 23:41:16,030 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module1.balancer2.min_abs, batch_count=106.66666666666667, ans=0.2016 2023-10-29 23:41:26,737 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.4.encoder.layers.2.feed_forward1.out_proj.dropout_p, batch_count=106.66666666666667, ans=0.29893333333333333 2023-10-29 23:41:40,934 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=192, metric=7.25 vs. limit=7.54 2023-10-29 23:41:50,438 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.conv_module1.whiten, num_groups=1, num_channels=512, metric=315.95 vs. limit=7.58 2023-10-29 23:42:01,307 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.0.whiten, num_groups=1, num_channels=768, metric=17.16 vs. limit=4.085333333333334 2023-10-29 23:42:09,032 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=37.41 vs. limit=7.62 2023-10-29 23:42:19,442 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=384, metric=165.76 vs. limit=5.08 2023-10-29 23:42:29,662 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=31.92 vs. limit=7.62 2023-10-29 23:42:40,221 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.feed_forward1.out_whiten, num_groups=1, num_channels=512, metric=433.28 vs. limit=7.66 2023-10-29 23:42:45,693 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=768, metric=595.94 vs. limit=7.82 2023-10-29 23:43:02,777 INFO [train.py:1034] (2/8) Epoch 1, batch 50, loss[loss=3.141, simple_loss=2.879, pruned_loss=2.03, ctc_loss=4.85, over 19767.00 frames. ], tot_loss[loss=inf, simple_loss=4.825, pruned_loss=4.687, ctc_loss=inf, over 918170.33 frames. ], batch size: 274, lr: 2.48e-02, grad_scale: 4.76837158203125e-07 2023-10-29 23:43:04,688 INFO [scaling.py:979] (2/8) Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=191.49 vs. limit=4.1066666666666665 2023-10-29 23:43:10,266 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.0.conv_module1.whiten, num_groups=1, num_channels=192, metric=48.81 vs. 
limit=7.7 2023-10-29 23:43:13,223 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=144, metric=4.29 vs. limit=5.133333333333334 2023-10-29 23:43:23,039 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.2.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=31.32 vs. limit=7.9 2023-10-29 23:43:41,183 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=6.31 vs. limit=4.256 2023-10-29 23:43:41,680 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=768, metric=90.25 vs. limit=7.98 2023-10-29 23:43:43,362 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.nonlin_attention.balancer.min_positive, batch_count=640.0, ans=0.2436 2023-10-29 23:43:46,978 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward3.out_whiten, num_groups=1, num_channels=768, metric=236.17 vs. limit=7.74 2023-10-29 23:43:53,663 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.0.layers.0.nonlin_attention.whiten2, num_groups=1, num_channels=192, metric=22.56 vs. limit=5.32 2023-10-29 23:44:09,088 INFO [scaling.py:199] (2/8) ScheduledFloat: name=encoder.encoders.2.encoder.layers.3.whiten.whitening_limit, batch_count=746.6666666666666, ans=4.298666666666667 2023-10-29 23:44:13,685 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=512, metric=355.68 vs. limit=8.06 2023-10-29 23:44:21,592 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.1.conv_module2.whiten, num_groups=1, num_channels=256, metric=175.08 vs. limit=7.82 2023-10-29 23:44:30,091 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=24.64 vs. limit=7.82 2023-10-29 23:44:30,502 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=768, metric=124.32 vs. limit=8.14 2023-10-29 23:44:44,328 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.2.encoder.layers.0.conv_module2.whiten, num_groups=1, num_channels=512, metric=141.67 vs. limit=7.82 2023-10-29 23:44:49,658 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.0.nonlin_attention.whiten1, num_groups=1, num_channels=576, metric=48.55 vs. limit=5.24 2023-10-29 23:44:51,541 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.1.nonlin_attention.whiten1, num_groups=1, num_channels=192, metric=7.75 vs. limit=5.24 2023-10-29 23:44:51,565 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=39.07 vs. limit=7.86 2023-10-29 23:44:54,528 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.4.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=80.86 vs. limit=7.86 2023-10-29 23:45:02,881 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.5.encoder.layers.0.whiten, num_groups=1, num_channels=256, metric=4.98 vs. limit=4.384 2023-10-29 23:45:06,051 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.3.encoder.layers.2.whiten, num_groups=1, num_channels=768, metric=15.26 vs. 
limit=4.384 2023-10-29 23:45:17,726 INFO [checkpoint.py:75] (2/8) Saving checkpoint to zipformer/exp-large-ctc-transducer/bad-model-first-warning-2.pt

danpovey commented 10 months ago

The fact that the validation loss is inf right from the start suggests to me that you may have utterances that have more symbols in their transcript than they have frames; is that possible? I don't recall whether we check for that somehow. Also, that run seems to have a grad_scale in it, which is definitely a potential issue: normally the grad_scale is used with fp16, and it should be disabled if you are not using fp16. Possibly you have inf's due to "bad" training utterances (too many symbols) and these ruin the grad-scaling logic.
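
For reference, a minimal sketch of keeping the grad scaler tied to the fp16 flag so that it stays inert in fp32 runs (the flag, model and optimizer here are illustrative, not the recipe's actual objects):

    import torch
    from torch.cuda.amp import GradScaler, autocast

    use_fp16 = False                               # illustrative flag
    model = torch.nn.Linear(10, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # enabled=False turns the scaler into a no-op, so fp32 runs never touch grad_scale
    scaler = GradScaler(enabled=use_fp16)

    x = torch.randn(4, 10, device="cuda")
    with autocast(enabled=use_fp16):
        loss = model(x).sum()
    scaler.scale(loss).backward()                  # identity scaling when enabled=False
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()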

joazoa commented 10 months ago

Thank you! I will try to check that. When using fp16, I see grad scaling, and it complains about the scale being too small when saving the files.

In the meantime I found that, for now, world size 1 and --max-duration 400 works. If I increase either one, it dies with the errors above. Would that mean that it's probably not the symbols?

This is my current status:

2023-10-30 12:38:11,329 INFO [train.py:1034] Epoch 1, batch 34250, loss[loss=0.5268, simple_loss=0.4807, pruned_loss=0.2082, ctc_loss=0.3909, over 9502.00 frames. ], tot_loss[loss=inf, simple_loss=0.449, pruned_loss=0.1802, ctc_loss=inf, over 1896721.50 frames. ], batch size: 125, lr: 2.08e-02, grad_scale: 2.0

danpovey commented 10 months ago

If you are not using fp16, make sure the grad-scaling code is deactivated. What might be happening is that if the batch size (over all workers) is too large, "bad" utterances with num-symbols > num-frames occur more often than the grad-scale's "decay period" can recover from, so the grad-scale approaches a very small value. But grad-scaling is not needed if fp16 is False. Also, there should be code somewhere which checks that the num-symbols is not more than the num-frames and discards those utterances, e.g. in the data loader code.
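
As a rough illustration, such a data-loader filter could look like the following; the subsampling formula is the one quoted later in this thread, while sp (a sentencepiece tokenizer) and train_cuts (a lhotse CutSet) are assumptions about the surrounding recipe:

    import logging

    def remove_too_long_transcripts(c):
        # frames left after the encoder front-end subsampling (formula as quoted below)
        T = ((c.num_frames - 7) // 2 + 1) // 2
        tokens = sp.encode(c.supervisions[0].text, out_type=str)
        if T < len(tokens):
            logging.warning(
                f"Excluding cut {c.id}: only {T} subsampled frames for {len(tokens)} tokens"
            )
            return False
        return True

    train_cuts = train_cuts.filter(remove_too_long_transcripts)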

csukuangfj commented 10 months ago

I just checked the code and found that we don't have such a check in this specific train.py. Zengrui is helping fix it.

EDITED: sorry, I was wrong.

joazoa commented 10 months ago

I have the remove_short_and_long_utt function in train.py, but I don't have that warning in the log files.

armusc commented 10 months ago

> I have the remove_short_and_long_utt function in train.py, but I don't have that warning in the log files.

Also look at this: https://github.com/k2-fsa/icefall/issues/1182#issuecomment-1742203390

joazoa commented 10 months ago

Thanks @armusc ! I will try additional filtering on the lengths.

danpovey commented 10 months ago

It's the ratio of num-symbols to length that would be the issue in this scenario. It would make sense to check it before computing the CTC loss.

joazoa commented 10 months ago

@armusc I tried it; nothing is getting filtered.

armusc commented 10 months ago

> @armusc I tried it; nothing is getting filtered.

I mean, even better: you are not losing any data. I guess the condition T = ((c.num_frames - 7) // 2 + 1) // 2 should already be checked in remove_short_and_long_utt, right? Are you still having issues with your training, or does it advance well?
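
As a quick sanity check of that formula with a made-up length:

    num_frames = 59                        # ~0.6 s of audio at 10 ms frames
    T = ((num_frames - 7) // 2 + 1) // 2   # -> 13 subsampled frames
    # so a transcript longer than 13 tokens cannot be aligned by CTC for this cut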

joazoa commented 10 months ago

@armusc Yes, that check is there; it also doesn't filter anything. The training only works with a smaller batch size on a single GPU so far; as soon as I change that, it breaks sooner or later.

joazoa commented 10 months ago

2023-10-30 18:43:56,867 INFO [scaling.py:979] (5/8) Whitening: name=encoder.encoders.1.encoder.layers.0.self_attn2.whiten, num_groups=1, num_channels=256, metric=13.77 vs. limit=8.22
2023-10-30 18:43:59,141 INFO [scaling.py:979] (2/8) Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=nan vs. limit=7.86

This happens right before the problem occurs.

armusc commented 10 months ago

I launched a training of the large model, and I am indeed having the same issue. I am using 3 GPUs (the most I have) with a max batch duration of 400 seconds (with 500 tokens and 24 GB of memory I can do no more than that), and the same parameters as the 148M model. In the middle of the 3rd epoch (around batch 5500/13500):

2023-10-31 00:59:22,283 INFO [scaling.py:979] (2/3) Whitening: name=encoder.encoders.3.encoder.layers.4.feed_forward1.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0
2023-10-31 00:59:23,193 INFO [scaling.py:979] (0/3) Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0

Before that, the training looked to be converging as expected (i.e. loss values pretty much comparable to completed trainings of smaller models).

danpovey commented 10 months ago

The message:

name=encoder.encoders.1.encoder.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=nan vs. limit=7.86

indicates that there are NaN's appearing in the forward() function. This could be related to fp16 or amp training; I assume you must be using this, because there are messages about grad_scale in the previous message you posted. I suspect the original problem, before this, was that you started getting bad gradients because the grad scale got too small; and the grad scale was probably getting too small because you had too-long transcripts for some utterances, which generated inf's in the loss and eventually nan grads by some mechanism. Running with --inf-check=True might generate some logs that would shed more light on the original problem.
The place I would start is to add an assertion that the CTC loss is not inf, and if it ever is, debug why. BTW, you cannot compute the CTC loss in half precision; if you are using amp you need to guard it with an appropriate enabled=False call with amp (search the code).
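
A minimal sketch of that kind of guard; the tensor names follow the snippets quoted elsewhere in this thread, and the (N, T, C) layout is an assumption:

    import torch
    import torch.nn.functional as F

    # compute the CTC loss in float32 even if the rest of the forward ran under amp
    with torch.cuda.amp.autocast(enabled=False):
        log_probs = ctc_output.float().log_softmax(dim=-1)   # (N, T, C) assumed
        ctc_loss = F.ctc_loss(
            log_probs.permute(1, 0, 2),      # torch's CTC expects (T, N, C)
            targets,                         # padded (N, S) targets assumed
            input_lengths=encoder_out_lens,
            target_lengths=y_lens,
            blank=0,
            reduction="sum",
        )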

joazoa commented 10 months ago

@danpovey this last run where the metric is nan was without fp16. I will try --inf-check=True and the assertion.

armusc commented 10 months ago

I never have inf in the ctc_loss but rather nan; I have inf in the other losses:

2023-10-30 02:28:11,664 INFO [train.py:1068] (2/3) Epoch 3, batch 5550, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9353.00 frames. ], tot_loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 1820564.14 frames. ], batch size: 51, lr: 1.97e-02,

The above is the first batch where I can see that.

joazoa commented 10 months ago

This is the output with --inf-check=True:

in compute_loss
simple_loss, pruned_loss, ctc_loss = model( ...

module.output[2] is not finite: (tensor(23070.8750, device='cuda:7', grad_fn=), tensor(29302.8672, device='cuda:7', grad_fn=), tensor(inf, device='cuda:7', grad_fn=))

joazoa commented 10 months ago

I added some more debugging; at some point there is one CTC loss that is inf, and the next log for the batch will have tot_loss ctc inf.

The ctc_loss is inf because of encoder output: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,

x_covarsq_mean_diag: tensor(nan, device='cuda:0', grad_fn=)
x_covar_mean_diag: tensor(nan, device='cuda:0', grad_fn=)
Metric is NaN!
x_covar: tensor([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], device='cuda:0', grad_fn=)

When I resume from a checkpoint, I suppose it starts from batch 0 again? It will break after the same number of batches as when it stopped last time. (It doesn't crash immediately at this nan, but keeps going.)

I also see the losses slowly go up after resuming from the checkpoint. The checkpoint is from when I stopped it manually after 200,000+ batches that trained properly on the same card, with the same settings.

Can you tell me what to look for next ?

joazoa commented 10 months ago

2023-11-01 01:37:03,879 INFO [train.py:1034] Epoch 1, batch 1550, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9476.00 frames. ], tot_loss[loss=nan, simple_loss=i nf, pruned_loss=inf, ctc_loss=nan, over 1903295.87 frames. ], batch size: 107, lr: 8.15e-03, 2023-11-01 01:37:04,453 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=152500.0, ans=0.125 2023-11-01 01:37:05,277 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=152500.0, ans=0.125 2023-11-01 01:37:08,821 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=152506.66666666666, ans=0.125 2023-11-01 01:37:11,755 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=22.5 2023-11-01 01:37:11,771 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:14,415 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:14,925 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:15,302 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=152513.33333333334, ans=15.0 2023-11-01 01:37:16,675 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:17,023 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=152513.33333333334, ans=0.0 2023-11-01 01:37:17,171 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:20,021 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:20,036 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:20,977 INFO [scaling.py:993] Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=nan vs. limit=15.0 2023-11-01 01:37:21,415 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:22,689 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:22,829 INFO [scaling.py:993] Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=nan vs. 
limit=22.5 2023-11-01 01:37:25,658 INFO [optim.py:471] Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.889e+02 nan nan nan, threshold=nan, percent-clipped=0.0 2023-11-01 01:37:26,970 INFO [checkpoint.py:75] Saving checkpoint to zipformer/exp-large-ctc-transducer3/checkpoint-228800.pt 2023-11-01 01:37:30,664 WARNING [checkpoint.py:299] Invalid checkpoint filename zipformer/exp-large-ctc-transducer3/checkpoint-226000_broken.pt 2023-11-01 01:37:31,178 INFO [train.py:1034] Epoch 1, batch 1600, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9765.00 frames. ], tot_loss[loss=nan, simple_loss=i nf, pruned_loss=inf, ctc_loss=nan, over 1905160.45 frames. ], batch size: 42, lr: 8.15e-03, 2023-11-01 01:37:31,254 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152533.33333333334, ans=0.1 2023-11-01 01:37:32,779 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:33,521 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152533.33333333334, ans=0.125 2023-11-01 01:37:34,587 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.3.whiten.whitening_limit, batch_count=152533.33333333334, ans=12.0 2023-11-01 01:37:35,613 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=576, metric=nan vs. limit=10.0 2023-11-01 01:37:38,542 INFO [scaling.py:993] Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=nan vs. limit=15.0 2023-11-01 01:37:39,234 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=152540.0, ans=0.125 2023-11-01 01:37:40,278 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=152546.66666666666, ans=0.04949747468305833 2023-11-01 01:37:47,997 INFO [scaling.py:993] Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=nan vs. limit=6.0 2023-11-01 01:37:50,992 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152560.0, ans=0.125 2023-11-01 01:37:51,090 INFO [scaling.py:993] Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=nan vs. limit=22.5 2023-11-01 01:37:52,592 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=22.5 2023-11-01 01:37:55,143 INFO [train.py:1034] Epoch 1, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9609.00 frames. ], tot_loss[loss=nan, simple_loss=i nf, pruned_loss=inf, ctc_loss=nan, over 1905201.41 frames. ], batch size: 39, lr: 8.14e-03, 2023-11-01 01:37:55,406 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:56,932 INFO [scaling.py:993] Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=nan vs. 
limit=8.0 2023-11-01 01:37:58,759 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:59,220 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:38:00,806 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152573.33333333334, ans=0.1 2023-11-01 01:38:02,757 INFO [scaling.py:1084] WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=nan 2023-11-01 01:38:03,656 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=152573.33333333334, ans=0.0

2023-11-01 01:36:42,503 INFO [train.py:1034] Epoch 1, batch 1500, loss[loss=0.2885, simple_loss=0.3269, pruned_loss=0.09162, ctc_loss=0.1671, over 9095.00 frames. ], tot_loss[loss=0.331 2, simple_loss=0.3636, pruned_loss=0.1066, ctc_loss=0.2142, over 1904292.31 frames. ], batch size: 22, lr: 8.15e-03, 2023-11-01 01:36:43,457 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.3.encoder.layers.3.feed_forward1.out_proj.dropout_p, batch_count=152466.66666666666, ans=0.1 2023-11-01 01:36:51,071 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.3.encoder.layers.1.balancer1.prob, batch_count=152480.0, ans=0.125 2023-11-01 01:36:52,868 INFO [scaling.py:993] Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=12.74 vs. limit=22.5 2023-11-01 01:36:54,851 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.bypass.scale_min, batch_count=152486.66666666666, ans=0.2 2023-11-01 01:36:55,023 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.4.self_attn1.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=22.5 2023-11-01 01:36:55,273 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.nonlin_attention.balancer.prob, batch_count=152486.66666666666, ans=0.125 2023-11-01 01:36:56,221 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.bypass_mid.scale_min, batch_count=152486.66666666666, ans=0.2 2023-11-01 01:36:57,016 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.conv_module1.balancer1.prob, batch_count=152486.66666666666, ans=0.125 2023-11-01 01:36:57,647 INFO [scaling.py:993] Whitening: name=encoder.encoders.0.layers.1.feed_forward1.out_whiten, num_groups=1, num_channels=192, metric=nan vs. limit=15.0 2023-11-01 01:37:00,144 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.2.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:03,003 INFO [scaling.py:203] ScheduledFloat: name=encoder_embed.convnext.out_balancer.prob, batch_count=152493.33333333334, ans=0.125 2023-11-01 01:37:03,879 INFO [train.py:1034] Epoch 1, batch 1550, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9476.00 frames. ], tot_loss[loss=nan, simple_loss=i nf, pruned_loss=inf, ctc_loss=nan, over 1903295.87 frames. ], batch size: 107, lr: 8.15e-03, 2023-11-01 01:37:04,453 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.3.feed_forward2.hidden_balancer.prob, batch_count=152500.0, ans=0.125 2023-11-01 01:37:05,277 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.2.conv_module2.balancer1.prob, batch_count=152500.0, ans=0.125 2023-11-01 01:37:08,821 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.0.layers.1.conv_module1.balancer1.prob, batch_count=152506.66666666666, ans=0.125 2023-11-01 01:37:11,755 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=22.5 2023-11-01 01:37:11,771 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.2.feed_forward1.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:14,415 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=nan vs. 
limit=15.0 2023-11-01 01:37:14,925 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.conv_module2.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:15,302 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.5.encoder.layers.1.feed_forward3.out_whiten.whitening_limit, batch_count=152513.33333333334, ans=15.0 2023-11-01 01:37:16,675 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:17,023 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.0.ff2_skip_rate, batch_count=152513.33333333334, ans=0.0 2023-11-01 01:37:17,171 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.1.feed_forward2.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:20,021 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.3.self_attn1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:20,036 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:20,977 INFO [scaling.py:993] Whitening: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_whiten, num_groups=1, num_channels=256, metric=nan vs. limit=15.0 2023-11-01 01:37:21,415 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.2.self_attn1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:22,689 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.2.self_attn2.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=22.5 2023-11-01 01:37:22,829 INFO [scaling.py:993] Whitening: name=encoder.encoders.0.layers.0.self_attn2.whiten, num_groups=1, num_channels=192, metric=nan vs. limit=22.5 2023-11-01 01:37:25,658 INFO [optim.py:471] Clipping_scale=2.0, grad-norm quartiles 2.203e+02 3.889e+02 nan nan nan, threshold=nan, percent-clipped=0.0 2023-11-01 01:37:26,970 INFO [checkpoint.py:75] Saving checkpoint to zipformer/exp-large-ctc-transducer3/checkpoint-228800.pt 2023-11-01 01:37:30,664 WARNING [checkpoint.py:299] Invalid checkpoint filename zipformer/exp-large-ctc-transducer3/checkpoint-226000_broken.pt 2023-11-01 01:37:31,178 INFO [train.py:1034] Epoch 1, batch 1600, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9765.00 frames. ], tot_loss[loss=nan, simple_loss=i nf, pruned_loss=inf, ctc_loss=nan, over 1905160.45 frames. ], batch size: 42, lr: 8.15e-03, 2023-11-01 01:37:31,254 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152533.33333333334, ans=0.1 2023-11-01 01:37:32,779 INFO [scaling.py:993] Whitening: name=encoder.encoders.4.encoder.layers.3.feed_forward2.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:33,521 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward3.hidden_balancer.prob, batch_count=152533.33333333334, ans=0.125 2023-11-01 01:37:34,587 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.4.encoder.layers.3.whiten.whitening_limit, batch_count=152533.33333333334, ans=12.0 2023-11-01 01:37:35,613 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.nonlin_attention.whiten1, num_groups=1, num_channels=576, metric=nan vs. 
limit=10.0 2023-11-01 01:37:38,542 INFO [scaling.py:993] Whitening: name=encoder.encoders.1.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=256, metric=nan vs. limit=15.0 2023-11-01 01:37:39,234 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.2.encoder.layers.3.nonlin_attention.balancer.prob, batch_count=152540.0, ans=0.125 2023-11-01 01:37:40,278 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.5.encoder.layers.0.bypass.skip_rate, batch_count=152546.66666666666, ans=0.04949747468305833 2023-11-01 01:37:47,997 INFO [scaling.py:993] Whitening: name=encoder.encoders.1.encoder.layers.1.self_attn_weights.whiten_keys, num_groups=4, num_channels=128, metric=nan vs. limit=6.0 2023-11-01 01:37:50,992 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.3.encoder.layers.0.feed_forward3.hidden_balancer.prob, batch_count=152560.0, ans=0.125 2023-11-01 01:37:51,090 INFO [scaling.py:993] Whitening: name=encoder.encoders.5.encoder.layers.1.self_attn2.whiten, num_groups=1, num_channels=256, metric=nan vs. limit=22.5 2023-11-01 01:37:52,592 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.self_attn2.whiten, num_groups=1, num_channels=768, metric=nan vs. limit=22.5 2023-11-01 01:37:55,143 INFO [train.py:1034] Epoch 1, batch 1650, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9609.00 frames. ], tot_loss[loss=nan, simple_loss=i nf, pruned_loss=inf, ctc_loss=nan, over 1905201.41 frames. ], batch size: 39, lr: 8.14e-03, 2023-11-01 01:37:55,406 INFO [scaling.py:993] Whitening: name=encoder.encoders.3.encoder.layers.3.feed_forward3.out_whiten, num_groups=1, num_channels=768, metric=nan vs. limit=15.0 2023-11-01 01:37:56,932 INFO [scaling.py:993] Whitening: name=encoder_embed.out_whiten, num_groups=1, num_channels=192, metric=nan vs. limit=8.0 2023-11-01 01:37:58,759 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.2.conv_module1.whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:37:59,220 INFO [scaling.py:993] Whitening: name=encoder.encoders.2.encoder.layers.1.feed_forward3.out_whiten, num_groups=1, num_channels=512, metric=nan vs. limit=15.0 2023-11-01 01:38:00,806 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.feed_forward1.out_proj.dropout_p, batch_count=152573.33333333334, ans=0.1 2023-11-01 01:38:02,757 INFO [scaling.py:1084] WithLoss: name=encoder.encoders.4.encoder.layers.0.self_attn_weights, loss-sum=nan 2023-11-01 01:38:03,656 INFO [scaling.py:203] ScheduledFloat: name=encoder.encoders.1.encoder.layers.0.conv_skip_rate, batch_count=152573.33333333334, ans=0.0 Some symbol sequences are longer than their corresponding frame sequences! tensor([ 4, 13, 23, 17, 20, 23, 16, 30, 13, 16, 14, 12, 10, 17, 15, 10, 20, 26, 14, 15, 17, 18, 7, 20, 19, 12, 28, 26, 19, 20, 8, 10, 16, 10, 18, 14, 26, 11, 16, 15, 36, 21, 28, 17, 12, 16, 11, 4, 7, 19, 18, 15, 12, 22, 17, 9, 19, 13, 7, 15, 11, 15, 16, 18, 13, 11, 10, 6, 4, 11, 18, 13, 15, 14, 19, 13, 6, 10, 12, 6, 15, 13, 19, 13, 16, 17, 18, 11, 11, 14, 14, 13, 12, 14, 21, 12, 19, 14, 10, 10, 15, 11, 17, 11, 51, 20, 15, 3, 15, 14, 10, 18, 21, 4, 7, 15, 9, 8, 15, 13, 5, 10, 14, 22, 10, 15, 13, 14, 17, 11, 21, 12, 5, 10, 13, 16, 14, 17, 15, 12, 21, 6, 8, 13, 11, 12, 10, 9, 15, 15, 12, 19, 12, 7, 6, 17, 15, 17, 16, 10, 13, 14, 5, 5, 26, 14, 13, 25, 17, 20, 15, 14, 12], dtype=torch.int32) tensor(True)

The last print is from:

    if (y_lens > encoder_out_lens.cpu()).any():
        print("Some symbol sequences are longer than their corresponding frame sequences!")
        print(y_lens)
        print(encoder_out_lens.cpu().any())

desh2608 commented 10 months ago

This suggests that there are some utterances where the label sequence is longer than the frame sequence. You earlier mentioned that you have this filter in place but it does not filter anything? You should check whether there is some issue in applying the filter or elsewhere.

danpovey commented 10 months ago

This

ValueError: The sum of module.output[2] is not finite: (tensor(23070.8750, device='cuda:7', grad_fn=), tensor(29302.8672, device='cuda:7', grad_fn=), tensor(inf, device='cuda:7', grad_fn=))

makes me think that you might be able to help the issue by updating icefall. For some time now, this has been a warning, not an error. Over time we have resolved issues that might lead to infinities. It's better if you use the latest zipformer, the one in the zipformer/ directory.

joazoa commented 10 months ago

Thank you for the help! @desh2608 the error about the label sequence being longer than the frame sequence happens 100 batches after everything is already nan or inf. No such warnings appear earlier.

@danpovey The icefall checkout is from last week; this is the check that hits (I added some extra comments to make sure it was using the proper file):

    def forward_hook(_module, _input, _output, _name=name):
        if isinstance(_output, Tensor):
            if not torch.isfinite(_output.to(torch.float32).sum()):
                raise ValueError(
                    f"The sum of {_name}.output is not finite: {_output}"
                )
        elif isinstance(_output, tuple):
            for i, o in enumerate(_output):
                if isinstance(o, tuple):
                    o = o[0]
                if not isinstance(o, Tensor):
                    continue
                if not torch.isfinite(o.to(torch.float32).sum()):
                    raise ValueError(  # <-- THIS ONE
                        f"The sum of {_name}.output[{i}] is not finite: {_output}"
                    )

This is in the forward_hook; the backward_hook is the one that was changed to a warning in this commit: https://github.com/k2-fsa/icefall/commit/60317120caef28a43ed904a26263c57536ca95ab

danpovey commented 10 months ago

I can't find now where you showed us the ValueError message about the nan or inf. Didn't it have a stack trace printed also?

danpovey commented 10 months ago

BTW, we should figure out why exactly the inf's are being generated (e.g. it might be utterances with too many symbols, and that should be checked for), but if you apply #1359 it might prevent it from ruining the model; it changes the gradient-clipping logic in the optimizer to get rid of infinities and nan's in the gradients.
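
For reference, the general idea of making the optimizer robust to such gradients is roughly the following (a sketch only, not the actual change in that PR):

    import torch

    def zero_nonfinite_grads(model: torch.nn.Module) -> None:
        """Replace inf/nan gradients with zeros so that a single bad batch
        cannot poison the parameters; call after backward(), before step()."""
        for p in model.parameters():
            if p.grad is not None:
                p.grad.nan_to_num_(nan=0.0, posinf=0.0, neginf=0.0)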

joazoa commented 10 months ago

@danpovey this is the output:

Traceback (most recent call last):
  File "./zipformer/train.py", line 1431, in <module>
    main()
  File "./zipformer/train.py", line 1424, in main
    run(rank=0, world_size=1, args=args)
  File "./zipformer/train.py", line 1302, in run
    train_one_epoch(
  File "./zipformer/train.py", line 952, in train_one_epoch
    loss, loss_info = compute_loss(
  File "./zipformer/train.py", line 796, in compute_loss
    simple_loss, pruned_loss, ctc_loss = model(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1547, in _call_impl
    hook_result = hook(self, args, result)
  File "/home/icefall/icefall/hooks.py", line 51, in forward_hook
    raise ValueError(
ValueError: The sum of .output[2] is not finite: (tensor(4292.3564, device='cuda:0', grad_fn=), tensor(1690.2781, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=))

I will try the patch and let you know, thank you!!

joazoa commented 10 months ago

I applied the patch; I'm still getting this:

2023-11-01 13:58:32,175 INFO [train.py:1034] Epoch 1, batch 1550, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 9476.00 frames. ], tot_loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 1903295.87 frames. ], batch size: 107, lr: 8.15e-03,

It didn't crash yet, but I suppose it will eventually. I tried a new test from scratch on 8 GPUs; same issue:

2023-11-01 14:09:26,282 INFO [train.py:1034] Epoch 1, batch 50, loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 11128.00 frames. ], tot_loss[loss=nan, simple_loss=inf, pruned_loss=inf, ctc_loss=nan, over 457389.42 frames. ], batch size: 46, lr: 2.48e-02,

It doesn't crash, but it does not train either. I will let it run for a bit to see if it dies eventually or recovers. (Update: it doesn't crash. I see occasional messages about the symbols exceeding the length, but that is long after the metric becomes nan.)

danpovey commented 10 months ago

If it is consistently giving nan's and inf's in the loss, then the model is corrupted. Run with the inf check and try to debug why the ctc loss is producing nan's. e.g. find out something about the utterance involved: its length, the symbol-sequence length, maybe the symbols.
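
One way to get that per-utterance information is to recompute the CTC loss unreduced and log the offending entries; a sketch, with tensor names as in the earlier sketches in this thread:

    import torch
    import torch.nn.functional as F

    per_utt = F.ctc_loss(
        log_probs.permute(1, 0, 2), targets,
        input_lengths=encoder_out_lens, target_lengths=y_lens,
        blank=0, reduction="none",
    )
    for i in torch.nonzero(~torch.isfinite(per_utt)).flatten().tolist():
        print(f"non-finite CTC loss for utt {i}: "
              f"frames={int(encoder_out_lens[i])}, tokens={int(y_lens[i])}")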

If you do not filter out symbol sequences that are longer than the num-frames, it is expected to generate inf's in the CTC loss. If you have applied the optim.py patch and the train.py has "clipping_scale=2.0" passed into ScaledAdam, then I am a little confused why the optimizer is putting the nan's/inf's into the model. Note, it should print something like "Scaling gradients by 0.0... " when it gets inf/nan gradients and should otherwise continue fairly normally.

Also, if you are seeing these problems after resuming from a checkpoint it is possible that the inf's or nan's are already in the model. If so you would see inf's/nan's right from the start of the training after resuming. In that case it's a bad checkpoint, you cannot use it.

You could perhaps run right from the start with --inf-check=True. It is strange that you got inf's even in the first validation set that it tested, presumably with a fresh model. That might indicate an issue with data lengths. We need to debug right from the very first point that inf's or nan's appear, with the --inf-check=True. If you have a model that's trained for longer and eventually goes bad, the output of the --print-diagnostics=True option would be interesting to see as well, e.g. attach a file. Could do that on a trained model before it goes bad, to see if any values are getting way too large. Also let us know if you changed the training code.

joazoa commented 10 months ago

What I found so far: in train.py I have

    T = ((c.num_frames - 7) // 2 + 1) // 2
    if T < len(tokens):

and this does not seem to catch any bad samples.

But when I add this in model.py

    for i, (y_len, encoder_out_len) in enumerate(zip(y_lens, encoder_out_lens.cpu())):
        if y_len > encoder_out_len:

I still find samples where the label sequence is longer than the frame sequence, which causes the loss to be inf. If I remove them from the batch before calculating the CTC loss, it passes, but eventually the metric still becomes nan.

I will dig deeper tomorrow (it takes a while to get to the problematic batch, so progress is rather slow).
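
A sketch of the workaround described above, i.e. dropping over-long utterances from the batch just before the CTC loss; the padded (N, S) targets and the (T, N, C) log-prob layout are assumptions, and all tensors are assumed to be on the same device:

    import torch.nn.functional as F

    keep = y_lens <= encoder_out_lens            # boolean mask over the batch
    if not keep.all():
        print(f"dropping {int((~keep).sum())} utterances whose transcript is too long")
    ctc_loss = F.ctc_loss(
        log_probs[:, keep, :],                   # (T, N, C) layout assumed
        targets[keep],
        input_lengths=encoder_out_lens[keep],
        target_lengths=y_lens[keep],
        blank=0,
        reduction="sum",
    )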

danpovey commented 10 months ago

OK, thanks! Let's persevere with finding the underlying problem. The --inf-check option does not make it very much slower, and it's probably a good idea to run with it to catch infinities at their source. There is something else which confuses me though, which is that if you are using the current egs/librispeech/ASR/zipformer/optim.py (and note that there may be different versions of optim.py in different directories), the infinities actually should not propagate into the model parameters, because they should be caught and zeroed out by the gradient clipping code, assuming that is activated, e.g. by the option clipping_scale=2.0.

Oh, wait... I see from your "quartiles" message that because over half the grads were nan, the median gradient was nan and the grad-clipping cutoff became nan. This is a case that we didn't consider. (Incidentally, this means that a lot of your data may have bad lengths, which could indicate some kind of problem, or maybe you need to subsample less, if they are speaking super fast.)

@zr_jin can you change the code to detect this and exit? E.g. in optim.py, after

    median = quartiles[2]

you can add:

    if median - median != 0:
        raise RuntimeError("Too many grads were not finite")

so that if the middle quartile is nan, it goes back to the original behaviour of stopping with an error rather than silently corrupting the model.

JinZr commented 10 months ago

hi @joazoa, please check if this PR works https://github.com/k2-fsa/icefall/pull/1366

thank you!

joazoa commented 10 months ago

Thanks! I will certainly give it a try by Monday.

In the meantime, some more observations:

Some batches have an inf CTC loss, but when I calculate the CTC loss for the individual samples, sometimes none of them are inf.

In that case, if I hardcode the ctc_loss to something fixed (like 500), the metric does not end up as nan and I can train successfully. (I ran it for 8 hours on 8 GPUs with a large max duration and it survived and trained.)
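
A sketch of that per-sample workaround, i.e. keeping only the finite entries when reducing the batch loss (same assumed tensor names as the earlier sketches; this papers over the symptom rather than fixing the root cause):

    import torch
    import torch.nn.functional as F

    per_utt = F.ctc_loss(
        log_probs.permute(1, 0, 2), targets,
        input_lengths=encoder_out_lens, target_lengths=y_lens,
        blank=0, reduction="none",
    )
    finite = torch.isfinite(per_utt)
    ctc_loss = per_utt[finite].sum()             # inf/nan samples contribute nothing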

@dan the --inf-check is giving me some deprecation (IIRC) warnings (I don't have the logs with me while writing this); they are flooding the terminal. Do you happen to know how I can disable them?

danpovey commented 10 months ago

Is it a streaming model? It's certainly possible that we have some bug with the padding that causes such a problem. Also let us know if you have made any changes to the zipformer model code. Do a test with the torch CTCLoss to see whether, if you have an input that has infinities in the padding frames, it produces an infinite loss. It might be necessary to address this. We do expect to sometimes have infinities in the padding frames for the streaming model; but typically this would not matter.
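
A minimal version of that test, with synthetic shapes; whether the loss and its gradients stay finite is exactly what the experiment is meant to reveal:

    import torch
    import torch.nn.functional as F

    T, N, C = 10, 1, 5
    log_probs = torch.randn(T, N, C).log_softmax(-1)
    log_probs[6:] = float("-inf")            # pretend frames 6..9 are padding gone bad
    log_probs.requires_grad_(True)

    targets = torch.tensor([[1, 2, 3]])
    input_lengths = torch.tensor([6])        # CTC is told to use only the first 6 frames
    target_lengths = torch.tensor([3])

    loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                      blank=0, reduction="sum")
    loss.backward()
    print(loss, torch.isfinite(log_probs.grad).all())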

joazoa commented 10 months ago

I'm using the train.py from the zipformer2 recipe in the librispeech egs, with the large-model values from the librispeech zipformer2 results document, but with CTC and transducer instead of only transducer. The only changes are related to the different dataset (I use a modified gigaspeech dataloader) and to the debugging code for this issue.

It's the offline model, no streaming. I will try the CTCLoss test by Monday. Thank you!

danpovey commented 10 months ago

... also, let us know what warning message the inf-check is giving.

joazoa commented 10 months ago

It seems to work, thank you! No more nan. I still check that the batch loss is not infinite, and if it is, I remove any samples where the CTC loss is inf (because of the symbol length versus the number of frames).

How could I filter those samples earlier? The dataloader check doesn't seem to work for me.

KarelVesely84 commented 1 month ago

Hello, I am encountering the same problem. I started from the librispeech scripts; I have a recipe with custom data.

The inf appears after batch 4000:

2024-08-09 15:52:20,722 WARNING [optim.py:488] (3/4) Clipping_scale=2.0, grad-norm quartiles 1.373e+02 1.872e+02 2.234e+02 3.157e+02 6.377e+02, threshold=4.469e+02, percent-clipped=6.0
2024-08-09 15:52:20,741 INFO [train.py:1174] (3/4) Epoch 1, batch 4000, loss[loss=0.4561, ctc_loss=0.4561, over 6714.00 frames. ], tot_loss[loss=0.5191, ctc_loss=0.5191, over 1406780.97 fra
2024-08-09 15:52:24,230 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.feed_forward1.out_proj.dropout_p, batch_count=8000.0, ans=0.22
2024-08-09 15:52:27,297 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder.encoders.0.layers.1.attention_skip_rate, batch_count=8000.0, ans=0.03333333333333334
2024-08-09 15:52:37,109 INFO [scaling.py:214] (3/4) ScheduledFloat: name=encoder_embed.convnext.hidden_balancer.prob, batch_count=8040.0, ans=0.125
2024-08-09 15:52:39,843 INFO [scaling.py:1024] (3/4) Whitening: name=encoder.encoders.4.encoder.layers.0.feed_forward2.out_whiten, num_groups=1, num_channels=256, metric=9.08 vs. limit=10.515
2024-08-09 15:52:47,429 INFO [scaling.py:1120] (3/4) WithLoss: name=encoder.encoders.2.encoder.layers.0.self_attn_weights, loss-sum=0.000e+00
...
2024-08-09 15:52:59,528 INFO [train.py:1110] (3/4) Caught exception: The sum of module.output[2] is not finite: (tensor([]), tensor([]), tensor(inf, device='cuda:3', grad_fn=<SumBackward0>), tensor([])).
2024-08-09 15:52:59,528 INFO [checkpoint.py:75] (3/4) Saving checkpoint to zipformer/exp_en-EU_streaming_small_ctc_fp32_lr004/bad-model-3.pt
2024-08-09 15:53:00,776 INFO [train.py:1487] (3/4) Saving batch to zipformer/exp_en-EU_streaming_small_ctc_fp32_lr004/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
2024-08-09 15:53:01,044 INFO [train.py:1493] (3/4) features shape: torch.Size([85, 353, 80])
2024-08-09 15:53:01,048 INFO [train.py:1497] (3/4) num tokens: 1350
W0809 15:54:24.128000 22379984619328 torch/multiprocessing/spawn.py:146] Terminating process 13664 via signal SIGTERM
W0809 15:54:24.129000 22379984619328 torch/multiprocessing/spawn.py:146] Terminating process 13665 via signal SIGTERM
W0809 15:54:24.129000 22379984619328 torch/multiprocessing/spawn.py:146] Terminating process 13666 via signal SIGTERM
...
ValueError: The sum of module.output[2] is not finite: (tensor([]), tensor([]), tensor(inf, device='cuda:3', grad_fn=<SumBackward0>), tensor([]))

It is CTC-only training, fp32; base_lr is one of (0.04, 0.004, 0.0004).

The default data-filtering rule is if T < len(tokens):, and #1182 mentions the filtering rule if T * 2./3. < len(tokens):.

Could that be the source? I.e., the edge case of T == len(tokens) not being filtered out? (I am now running the if T * 2./3. < len(tokens): version of the utterance filtering to see what happens.)

Best regards, Karel

danpovey commented 1 month ago

The batch was saved to disk so you may be able to inspect it to see if there was anything odd there. Remember that T gets subsampled at some point, and there are edge effects.

KarelVesely84 commented 1 month ago

Hi Dan, the inspection of the batch gives the following output:

num_embeddings = tensor([87, 85, 85, 85, 84, 84, 82, 82, 82, 82, 81, 81, 81, 81, 81, 80, 80, 80,
        79, 79, 79, 79, 79, 79, 78, 78, 78, 78, 77, 77, 76, 76, 76, 76, 76, 75,
        75, 75, 73, 73, 71, 71, 71, 71, 70, 70, 69, 69, 69, 68, 68, 68, 67, 67,
        66, 65, 64, 64, 64, 62, 62, 61, 61, 60, 58, 57, 57, 55, 55, 54, 53, 49,
        49, 49, 45, 43, 42, 41, 37, 36, 34, 34, 33, 29, 27], dtype=torch.int32)
supervision_lengths = tensor([19., 22., 12., 15., 18., 18., 24., 21., 17., 13., 10.,  9., 16., 18.,
        14., 10., 23., 14., 21., 22., 22., 15., 19., 21., 18., 14., 20., 21.,
        23., 13., 14., 23., 13., 13., 20., 16., 20., 25.,  8., 13., 14.,  8.,
        12., 16., 15., 14., 52., 22., 22.,  9., 15., 18., 26., 11., 10., 14.,
        13., 16., 14.,  7., 14., 15., 16., 14., 11., 10., 18., 12., 11., 13.,
        15., 13., 29., 19., 10., 10., 12.,  4., 13.,  9., 33., 13.,  9.,  6.,
         6.])
num_embeddings - supervision_lengths = tensor([68., 63., 73., 70., 66., 66., 58., 61., 65., 69., 71., 72., 65., 63.,
        67., 70., 57., 66., 58., 57., 57., 64., 60., 58., 60., 64., 58., 57.,
        54., 64., 62., 53., 63., 63., 56., 59., 55., 50., 65., 60., 57., 63.,
        59., 55., 55., 56., 17., 47., 47., 59., 53., 50., 41., 56., 56., 51.,
        51., 48., 50., 55., 48., 46., 45., 46., 47., 47., 39., 43., 44., 41.,
        38., 36., 20., 30., 35., 33., 30., 37., 24., 27.,  **1.**, 21., 24., 23.,
        21.])

For some reason, for CTC there have to be no more than T - 2 supervision tokens per utterance (tested empirically).

Does that also make sense theoretically? (I.e., are <bos> <eos> symbols added to each supervision for CTC?)

K.
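
For reference on the theory: a CTC alignment needs at least one output frame per label plus one extra frame for every pair of adjacent repeated labels (a blank has to separate the repeats), and on top of that there are the subsampling edge effects Dan mentions above, so needing a small margin such as T - 2 in practice is plausible. A tiny check of the textbook condition, independent of icefall:

    def min_ctc_frames(tokens):
        """Smallest T for which a CTC alignment of `tokens` exists: one frame
        per label, plus a separating blank between equal neighbours."""
        repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
        return len(tokens) + repeats

    print(min_ctc_frames(list("hello")))   # 6: the double 'l' needs a blank in between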

joazoa commented 1 month ago

@KarelVesely84 I have vague memories that I managed to make it work by ignoring batches that led to an inf loss.

KarelVesely84 commented 1 month ago

@joazoa Yes, and I can prevent the inf loss value by filtering with the if (T - 2) < len(tokens): rule... ;-) (empirically tested) Best, K.
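
In code, the stricter rule amounts to the following change inside the cut filter (same illustrative names as the earlier filter sketch; not necessarily identical to the PR linked below):

    T = ((c.num_frames - 7) // 2 + 1) // 2
    if (T - 2) < len(tokens):      # leave a small margin for subsampling edge effects
        return False               # drop this cut
    return True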

joazoa commented 1 month ago

Great! (Sent you an invite on LinkedIn btw :)

KarelVesely84 commented 1 month ago

Created a PR for that: https://github.com/k2-fsa/icefall/pull/1713