k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

max_duration for zipformer with about 6800 tokens #1182

Open 1215thebqtic opened 1 year ago

1215thebqtic commented 1 year ago

Hi,

I'm training on 80k hours of data using the default zipformer from the wenetspeech recipe. The vocabulary size is about 6800 tokens (Chinese characters + English BPE) and GPU memory is 32 GB. max_duration can only be set below 150, otherwise the GPU runs out of memory. The default value in the wenetspeech recipe is 650, with a vocabulary size of about 5500.

I'm wondering whether this is normal, since this max_duration is far less than the default of 650. Thanks!

csukuangfj commented 1 year ago

What's the distribution of your utterance length?

1215thebqtic commented 1 year ago

What's the distribution of your utterance length?

Duration statistics (seconds): mean 2.2, std 1.5, min 0.1, 25% 1.2, 50% 1.8, 75% 2.7, 99% 7.6, 99.5% 8.8, 99.9% 14.6.

During training, utterances <1 s or >20 s are filtered out. In addition, even with max_duration=150, training exits after 8000 iterations because the GPU runs out of memory...

pzelasko commented 1 year ago

Since you have 80k hours, you could probably discard everything >10 s, as it's less than 0.5% of your data. You might also tune the dynamic bucketing sampler settings: setting quadratic_duration to something between 15 and 30 would let you increase max_duration by 50-100%, and increasing num_buckets a bit can also help slightly.
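For reference, a minimal sketch of what this tuning might look like with lhotse's DynamicBucketingSampler (the manifest path and exact values are illustrative, and quadratic_duration requires a reasonably recent lhotse):

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# Hypothetical manifest path; substitute your own training cuts.
cuts = CutSet.from_file("data/fbank/train_cuts.jsonl.gz")

# Drop the rare cuts longer than 10 s (well under 0.5% of the data here).
cuts = cuts.filter(lambda c: c.duration <= 10.0)

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=300.0,       # total seconds of audio per batch
    quadratic_duration=20.0,  # penalize long cuts quadratically when packing batches
    num_buckets=50,           # a few extra buckets can pack batches more tightly
    shuffle=True,
)
```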

danpovey commented 1 year ago

I suggest that you make sure your k2 is up to date. We fixed a bug at some point that affects memory usage in cases where the vocab size is large. BTW, in the future I hope to make Chinese ASR systems with BPE using --character_coverage=0.98 and --byte_fallback=True, so that small vocab sizes like 500 can be used.
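For illustration, a hedged sketch of training such a model with SentencePiece; the file paths are hypothetical, and only the two flags mentioned above come from the comment:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="data/lang/transcript_words.txt",  # hypothetical transcript file
    model_prefix="data/lang_bpe_500/bpe",    # hypothetical output prefix
    model_type="bpe",
    vocab_size=500,
    character_coverage=0.98,  # leave the rarest characters out of the vocab
    byte_fallback=True,       # encode uncovered characters as UTF-8 bytes
)
```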

1215thebqtic commented 1 year ago

Since you have 80k hours you could probably discard everything >10s as it's less than 0.5% of your data. You might also tune the dynamic bucketing sampler settings: setting quadratic_duration to sth between 15-30 would let you increase max duration by 50-100%, also increasing num_buckets a bit can slightly help.

Thanks! It works, and I also upgraded k2 to the latest version. Now the config is max_duration = 300, quadratic_duration = 20, and everything >10 s is discarded.

In addition, GPU memory usage is imbalanced across GPUs. What is the cause of this phenomenon? Is this normal? Thanks!

(screenshot: k2-gpumem)

pzelasko commented 1 year ago

I’d say keep tweaking max duration and maybe quadratic duration too until you squeeze out everything you can from the GPU memory.

armusc commented 11 months ago

Hi, I'm not sure this is the right thread, but it still concerns the latest zipformer training, so...

I'm trying zipformer2 training (essentially with the librispeech setup), but training stops very early in the warm-up stage because of NaN/inf values:

ValueError: The sum of module.output[2] is not finite: (tensor(2281.6624, device='cuda:0', grad_fn=), tensor(2331.6426, device='cuda:0', grad_fn=), tensor(inf, device='cuda:0', grad_fn=))

The code is up to date. I rule out problems in data preparation because the cuts are the same ones I use for zipformer training (i.e. pruned_transducer_stateless7_ctc) with no issue. I use the CTC loss together with the transducer loss (CTC weight = 0.2), but even with the transducer loss only, training diverges very early. I use 500 BPE tokens, but I cannot go up to the batch durations shown in the recipes here: with 24 GB I use 200 seconds (with 300 I already get OOM). I have played with base learning rates but have not found anything that makes it work. Is this something that might depend on the batch duration, according to your experiments?

desh2608 commented 11 months ago

@armusc Can you show the output of lhotse cut describe <your-train-cuts>? Have you filtered out very long cuts? Also, I would check if there are some cuts with very large label sequences --- those can also cause nan/inf issues sometimes.

armusc commented 11 months ago

Cuts count: 3791756
Total duration (hh:mm:ss): 4204:18:35
Speech duration (hh:mm:ss): 4204:18:35 (100.5%)
Duration statistics (seconds): mean 4.0, std 7.3, min -448.7, 25% 1.1, 50% 2.4, 75% 5.2, 99% 20.9, 99.5% 28.1, 99.9% 47.0, max 3728.8
Recordings available: 3791756
Features available: 3791756
Supervisions available: 3791412

Right now I can't figure out what the negative number means as a min duration. Anyway, I am filtering between 0.3 and 55.0 seconds. Yeah, I can definitely go below 55 seconds; I'll try and see what happens. With the first zipformer I was not having issues.

desh2608 commented 11 months ago

You can save the batch where the loss becomes NaN/inf and see if it looks different. @DongjiGao recently found that this can happen if the label sequence is quite long (for example, greater than 2/3 the length of the input sequence).
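A generic sketch (not icefall's actual training loop) of dumping the offending batch for offline inspection, assuming the batch is a picklable dict:

```python
import torch

def dump_batch_if_nonfinite(loss: torch.Tensor, batch: dict, path: str = "bad-batch.pt") -> None:
    """Save the batch to disk if the loss is NaN/inf so it can be inspected later."""
    if not torch.isfinite(loss):
        torch.save(batch, path)  # reload later with torch.load(path)
        raise RuntimeError(f"Non-finite loss encountered; batch saved to {path}")
```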

armusc commented 11 months ago

The batch where it stopped contained short utterances (from 0.3 to 0.7 seconds) anyway. As you suggested, I removed all utterances where the number of tokens is greater than 2/3 of the feature length after subsampling. So far, so good. Maybe this empirical finding should be highlighted somewhere? A sketch of the filter I used is below.
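A minimal sketch of that heuristic as a lhotse filter, assuming ~10 ms frames and roughly 4x subsampling in the encoder frontend (the exact formula in icefall's train.py may differ; paths are hypothetical):

```python
import sentencepiece as spm
from lhotse import CutSet

sp = spm.SentencePieceProcessor(model_file="data/lang_bpe_500/bpe.model")  # hypothetical BPE model

def keep_cut(cut) -> bool:
    # Approximate number of encoder frames after ~4x subsampling.
    num_frames = cut.num_frames if cut.num_frames is not None else int(cut.duration * 100)
    T = ((num_frames - 7) // 2 + 1) // 2
    tokens = sp.encode(cut.supervisions[0].text, out_type=int)
    # Drop cuts whose label sequence exceeds 2/3 of the (subsampled) input length.
    return len(tokens) <= (2 * T) // 3

cuts = CutSet.from_file("data/fbank/train_cuts.jsonl.gz")  # hypothetical manifest
cuts = cuts.filter(keep_cut)
```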

desh2608 commented 11 months ago

@armusc How much data do you have left after this filtering? I hope this heuristic did not result in throwing away a lot of data. Another way to reduce the label sequence length would be to increase your vocabulary size (especially if you have punctuation, casing, etc.)

armusc commented 11 months ago

I threw away very little data, it's not even 0.5%

DongjiGao commented 11 months ago

@armusc The short utterances may contain empty text. For filtering, I referred to this script: https://github.com/k2-fsa/icefall/blob/109354b6b8199fa27cd8d4310b59a2e45da1d537/egs/librispeech/ASR/conformer_ctc2/train.py#L929

armusc commented 11 months ago

I don't have empty text for any utterances; it would probably stop somewhere during data preparation if that were the case, wouldn't it? That heuristic in remove_invalid_utt_ctc leads to an enormous reduction of the corpus, from about 27k batches to just 5k per epoch.