k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

How to train with multiple datasets of different styles? #1035

Closed treya-lin closed 1 year ago

treya-lin commented 1 year ago

Hi, I am really hoping I can get some advice here. I am currently planning to train a model for a low-resource language with my data, and the resources I have gathered are:

  1. a few hundred hours of read speech corpus,
  2. a few hundred hours of conversational speech corpus,
  3. 1k+ hours of open-source corpus from various sources and across various domains like audiobook, ted talk, and mostly command control.

Previously with Kaldi, I usually trained the GMM alignment model first on the tidier, easier data (the read speech and command-and-control) for a few rounds, and only added the harder data (conversational and public speech) in the final round of GMM training. After that I would mix all the data, apply some augmentation, and use everything to train the chain model.

When it comes to end-to-end training, I am not sure how to make use of datasets of different styles. I read the discussion at https://github.com/lhotse-speech/lhotse/issues/554 and understood that you have done some work combining GigaSpeech and LibriSpeech, but since GigaSpeech is much larger than LibriSpeech, I guess the situation is a bit different here?

I agree with what Dan said about the effect of different normalization rules. So, to keep my first model simple, I will skip the public corpora and use only the read speech corpus and the conversational corpus, hoping to get workable performance on both the read and conversational domains. The two styles have similar amounts of data, about 300-400 hours each. I have finished preparing the manifests for these corpora, but I am not sure how to proceed.

So, is there a recipe that could serve as a guide, or do you have any advice on how to organize and use different styles of data for training? Any insight would be greatly appreciated. Thanks!

yfyeung commented 1 year ago

@treya-lin Maybe you can follow this PR https://github.com/k2-fsa/icefall/pull/1020 (LibriSpeech + GigaSpeech + CommonVoice).

  1. Prepare the different datasets separately with Lhotse.
  2. Add a cut preparation recipe mutidataset.py, and use `CutSet.mux(librispeech_cuts, gigaspeech_cuts, commonvoice_cuts, weights=[len(librispeech_cuts), len(gigaspeech_cuts), len(commonvoice_cuts)])` to get the training CutSet.
  3. Load the CutSet into the Dataloader in asrdatamodule.py.
  4. Make small modifications to train.py to support multi-dataset training.
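The weighted sampling in step 2 can be sketched in plain Python. This is a simplified, hypothetical stand-in for Lhotse's `CutSet.mux` (the real API works lazily on CutSets), meant only to illustrate how the per-corpus weights make each next example come from a corpus with probability proportional to its size; the corpus names below are made up for the sketch.

```python
import random

def mux(*sources, weights, seed=0):
    """Interleave several iterables, drawing each next item from a source
    chosen with probability proportional to its weight. A simplified
    illustration of what CutSet.mux does, not the Lhotse implementation."""
    rng = random.Random(seed)
    iters = [iter(s) for s in sources]
    weights = list(weights)
    while iters:
        i = rng.choices(range(len(iters)), weights=weights)[0]
        try:
            yield next(iters[i])
        except StopIteration:
            # This source is exhausted; keep sampling from the rest.
            del iters[i]
            del weights[i]

# Hypothetical corpora of different sizes, stand-ins for cut manifests.
read = [f"read-{n}" for n in range(300)]
conv = [f"conv-{n}" for n in range(400)]

# Weight each corpus by its size, as in the recipe above.
mixed = list(mux(read, conv, weights=[len(read), len(conv)]))
```

Because the weights are proportional to corpus size, the mixed stream is roughly uniform over all examples, so neither corpus dominates early in an epoch.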
danpovey commented 1 year ago

Yes, I don't think you need to separate out easy vs. hard data any more.

treya-lin commented 1 year ago

> @treya-lin Maybe you can follow this PR #1020 (LibriSpeech + GigaSpeech + CommonVoice).
>
>   1. Prepare the different datasets separately with Lhotse.
>   2. Add a cut preparation recipe mutidataset.py, and use `CutSet.mux(librispeech_cuts, gigaspeech_cuts, commonvoice_cuts, weights=[len(librispeech_cuts), len(gigaspeech_cuts), len(commonvoice_cuts)])` to get the training CutSet.
>   3. Load the CutSet into the Dataloader in asrdatamodule.py.
>   4. Make small modifications to train.py to support multi-dataset training.

Oh I see! Thanks!

treya-lin commented 1 year ago

> Yes, I don't think you need to separate out easy vs. hard data any more.

I see! Thanks!