mailong25 / self-supervised-speech-recognition

Speech to text with self-supervised learning based on the wav2vec 2.0 framework

Pretraining larger models? #15

Open adithyaur99 opened 3 years ago

adithyaur99 commented 3 years ago

"Please ensure that the architectures match.".format(filename) Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Can I pretrain the different versions of wav2vec with the same code?

mailong25 commented 3 years ago

Yes, you can, but you won't be able to leverage the existing pretrained model (training from scratch is computationally expensive). If you want a larger model, my recommendation is to use the pretrained large model from https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt.
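A minimal sketch of fetching that checkpoint (the local file name is arbitrary; wget or curl would do the same job), so you can point --init_model at it in the next step:

import urllib.request

# Download the large wav2vec 2.0 checkpoint recommended above; the destination
# file name is an arbitrary choice, then pass it to pretrain.py via --init_model.
url = "https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt"
urllib.request.urlretrieve(url, "wav2vec_vox_new.pt")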

For pre-training: point the --init_model arg to the large model and decrease the batch size to avoid OOM problems (see the override sketch after the code below). Replace these lines:

cmd.append("+optimization.update_freq='[" + str(int(64/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_base_librispeech")

with:

cmd.append("+optimization.update_freq='[" + str(int(128/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_large_librivox")
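To decrease the batch size, one option, sketched here under the assumption that the per-batch token budget is what overflows memory, is to lower fairseq's dataset.max_tokens via a Hydra override; 1,200,000 is only an illustrative value, not a recommendation:

# Assumed override: cap the per-GPU batch size (counted in audio samples) to avoid OOM.
# Keep the value larger than the number of samples in your longest wav file,
# otherwise data loading will fail with a length assertion.
cmd.append("dataset.max_tokens=1200000")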

For fine-tuning: edit this line: cmd.append("--config-name " + config_name), replacing the config_name variable with the exact config you want from conf/finetuning (e.g. vox_100h, vox_10h, ...), as shown in the sketch below.
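For illustration only, if vox_100h were the config you want (any other name under conf/finetuning works the same way), the edited line could simply hard-code it:

# Hypothetical edit: hard-code the large-model fine-tuning config.
# Swap "vox_100h" for whichever file under conf/finetuning matches your labelled data.
cmd.append("--config-name vox_100h")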

adithyaur99 commented 3 years ago

When I try to decrease the batch size, I end up with this error.

AssertionError: Sentences lengths should not exceed max_tokens=120000

TaridaGeorge commented 3 years ago

As I've found out through testing, max_tokens should not be less than 16000 * the number of seconds of your longest wav file, i.e. the number of audio samples in that file at a 16 kHz sample rate.
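A minimal sketch of that rule, assuming the wav files are already 16 kHz mono and using only the standard-library wave module (the directory path is a placeholder):

import wave
from pathlib import Path

def min_max_tokens(wav_dir):
    # max_tokens is counted in audio samples, so the longest file needs
    # sample_rate * duration_in_seconds tokens to fit into a single batch.
    longest = 0
    for path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(path), "rb") as f:
            # getnframes() gives the sample count; at 16 kHz mono this equals
            # 16000 * duration_in_seconds.
            longest = max(longest, f.getnframes())
    return longest

print(min_max_tokens("/path/to/your/wavs"))  # choose max_tokens >= this value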