adithyaur99 opened this issue 3 years ago:

Can I pretrain the different versions of wav2vec with the same code?
Yes, you can do it, but you won't be able to leverage the pretrained model (training from scratch is computationally expensive). If you want a larger model, my recommendation is to use the pretrained large model from https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt.
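For reference, a minimal way to fetch that checkpoint using only the Python standard library (wget or curl work just as well):

```python
# Download the pretrained large (LibriVox) wav2vec 2.0 checkpoint.
import urllib.request

url = "https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_new.pt"
urllib.request.urlretrieve(url, "wav2vec_vox_new.pt")
```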
For pre-training: point the --init_model arg to the large model, and try to decrease the batch size to avoid OOM problems. Replace these lines:

```python
cmd.append("+optimization.update_freq='[" + str(int(64/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_base_librispeech")
```

with:

```python
cmd.append("+optimization.update_freq='[" + str(int(128/NUM_GPU)) + "]'")
cmd.append("--config-name wav2vec2_large_librivox")
```
For fine-tuning: edit this line:

```python
cmd.append("--config-name " + config_name)
```

replacing the config_name variable with the exact config you want from conf/finetuning (e.g. vox_100h, vox_10h, ...).
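For example, hard-coding the 100-hour large-model recipe from that list would look like this (vox_100h is one of the names cited above):

```python
cmd = []  # stands in for the script's argument list
# Use the large-model 100-hour fine-tuning config from conf/finetuning:
cmd.append("--config-name vox_100h")
```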
When I try to decrease the batch size, I end up with this error.
```
AssertionError: Sentences lengths should not exceed max_tokens=120000
```
As I've found out through testing, max_tokens should not be less than 16000 * nr_of_seconds_of_your_longest_wav_file: since the audio is 16 kHz, max_tokens is effectively a cap on the sample count of a single clip.
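A quick way to check your own data against that rule (the soundfile package and the data/ path are assumptions, not part of the repo):

```python
# Find the longest clip and derive the minimum safe max_tokens from it.
import glob

import soundfile as sf

longest = max(sf.info(path).frames for path in glob.glob("data/*.wav"))
print(f"Longest clip: {longest} samples -> set max_tokens >= {longest}")
```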
"Please ensure that the architectures match.".format(filename) Exception: Cannot load model parameters from checkpoint /content/self-supervised-speech-recognition/wav2vec_small_960h.pt; please ensure that the architectures match.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
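That error usually means the checkpoint and the selected config disagree: wav2vec_small_960h.pt is the base (small) model fine-tuned on 960 h, so it will not load under a large/vox config. A hedged sketch for inspecting what a fairseq checkpoint was trained with (the key names vary between fairseq versions):

```python
# Print the architecture/config stored inside a fairseq checkpoint so it
# can be compared against the --config-name being used.
import torch

ckpt = torch.load("wav2vec_small_960h.pt", map_location="cpu")
cfg = ckpt.get("cfg") or ckpt.get("args")  # newer vs. older fairseq
print(cfg)
```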