utter-project / fairseq

This is a fork of the original fairseq repository (version 0.12.2) with added classes for training mHuBERT-147.

How to do continuous pretraining for the 3rd iteration mHuBERT-147? #1

nervjack2 opened this issue 2 months ago

nervjack2 commented 2 months ago

Hi, thank you for the excellent work! I would like to ask whether it is possible to do continuous pretraining from the 3rd-iteration mHuBERT-147 checkpoint. Specifically, what should I add to this training script to do continuous pretraining? https://github.com/utter-project/fairseq/blob/main/examples/mHuBERT-147/scripts/run_train_2nd_iter.sh Thank you so much!

mzboito commented 2 months ago

Hello again :)

For continuous pretraining, it should work almost without any modification (see the command sketch after the list):

  1. Checkpoint: make sure to put the fairseq checkpoint from here into a folder, and pass that folder as the "checkpoint.save_dir" argument (you might need to rename the checkpoint to checkpoint_last.pt so that it is automatically loaded, I'm not sure).

  2. Data: I'm assuming you are using data different from the mHuBERT-147 training data. In this case, you need to set checkpoint.reset_dataloader=True. Also note that HuBERT training requires a "dict.km.txt".

  3. Optimizer: you might want to reset your optimizer with checkpoint.reset_optimizer=True.

  4. Config file (https://github.com/utter-project/fairseq/blob/main/examples/mHuBERT-147/config/pretrain/mhubert147_base.yaml): change fp16 to fp32. Since resetting the optimizer takes you back to step 0, set your desired number of training steps in the optimization section.
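For reference, here is a minimal sketch of a fairseq-hydra-train call with these overrides, assuming run_train_2nd_iter.sh wraps the standard HuBERT pretraining recipe; the paths, label settings, and number of updates are placeholders to adapt to your setup:

# continuous pretraining from the released checkpoint (sketch; adjust placeholders)
fairseq-hydra-train \
  --config-dir examples/mHuBERT-147/config/pretrain \
  --config-name mhubert147_base \
  task.data=/path/to/manifests \
  task.label_dir=/path/to/km_labels \
  task.labels='["km"]' \
  checkpoint.save_dir=/path/to/dir_containing_checkpoint_last.pt \
  checkpoint.reset_dataloader=True \
  checkpoint.reset_optimizer=True \
  optimization.max_update=100000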

ajesujoba commented 1 month ago

@mzboito, thanks. Can you also share the exact script you used for converting the fairseq checkpoint to HF?

mzboito commented 3 weeks ago

You need to copy the json config file from the huggingface repo: https://huggingface.co/utter-project/mHuBERT-147/blob/main/config.json Make sure the settings actually correspond to your model (e.g. fp16 vs fp32).
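For example, the config file can be fetched directly from the hub:

wget https://huggingface.co/utter-project/mHuBERT-147/resolve/main/config.json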

Then you can use this script here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/hubert/convert_hubert_original_pytorch_checkpoint_to_pytorch.py

I usually call it using:

python transformers/src/transformers/models/hubert/convert_hubert_original_pytorch_checkpoint_to_pytorch.py --pytorch_dump_folder_path <folder where the checkpoint is> --checkpoint_path <checkpoint file> --not_finetuned --config_path <the config.json>
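Once converted, the dump folder can be loaded with transformers as a quick sanity check, for example:

python -c "from transformers import HubertModel; HubertModel.from_pretrained('<folder where the checkpoint is>')"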