It can be difficult to follow the original instructions on how to finetune mBART. This is what I did to finetune it for English-Japanese and Japanese-English translation.
Some of these packages may be outdated for the current version of fairseq. If you find an issue, contributions are welcome.
I started from a clean installation of Python 3.7 and a fresh virtual environment:
python3.7 -m venv nlp
source nlp/bin/activate
pip install torch
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
The sentencepiece encoder used by the scripts below, spm_encode, is located at
/usr/local/bin/spm_encode
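If spm_encode is not on your machine yet, building sentencepiece from source is one way to get it into /usr/local/bin. A minimal sketch, assuming cmake and a C++ toolchain are available:

# Build and install the sentencepiece command-line tools (spm_encode, spm_decode).
git clone https://github.com/google/sentencepiece.git
cd sentencepiece && mkdir build && cd build
cmake .. && make -j "$(nproc)"
sudo make install && sudo ldconfig   # ldconfig refreshes the shared-library cache
cd ../..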
For finetuning on Japanese we use the WikiMatrix and JParaCrawl corpora, plus the pretrained mBART CC25 checkpoint.
wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ja.tsv.gz
wget http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/release/2.0/bitext/en-ja.tar.gz
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz
tar -xzvf mbart.CC25.tar.gz
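The other two downloads also need to be unpacked before the preparation step. Roughly (directory and file names inside the archives are from memory, so check them after extracting):

gunzip WikiMatrix.en-ja.tsv.gz   # tab-separated: margin score, English sentence, Japanese sentence
tar -xzvf en-ja.tar.gz           # JParaCrawl bitext (tab-separated sentence pairs with scores)
ls mbart.cc25                    # the mBART directory should contain model.pt, dict.txt and sentence.bpe.model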
python prepare_data.py
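prepare_data.py is specific to this repo and is not reproduced here. As a rough sketch of what the step has to produce, using only the WikiMatrix TSV (column 2 is English, column 3 is Japanese): plain-text, line-aligned train and valid files for each side. A real run would also shuffle and clean the pairs first.

# Split the WikiMatrix TSV into two line-aligned plain-text files.
cut -f2 WikiMatrix.en-ja.tsv > wikimatrix.en
cut -f3 WikiMatrix.en-ja.tsv > wikimatrix.ja
# Hold out a few thousand pairs for validation; the rest is training data.
head -n 4000 wikimatrix.en > valid.en ; tail -n +4001 wikimatrix.en > train.en
head -n 4000 wikimatrix.ja > valid.ja ; tail -n +4001 wikimatrix.ja > train.ja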
sh run_sentencepiece.sh
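run_sentencepiece.sh applies mBART's own sentencepiece model to every split. A minimal sketch, assuming the checkpoint directory mbart.cc25 and the train/valid files from the previous step (note the language-code suffixes on the output files, explained below):

# Encode each side of each split with the pretrained mBART sentencepiece model.
SPM=/usr/local/bin/spm_encode
MODEL=mbart.cc25/sentence.bpe.model
for split in train valid; do
  $SPM --model=$MODEL < $split.en > $split.spm.en_XX
  $SPM --model=$MODEL < $split.ja > $split.spm.ja_XX
done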
sh fairseq-preprocess.sh
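fairseq-preprocess.sh binarizes the encoded files with mBART's pretrained dictionary so that token ids match the checkpoint. Roughly, under the same file-name assumptions as above:

# Binarize with the mBART dictionary; both sides share the same vocabulary.
DICT=mbart.cc25/dict.txt
fairseq-preprocess \
  --source-lang en_XX --target-lang ja_XX \
  --trainpref train.spm --validpref valid.spm \
  --destdir data-bin/enja \
  --srcdict $DICT --tgtdict $DICT \
  --thresholdsrc 0 --thresholdtgt 0 \
  --workers 8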
Make sure that the lang names on the files are en_XX and ja_XX, not en and ja.
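If your encoded files ended up with plain .en/.ja suffixes, renaming them is enough; for example:

# Rename the suffixes to the mBART language codes expected by fairseq-preprocess.
for split in train valid; do
  mv $split.spm.en $split.spm.en_XX
  mv $split.spm.ja $split.spm.ja_XX
done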
To train the reverse direction, you need to swap the SRC and TGT languages. Just run:
sh train.sh
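train.sh wraps the actual fairseq-train call. A minimal sketch of it, using the hyperparameters from the official fairseq mBART finetuning recipe (they are not tuned for this data, and some flag names shift between fairseq versions); langs must list all 25 pretraining language codes:

# Finetune the pretrained mBART checkpoint on the binarized en_XX-ja_XX data.
PRETRAIN=mbart.cc25/model.pt
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
SRC=en_XX   # swap SRC and TGT for the reverse direction
TGT=ja_XX
fairseq-train data-bin/enja \
  --arch mbart_large --task translation_from_pretrained_bart --langs $langs \
  --source-lang $SRC --target-lang $TGT \
  --encoder-normalize-before --decoder-normalize-before --layernorm-embedding \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler polynomial_decay --lr 3e-05 --warmup-updates 2500 --total-num-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --restore-file $PRETRAIN --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
  --log-format simple --log-interval 100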
See load_checkpoint.py for how to load and use the finetuned checkpoint.
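Alternatively, the finetuned model can be tested straight from the command line with fairseq-generate. A sketch following the standard fairseq mBART recipe (this is not what load_checkpoint.py does; it assumes $langs is still set to the list above, and the BPE/scoring flags vary slightly across fairseq versions):

# Translate the binarized valid split with the finetuned checkpoint.
fairseq-generate data-bin/enja \
  --path checkpoints/checkpoint_best.pt \
  --task translation_from_pretrained_bart --langs $langs \
  --source-lang en_XX --target-lang ja_XX --gen-subset valid \
  --bpe sentencepiece --sentencepiece-model mbart.cc25/sentence.bpe.model \
  --remove-bpe sentencepiece \
  --sacrebleu --batch-size 32 --beam 5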