tmramalho / finetune-mbart

How to finetune mBART using fairseq

mBART English-Japanese finetuning

The original instructions for finetuning mBART can be difficult to follow, so this is what I did to finetune it for English-Japanese and Japanese-English translation.

Some of these packages may be outdated for the current version of fairseq. If you find an issue, contributions are welcome.

Install dependencies

I started from a clean installation of Python 3.7 in a virtualenv.

python3.7 -m venv nlp      # create a virtualenv named nlp
source nlp/bin/activate
pip install torch          # the PyPI package for PyTorch is torch, not pytorch
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./  # install fairseq in editable mode

sentencepiece is also required; when built from source, the spm_encode binary is installed at /usr/local/bin/spm_encode.
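
If sentencepiece is not installed yet, a typical source build (following the steps in the sentencepiece README) looks like this:

git clone https://github.com/google/sentencepiece.git
cd sentencepiece
mkdir build && cd build
cmake ..
make -j $(nproc)
sudo make install
sudo ldconfig -v   # refresh the shared-library cache so spm_encode finds libsentencepiece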

Download the data (Japanese)

For finetuning on Japanese we use WikiMatrix and JParaCrawl.

wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-ja.tsv.gz
wget http://www.kecl.ntt.co.jp/icl/lirg/jparacrawl/release/2.0/bitext/en-ja.tar.gz
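
Both archives need unpacking, and the WikiMatrix file is a single TSV (margin score, English sentence, Japanese sentence on each line) that has to be split into two parallel files. A minimal sketch; the output names wikimatrix.en_XX and wikimatrix.ja_XX are my own choice:

gunzip WikiMatrix.en-ja.tsv.gz
tar -xzvf en-ja.tar.gz
# column 1 is the alignment score; columns 2 and 3 are the sentence pair
awk -F'\t' '{print $2}' WikiMatrix.en-ja.tsv > wikimatrix.en_XX
awk -F'\t' '{print $3}' WikiMatrix.en-ja.tsv > wikimatrix.ja_XX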

Download the checkpoint

wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz
tar -xzvf mbart.CC25.tar.gz

Preprocessing

Make sure that the language codes in the file suffixes are en_XX and ja_XX (the codes mBART was pretrained with), not en and ja.
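
The usual pipeline is to tokenize both sides with the sentencepiece model shipped in the checkpoint, then binarize with fairseq-preprocess against the pretrained dictionary so the token IDs still line up with the pretrained embeddings. A minimal sketch, assuming the checkpoint extracted to mbart.cc25 and the corpora were already merged and split into train/valid/test files such as train.en_XX:

MODEL=mbart.cc25/sentence.bpe.model
DICT=mbart.cc25/dict.txt
# apply the pretrained sentencepiece model to every split and side
for split in train valid test; do
  for lang in en_XX ja_XX; do
    spm_encode --model=$MODEL < $split.$lang > $split.spm.$lang
  done
done
# binarize with the pretrained dictionary for both source and target
fairseq-preprocess \
  --source-lang en_XX --target-lang ja_XX \
  --trainpref train.spm --validpref valid.spm --testpref test.spm \
  --destdir data-bin \
  --srcdict $DICT --tgtdict $DICT \
  --workers 8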

Train

Just run the script below; to train the reverse direction, swap the SRC and TGT languages inside it.

sh train.sh
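
train.sh itself isn't reproduced here; for reference, the finetuning command it wraps generally follows the mBART example in the fairseq README. A sketch under those assumptions (data-bin and mbart.cc25 as above; the hyperparameters are fairseq's published ones, not necessarily this repo's):

# the 25 languages of the mbart.cc25 checkpoint, needed by the task
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN
fairseq-train data-bin \
  --encoder-normalize-before --decoder-normalize-before \
  --arch mbart_large --layernorm-embedding \
  --task translation_from_pretrained_bart \
  --source-lang en_XX --target-lang ja_XX \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler polynomial_decay --lr 3e-05 \
  --warmup-updates 2500 --total-num-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --weight-decay 0.0 \
  --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 \
  --keep-interval-updates 10 --no-epoch-checkpoints \
  --seed 222 --log-format simple --log-interval 2 \
  --restore-file mbart.cc25/model.pt \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --langs $langs \
  --ddp-backend no_c10d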

Evaluate

See load_checkpoint.py
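
Besides the script, evaluation can also be run from the command line with fairseq-generate, following the fairseq mBART example (same path assumptions and $langs variable as in the training sketch; depending on the fairseq version the flag may be --sentencepiece-vocab instead of --sentencepiece-model):

fairseq-generate data-bin \
  --path checkpoints/checkpoint_best.pt \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  --source-lang en_XX --target-lang ja_XX \
  --bpe sentencepiece --sentencepiece-model mbart.cc25/sentence.bpe.model \
  --remove-bpe sentencepiece \
  --batch-size 32 --langs $langs > en_ja.out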