kenji-imamura / recycle_bert

Sample code for the paper published at WNGT-2019.

Recycling a Pre-trained BERT Encoder for Neural Machine Translation (Imamura and Sumita, 2019)

This example replaces the Transformer-based fairseq encoder with a pre-trained BERT encoder. It implements the paper (https://www.aclweb.org/anthology/D19-5603/) on top of fairseq v0.9.0.

Example usage

This example assumes English-German parallel data under $CORPUS, for example laid out as follows.
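The layout below is only an illustration inferred from the commands used later in this README (train.en is tokenized below, and the valid subset is used for evaluation); the actual file names in the original example may differ.

$CORPUS/train.en  $CORPUS/train.de    # training set (source / target)
$CORPUS/valid.en  $CORPUS/valid.de    # validation set (source / target)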

Requirements

This example is based on fairseq and uses the transformers library to apply pre-trained BERT models. If you convert BERT models from TensorFlow to PyTorch, the TensorFlow library is also required.

pip3 install fairseq
pip3 install transformers
pip3 install tensorflow

Directories / paths

#! /bin/bash
export BERT_MODEL=./uncased_L-12_H-768_A-12   # pre-trained BERT model
export CODE=./user_code                       # user code loaded via --user-dir
export CORPUS=./corpus                        # raw and tokenized corpora
export DATA=./data                            # binarized data for fairseq
export MODEL_STAGE1=./model.stage1            # stage-1 checkpoints (decoder training)
export MODEL_STAGE2=./model.stage2            # stage-2 checkpoints (fine-tuning)
export PYTHONPATH="$CODE:$PYTHONPATH"

Conversion

To use pre-trained BERT models released for TensorFlow in the fairseq translator, they first have to be converted into PyTorch models.
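For instance, the conversion can be done with the converter bundled in the transformers library. The command below is a sketch assuming a Google-style TensorFlow checkpoint (bert_model.ckpt and bert_config.json inside $BERT_MODEL); it may differ from the script shipped with this repository or from the CLI of your transformers version.

# Convert the TensorFlow checkpoint into a PyTorch model
# (sketch; verify the exact command for your transformers version).
transformers-cli convert --model_type bert \
    --tf_checkpoint $BERT_MODEL/bert_model.ckpt \
    --config $BERT_MODEL/bert_config.json \
    --pytorch_dump_output $BERT_MODEL/pytorch_model.bin

# transformers loads the model configuration under the name config.json.
cp $BERT_MODEL/bert_config.json $BERT_MODEL/config.json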

Tokenization

The source side of each corpus is tokenized and split into sub-words using the BERT tokenizer.

cat $CORPUS/train.en \
    | python3 $CODE/bert_tokenize.py \
          --model=$BERT_MODEL > $CORPUS/train.bpe.en

Binarization

First, the vocabulary file in the BERT model (vocab.txt) is converted into a fairseq dictionary.

Then, the tokenized corpora are converted into binary data for fairseq; a sketch of both steps follows.
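The commands below are only a sketch. The vocabulary converter is assumed to live under $CODE (the name convert_vocab.py is an assumption, not necessarily the script name used in this repository), and the corpus file names follow the tokenization step above.

mkdir -p $DATA

# Convert the BERT vocabulary into a fairseq dictionary
# (convert_vocab.py is a hypothetical name; use the converter provided in $CODE).
python3 $CODE/convert_vocab.py < $BERT_MODEL/vocab.txt > $DATA/dict.en.txt

# Binarize the tokenized corpora, keeping the fixed BERT dictionary on the source side.
fairseq-preprocess -s en -t de \
    --trainpref $CORPUS/train.bpe --validpref $CORPUS/valid.bpe \
    --destdir $DATA --srcdict $DATA/dict.en.txt \
    --workers 8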

Training stage 1 (decoder training)

In the first stage of training, only the decoder is trained while the BERT encoder is kept frozen.

mkdir -p $MODEL_STAGE1
fairseq-train $DATA -s en -t de \
    --user-dir $CODE --task translation_with_bert \
    --bert-model $BERT_MODEL \
    --arch transformer_with_pretrained_bert \
    --no-progress-bar --log-format simple \
    --log-interval 1800 \
    --max-tokens 5000 --update-freq 4 \
    --max-epoch 20 \
    --optimizer adam --lr 0.0004 --adam-betas '(0.9, 0.99)' \
    --label-smoothing 0.1 --clip-norm 5 \
    --dropout 0.15 \
    --min-lr '1e-09' --lr-scheduler inverse_sqrt \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --warmup-updates 45000 --warmup-init-lr '1e-07' \
    --save-dir $MODEL_STAGE1

Training stage 2 (fine-tuning)

In stage 2, the entire model, including the BERT encoder, is fine-tuned.

mkdir -p $MODEL_STAGE2
fairseq-train $DATA -s en -t de \
    --user-dir $CODE --task translation_with_bert \
    --bert-model $BERT_MODEL \
    --arch transformer_with_pretrained_bert \
    --fine-tuning \
    --restore-file $MODEL_STAGE1/checkpoint_best.pt \
    --reset-lr-scheduler --reset-meters --reset-optimizer \
    --no-progress-bar --log-format simple \
    --log-interval 1800 \
    --max-tokens 5000 --update-freq 4 \
    --max-epoch 60 \
    --optimizer adam --lr 0.00008 --adam-betas '(0.9, 0.99)' \
    --label-smoothing 0.1 --clip-norm 5 \
    --dropout 0.15 \
    --min-lr '1e-09' --lr-scheduler inverse_sqrt \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy \
    --warmup-updates 9000 --warmup-init-lr '1e-07' \
    --save-dir $MODEL_STAGE2

Evaluation

When you run fairseq-generate or fairseq-interactive, you must specify --user-dir $CODE, --task translation_with_bert, and --bert-model $BERT_MODEL.

fairseq-generate $DATA -s en -t de \
    --user-dir $CODE --task translation_with_bert \
    --bert-model $BERT_MODEL \
    --no-progress-bar \
    --gen-subset valid \
    --path $MODEL_STAGE2/checkpoint_best.pt \
    --lenpen 1.0 \
    --beam 10 --batch-size 32

Citation

@inproceedings{imamura-sumita-2019-recycling,
  title     = "Recycling a Pre-trained {BERT} Encoder for Neural Machine Translation",
  author    = "Imamura, Kenji and Sumita, Eiichiro",
  booktitle = "Proceedings of the 3rd Workshop on Neural Generation and Translation",
  publisher = "Association for Computational Linguistics",
  pages     = "23--31",
  month     = nov,
  year      = 2019,
  address   = "Hong Kong",
  url       = "https://www.aclweb.org/anthology/D19-5603/",
}

Acknowledgement

This work was supported by the "Research and Development of Enhanced Multilingual and Multipurpose Speech Translation Systems," a program of the Ministry of Internal Affairs and Communications, Japan.