Abstract
Proposes pre-training the encoder and decoder of a seq2seq model with the weights of two separately trained language models
An additional language modeling loss is used to regularize the model during fine-tuning
SOTA on WMT 2014 English->German
Details
Basic Methodology
Train LM_src and LM_tgt on source- and target-side monolingual data
Initialize the embedding layer and first LSTM layer of both the encoder and decoder, plus the decoder's softmax layer, with the corresponding pre-trained LM weights (as sketched below)
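A minimal PyTorch sketch of how this weight transplant could look; the module names (LM, Seq2Seq, init_from_lms) and layer sizes are my own placeholders, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10000, 512, 512  # illustrative sizes

class LM(nn.Module):
    """Word-level LSTM language model (one LSTM layer for brevity)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.softmax = nn.Linear(HID, VOCAB)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.softmax(h)

class Seq2Seq(nn.Module):
    """Encoder-decoder whose lowest layers mirror the LM architecture."""
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB, EMB)
        self.enc_lstm1 = nn.LSTM(EMB, HID, batch_first=True)   # from LM_src
        self.enc_lstm2 = nn.LSTM(HID, HID, batch_first=True)   # random init
        self.tgt_embed = nn.Embedding(VOCAB, EMB)
        self.dec_lstm1 = nn.LSTM(EMB, HID, batch_first=True)   # from LM_tgt
        self.dec_lstm2 = nn.LSTM(HID, HID, batch_first=True)   # random init
        self.softmax = nn.Linear(HID, VOCAB)                    # from LM_tgt

def init_from_lms(model: Seq2Seq, lm_src: LM, lm_tgt: LM):
    """Copy embedding / first-LSTM / softmax weights from the two LMs."""
    model.src_embed.load_state_dict(lm_src.embed.state_dict())
    model.enc_lstm1.load_state_dict(lm_src.lstm.state_dict())
    model.tgt_embed.load_state_dict(lm_tgt.embed.state_dict())
    model.dec_lstm1.load_state_dict(lm_tgt.lstm.state_dict())
    model.softmax.load_state_dict(lm_tgt.softmax.state_dict())
```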
Improvement
Add the monolingual LM losses during fine-tuning to preserve the pre-trained LM feature extractors (alternatively, one can freeze the pre-trained weights for a few epochs and then train all weights); see the sketch after this list
Residual connections, so that noise from the randomly initialized upper layers does not corrupt the pre-trained weights early in training
Multi-layer attention, letting the decoder extract both low- and high-level contextual information from the encoder
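A rough sketch of the joint fine-tuning objective under this scheme, again in PyTorch; the equal weighting via lm_weight and the flattened-logits batching are assumptions for illustration, not details taken from the paper:

```python
import torch.nn.functional as F

def joint_loss(seq2seq_logits, tgt_ids, lm_src_logits, src_mono_ids,
               lm_tgt_logits, tgt_mono_ids, lm_weight=1.0):
    """Seq2seq cross-entropy plus LM losses on monolingual batches.

    The LM terms keep the shared pre-trained layers close to their
    language-modelling solution while the MT loss adapts them.
    """
    mt = F.cross_entropy(seq2seq_logits.flatten(0, 1), tgt_ids.flatten())
    lm_src = F.cross_entropy(lm_src_logits.flatten(0, 1), src_mono_ids.flatten())
    lm_tgt = F.cross_entropy(lm_tgt_logits.flatten(0, 1), tgt_mono_ids.flatten())
    return mt + lm_weight * (lm_src + lm_tgt)
```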
Result
Outperforms Back-translation
Ablation Study
Pre-training the decoder helps more, presumably because the decoder has the harder job of preserving the source semantics while generating syntactically correct output
The gains from pre-training the different components largely overlap, i.e. they are not additive
The LM objective acts as a strong regularizer and, in particular, helps ensure fluent output
Personal Thoughts
Why is back-translation the only baseline compared against?
How could I train such language models on internal monolingual data? (a rough sketch follows below)
The authors did a careful job of not corrupting the pre-trained weights, recognizing that the LM weights provide good fluency features
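For the question above, a minimal sketch of pre-training one such LSTM LM on in-house monolingual text via next-token prediction; it assumes batches of pre-tokenized ID tensors, reuses the LM module from the earlier sketch, and the hyperparameters are placeholders:

```python
import torch
import torch.nn.functional as F

def train_lm(lm, batches, epochs=5, lr=1e-3):
    """Next-token prediction: shift each sequence by one to build targets."""
    opt = torch.optim.Adam(lm.parameters(), lr=lr)
    for _ in range(epochs):
        for ids in batches:               # ids: LongTensor (batch, seq_len)
            inputs, targets = ids[:, :-1], ids[:, 1:]
            logits = lm(inputs)           # (batch, seq_len - 1, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return lm
```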
Link: https://arxiv.org/pdf/1611.02683.pdf
Authors: Ramachandran et al., 2017