Abstract
Proposes pre-training the encoder and decoder of a seq2seq model with the weights of two separately trained language models
An additional language modeling loss is used to regularize the model during fine-tuning
SOTA on WMT 2014 English->German
Details
Basic Methodology
Train LM_src and LM_tgt on source- and target-side monolingual data
Initialize the embedding layer and first LSTM layer of both the encoder and decoder, plus the decoder's softmax layer, with the corresponding pre-trained LM weights (as sketched below)
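A minimal PyTorch sketch of how this weight transplant could look; the module names (LM, Seq2Seq, init_from_lms) and layer sizes are my own placeholders, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID = 10000, 512, 512  # illustrative sizes

class LM(nn.Module):
    """Word-level LSTM language model (one LSTM layer for brevity)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.softmax = nn.Linear(HID, VOCAB)

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.softmax(h)

class Seq2Seq(nn.Module):
    """Encoder-decoder whose lowest layers mirror the LM architecture."""
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(VOCAB, EMB)
        self.enc_lstm1 = nn.LSTM(EMB, HID, batch_first=True)   # from LM_src
        self.enc_lstm2 = nn.LSTM(HID, HID, batch_first=True)   # random init
        self.tgt_embed = nn.Embedding(VOCAB, EMB)
        self.dec_lstm1 = nn.LSTM(EMB, HID, batch_first=True)   # from LM_tgt
        self.dec_lstm2 = nn.LSTM(HID, HID, batch_first=True)   # random init
        self.softmax = nn.Linear(HID, VOCAB)                    # from LM_tgt

def init_from_lms(model: Seq2Seq, lm_src: LM, lm_tgt: LM):
    """Copy embedding / first-LSTM / softmax weights from the two LMs."""
    model.src_embed.load_state_dict(lm_src.embed.state_dict())
    model.enc_lstm1.load_state_dict(lm_src.lstm.state_dict())
    model.tgt_embed.load_state_dict(lm_tgt.embed.state_dict())
    model.dec_lstm1.load_state_dict(lm_tgt.lstm.state_dict())
    model.softmax.load_state_dict(lm_tgt.softmax.state_dict())
```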
Improvement
Add the monolingual LM losses during fine-tuning to preserve the pre-trained LM feature extractors (alternatively, one can freeze the pre-trained weights for a few epochs and then train all weights); see the sketch after this list
Residual connections, so that noise from the randomly initialized upper layers does not corrupt the pre-trained weights early in training
Multi-layer attention, letting the decoder extract both low- and high-level contextual information from the encoder
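A rough sketch of the joint fine-tuning objective under this scheme, again in PyTorch; the equal weighting via lm_weight and the flattened-logits batching are assumptions for illustration, not details taken from the paper:

```python
import torch.nn.functional as F

def joint_loss(seq2seq_logits, tgt_ids, lm_src_logits, src_mono_ids,
               lm_tgt_logits, tgt_mono_ids, lm_weight=1.0):
    """Seq2seq cross-entropy plus LM losses on monolingual batches.

    The LM terms keep the shared pre-trained layers close to their
    language-modelling solution while the MT loss adapts them.
    """
    mt = F.cross_entropy(seq2seq_logits.flatten(0, 1), tgt_ids.flatten())
    lm_src = F.cross_entropy(lm_src_logits.flatten(0, 1), src_mono_ids.flatten())
    lm_tgt = F.cross_entropy(lm_tgt_logits.flatten(0, 1), tgt_mono_ids.flatten())
    return mt + lm_weight * (lm_src + lm_tgt)
```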
Result
Outperforms Back-translation
Ablation Study
Pre-training the decoder helps more, presumably because the decoder has the harder job of preserving the source semantics while generating syntactically correct output
The gains from pre-training the different components largely overlap, i.e. they are not additive
The LM objective acts as a strong regularizer and, in particular, helps ensure fluent output
Personal Thoughts
Why is back-translation the only baseline compared against?
How could I train such language models on internal monolingual data? (a rough sketch follows below)
The authors did a careful job of not corrupting the pre-trained weights, recognizing that the LM weights provide good fluency features
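For the question above, a minimal sketch of pre-training one such LSTM LM on in-house monolingual text via next-token prediction; it assumes batches of pre-tokenized ID tensors, reuses the LM module from the earlier sketch, and the hyperparameters are placeholders:

```python
import torch
import torch.nn.functional as F

def train_lm(lm, batches, epochs=5, lr=1e-3):
    """Next-token prediction: shift each sequence by one to build targets."""
    opt = torch.optim.Adam(lm.parameters(), lr=lr)
    for _ in range(epochs):
        for ids in batches:               # ids: LongTensor (batch, seq_len)
            inputs, targets = ids[:, :-1], ids[:, 1:]
            logits = lm(inputs)           # (batch, seq_len - 1, vocab)
            loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return lm
```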
Link: https://arxiv.org/pdf/1611.02683.pdf
Authors: Ramachandran et al., 2017