Unsupervised Pretraining for Sequence to Sequence Learning #7

Metadata

- Authors: Prajit Ramachandran, Peter J. Liu and Quoc V. Le
- Organization: Google Brain
- Conference: EMNLP 2017
- Link: https://goo.gl/n2cKG9

Summary

This paper proposes to initialize the weights of the encoder and decoder in a neural sequence-to-sequence (Seq2Seq) model with two pre-trained language models, and then fine-tune the model on labeled data. During the fine-tuning phase, they jointly train the Seq2Seq objective with the language modeling (LM) objectives to prevent overfitting. The experimental results show that their method achieves an improvement of 1.3 BLEU over the previous best models on both WMT'14 and WMT'15 English->German. Human evaluation on abstractive summarization also shows that their method outperforms a purely supervised learning baseline. (See more important insights in the ablation studies below.)

This paper is also similar to #3, #5 and #6.
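Below is a minimal PyTorch-style sketch of the warm-start idea from the summary (module names, sizes, and the single-layer LSTM are my own simplifications, not the paper's code): the encoder is initialized from a source-side LM and the decoder plus its softmax from a target-side LM.

```python
import torch.nn as nn

VOCAB, EMB, HID = 10000, 128, 512  # illustrative sizes, not the paper's

class LanguageModel(nn.Module):
    """Uni-directional LSTM LM; the same architecture is reused as encoder/decoder."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.softmax = nn.Linear(HID, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return self.softmax(hidden)

# Pretend these were pre-trained on monolingual source/target corpora.
src_lm, tgt_lm = LanguageModel(), LanguageModel()

# Warm-start the Seq2Seq: encoder <- source-side LM, decoder (+ softmax) <- target-side LM.
encoder, decoder = LanguageModel(), LanguageModel()
encoder.load_state_dict(src_lm.state_dict())
decoder.load_state_dict(tgt_lm.state_dict())
```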

Figure 1


Fine-tuning with Language Modeling Objective

After the Seq2Seq model is initialized with the pre-trained language models (see Figure 1), it is fine-tuned with both the translation objective and the language modeling objective (multi-task learning) to avoid catastrophic forgetting (Goodfellow et al., 2013).
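As a rough sketch of this joint objective (the per-batch loss inputs and the `lm_weight` knob are my assumptions, not taken from the paper's code), fine-tuning minimizes the translation loss plus the LM losses on both sides:

```python
import torch

def joint_loss(translation_loss: torch.Tensor,
               src_lm_loss: torch.Tensor,
               tgt_lm_loss: torch.Tensor,
               lm_weight: float = 1.0) -> torch.Tensor:
    """Translation objective plus LM objectives on source and target sides;
    the LM terms act as a regularizer that keeps the pre-trained weights
    from being forgotten during fine-tuning."""
    return translation_loss + lm_weight * (src_lm_loss + tgt_lm_loss)
```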


Other Improvements

  1. Residual connections: Since the task-specific layer is randomly initialized at first, its output is essentially random early in fine-tuning; residual connections from the pre-trained layer to the softmax layer avoid feeding the softmax only this random vector.
  2. Multi-layer attention: The decoder attends to both the task-specific layer and the pre-trained layer of the encoder (see the sketch after this list).
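
A rough sketch of both tricks, assuming a two-layer stack where the first layer is pre-trained and the second is the randomly initialized task-specific layer (all names and sizes are mine, not the paper's):

```python
import torch
import torch.nn as nn

EMB, HID, VOCAB = 128, 512, 10000

embed = nn.Embedding(VOCAB, EMB)
pretrained_layer = nn.LSTM(EMB, HID, batch_first=True)  # warm-started from the LM
task_layer = nn.LSTM(HID, HID, batch_first=True)        # randomly initialized
softmax = nn.Linear(HID, VOCAB)

tokens = torch.randint(0, VOCAB, (4, 7))                # (batch, time) token ids
h_pre, _ = pretrained_layer(embed(tokens))
h_task, _ = task_layer(h_pre)

# 1. Residual connection: feed h_pre + h_task to the softmax, so early in
#    fine-tuning it is not driven purely by the untrained task-specific layer.
logits = softmax(h_pre + h_task)

# 2. Multi-layer attention: build the attention memory from both layers'
#    states, e.g. by concatenating them along the feature dimension.
attention_memory = torch.cat([h_pre, h_task], dim=-1)   # (batch, time, 2*HID)
```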

Figure 2


Experiment Settings

Machine translation dataset

Abstractive summarization dataset


Experiment Results

Results on machine translation task:

Table 1

Figure 3 shows important insights from the ablation studies on machine translation.

Figure 3

Figure 4 shows that the unsupervisedly pre-trained models degrade less than the models without unsupervised pre-training as the labeled dataset becomes smaller.

Figure 4

Results on abstractive summarization task:

Table 2

The unsupervisedly pre-trained model is only able to match the baseline model, which uses word2vec pre-trained embeddings + a bi-LSTM encoder. Keep in mind that this paper uses only a uni-directional LSTM as the encoder in the Seq2Seq model. This result shows that pre-training the embeddings alone already gives a large improvement.

Figure 5 shows important insights from the ablation studies on abstractive summarization.

Figure 5

Finally, human evaluation shows that unsupervised pre-training not only achieves better generalization but also reduces unwanted repetition in abstractive summarization.


Personal Thoughts

Maybe we could pre-train a bidirectional language model with a bi-LSTM, instead of the uni-directional pre-trained LSTM proposed in this paper.

