howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

Using Target-side Monolingual Data for Neural Machine Translation through Multi-task Learning #5


Metadata

Summary

This paper proposes to train a neural machine translation (NMT) model from scratch on both bilingual and target-side monolingual data in a multi-task setting, by modifying the decoder of a neural sequence-to-sequence model so that it supports both target-side language modeling and translation.

The idea of this paper is similar to #3, which uses a recurrent neural network (RNN) to jointly train sequence labeling and bidirectional language modeling (predicting context words). The difference is that this paper predicts the next target word only from the representation of the lowest layer of the decoder, on top of which an attentional recurrent model conditioned on the source representation performs translation, whereas #3 predicts context words from the representation of the top layer of the RNN.

Though the experimental results show some improvement, the gains are not as large as those obtained by training on synthetic parallel data generated with the back-translation approach (Sennrich et al., 2016).


Incorporating Monolingual Data by Adding a Separate Decoder LM Layer (LML)

Figure 1

The vanilla RNN decoder in seq2seq (see (a) in Figure 1) cannot be trained on monolingual corpora. Previous work has tried to overcome this problem by either using synthetically generated source sequences or using a NULL token as the source sequence (Sennrich et al., 2016). As previously shown empirically, the model tends to forget source-side information if trained on much more monolingual than parallel data.

In their approach (see (b) & (c) in Figure 1), they add a "source-independent" decoder layer for target-side language modeling at the bottom of the original "source-dependent" decoder.
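To make the layer split concrete, here is a minimal PyTorch sketch of such a two-block decoder. All module names, the GRU cells, the dot-product attention, and the layer sizes are my own illustrative choices, not the paper's actual architecture details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskDecoder(nn.Module):
    """Two-block decoder: source-independent LM layer + source-dependent attentional layer."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Source-independent block (sigma): conditioned on target history only,
        # so it can be trained as a language model on monolingual data.
        self.lm_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.lm_out = nn.Linear(hidden_dim, vocab_size)
        # Source-dependent block: attends over encoder states, used only for translation.
        # Assumes encoder states also have size hidden_dim.
        self.attn_proj = nn.Linear(hidden_dim, hidden_dim)
        self.mt_rnn = nn.GRU(hidden_dim * 2, hidden_dim, batch_first=True)
        self.mt_out = nn.Linear(hidden_dim, vocab_size)

    def lm_logits(self, tgt_in):
        """Next-word logits from the bottom layer only (no source states needed)."""
        h, _ = self.lm_rnn(self.embed(tgt_in))                               # (B, T, H)
        return self.lm_out(h), h

    def forward(self, tgt_in, enc_states):
        """Translation logits: bottom-layer states + attention over encoder states."""
        _, h = self.lm_logits(tgt_in)                                        # reuse bottom layer
        scores = torch.bmm(self.attn_proj(h), enc_states.transpose(1, 2))    # (B, T, S)
        context = torch.bmm(F.softmax(scores, dim=-1), enc_states)           # (B, T, H)
        h2, _ = self.mt_rnn(torch.cat([h, context], dim=-1))
        return self.mt_out(h2)
```

The key point of this layout is that `lm_logits` depends only on the target history, so it can be computed (and trained) on monolingual data without any encoder states.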

Multi-task Learning (MTL)

The language model can thus be trained jointly with the translation task through the joint loss:

Equation 11
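Based on the surrounding description (a translation loss on parallel data plus a γ-weighted language-model loss on monolingual data), the joint objective should look roughly like the following; the notation here is reconstructed and may differ from the paper's Equation 11:

```latex
\mathcal{L}(\theta, \sigma) = \mathcal{L}_{\mathrm{MT}}(\theta) + \gamma \, \mathcal{L}_{\mathrm{LM}}(\sigma),
\quad \text{where} \quad
\mathcal{L}_{\mathrm{MT}}(\theta) = -\!\!\sum_{(x,y)\in\mathcal{P}} \sum_{t} \log p_{\theta}(y_t \mid y_{<t}, x),
\qquad
\mathcal{L}_{\mathrm{LM}}(\sigma) = -\!\sum_{y\in\mathcal{M}} \sum_{t} \log p_{\sigma}(y_t \mid y_{<t}),
```

with \(\mathcal{P}\) the parallel data and \(\mathcal{M}\) the target-side monolingual data.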

The parameters (σ) of the "source-independent" decoder layer are updated by gradients from both the translation and language-modeling losses, while the parameters of the "source-dependent" decoder layers are updated only by gradients from the translation task. Note: σ is a subset of the full parameter set θ.

In practice, mini-batches are composed of 50% parallel data and 50% monolingual data, and γ (gamma) is set to 1.
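A hedged sketch of one such mixed training step, assuming the `MultiTaskDecoder` sketched above plus hypothetical `encoder`, `parallel_batches`, and `mono_batches` objects. It illustrates how the translation loss reaches all parameters (θ), while the LM loss only reaches the source-independent parameters (σ): the embedding, the bottom RNN, and the LM output layer.

```python
import torch
import torch.nn as nn

PAD_ID = 0     # hypothetical padding index
GAMMA = 1.0    # LM loss weight, as reported above

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=3e-4)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)

# Each step mixes one parallel batch and one monolingual batch (roughly 50/50).
for (src, tgt_in, tgt_out), (mono_in, mono_out) in zip(parallel_batches, mono_batches):
    optimizer.zero_grad()

    # Translation loss: gradients flow into all parameters theta
    # (encoder, attention, and both decoder blocks).
    enc_states = encoder(src)
    mt_logits = decoder(tgt_in, enc_states)
    loss_mt = criterion(mt_logits.reshape(-1, mt_logits.size(-1)), tgt_out.reshape(-1))

    # Language-model loss: computed from the bottom layer only, so gradients
    # flow only into the source-independent parameters sigma.
    lm_logits, _ = decoder.lm_logits(mono_in)
    loss_lm = criterion(lm_logits.reshape(-1, lm_logits.size(-1)), mono_out.reshape(-1))

    (loss_mt + GAMMA * loss_lm).backward()
    optimizer.step()
```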


Experiments

Data statistics:

Table 2

Experimental Setup

Results

Table 1

Hypothesis (Or maybe insights...)

While separating out a language model allows multi-task training on mixed data types, it constrains gradients from monolingual examples to the subset of source-independent network parameters (σ). In contrast, synthetic data always affects all network parameters (θ) and has a positive effect despite the source sequences being noisy. The authors speculate that training on synthetic source data may also act as a model regularizer.


References

- Sennrich, R., Haddow, B., & Birch, A. (2016). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of ACL 2016.