howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

Unsupervised Machine Translation using Monolingual Corpora Only #1

Open howardyclo opened 6 years ago

howardyclo commented 6 years ago

Metadata

howardyclo commented 6 years ago

Summary

This paper proposes an unsupervised approach to neural machine translation (NMT) that uses monolingual corpora only. The core idea is to start from an unsupervised word-by-word translation model and iteratively improve it with denoising and adversarial training, aligning the latent distributions of the two languages.

Figure 1

Figure 2

Motivation


Model


Training Objectives

Initialization

The model starts with an unsupervised naive translation model obtained by translating sentences word-by-word with a parallel dictionary learned in an unsupervised way (Conneau et al. 2017). Then, at each iteration, the model is trained by minimizing an objective function that measures its ability to both reconstruct and translate from a noisy input sentence.
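As a concrete illustration, below is a minimal sketch of the word-by-word initialization. The dictionary is a toy stand-in (in the paper it is learned without any parallel data, following Conneau et al. 2017), and copying out-of-vocabulary tokens unchanged is an assumed fallback, not a detail taken from the paper.

```python
def word_by_word_translate(sentence, dictionary):
    """Translate a sentence token-by-token with a bilingual dictionary.

    Unknown tokens are copied unchanged (an assumed OOV fallback).
    """
    return [dictionary.get(tok, tok) for tok in sentence.split()]

# Toy fr->en dictionary for illustration only.
toy_dict = {"le": "the", "chat": "cat", "dort": "sleeps"}
print(word_by_word_translate("le chat dort", toy_dict))  # ['the', 'cat', 'sleeps']
```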

Denoising Auto-Encoding (Critical to performance)

Equation 1: Δ is the sum of token-level cross-entropy losses between the sentence x and its reconstruction x_hat decoded from the corrupted input C(x). The noise model C corrupts a sentence by dropping and swapping tokens.
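A minimal sketch of the noise model C, assuming the settings reported in the paper (each token dropped with probability 0.1, and a local shuffle in which no token moves more than k = 3 positions, implemented by sorting on indices perturbed with uniform noise):

```python
import random

def add_noise(tokens, p_drop=0.1, k=3):
    """Noise model C (sketch): drop tokens, then locally shuffle.

    Sorting on index + Uniform(0, k+1) noise keeps every surviving
    token within k positions of where it started.
    """
    kept = [t for t in tokens if random.random() > p_drop]
    keys = [i + random.uniform(0, k + 1) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]

print(add_noise("the quick brown fox jumps over the lazy dog".split()))
```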

Cross Domain Training (Back-translation)

Equation 2: Learn to reconstruct the sentence x from C(y), where y = M(x) is the translation of x produced by the current translation model M, and Δ is again the sum of token-level cross-entropy losses.
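Schematically, one cross-domain update could look like the sketch below. The model interfaces (`.translate()`, a forward pass returning per-token logits) are hypothetical placeholders, not the authors' code:

```python
import torch
import torch.nn.functional as F

def cross_domain_loss(frozen_model, model, x, noise_fn):
    """One back-translation step (sketch, hypothetical model interfaces).

    x: (batch, len) tensor of token ids in the source language.
    """
    with torch.no_grad():              # y = M(x): no gradient through the
        y = frozen_model.translate(x)  # frozen model of the last iteration
    logits = model(noise_fn(y), target=x)  # reconstruct x from C(y)
    # Δ: sum of token-level cross-entropy losses against the original x.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           x.view(-1), reduction="sum")
```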

Adversarial Training

A discriminator is trained jointly to classify the language of a latent representation, given the encodings of source sentences and of target sentences (Ganin et al. 2016). In detail, the discriminator operates on a sequence of encoded hidden state vectors and produces a binary prediction (0: source; 1: target). The discriminator is trained by minimizing the cross-entropy loss:

Equation 3-1

The encoder is trained instead to fool the discriminator:

Equation 3-2
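Putting Equations 3-1 and 3-2 into code, here is a runnable sketch of the two adversarial losses. The discriminator loosely follows the paper's description (an MLP with LeakyReLU activations), but the layer sizes and the encoder state dimension of 256 used here are placeholder assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

disc = nn.Sequential(nn.Linear(256, 1024), nn.LeakyReLU(0.2),
                     nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
                     nn.Linear(1024, 1))

def disc_loss(h, lang):
    """Equation 3-1: discriminator cross-entropy on encoder states.
    h: (n_states, 256) hidden vectors (detach h when updating disc);
    lang: 0 for source, 1 for target.
    """
    logits = disc(h).squeeze(-1)
    return F.binary_cross_entropy_with_logits(
        logits, torch.full_like(logits, float(lang)))

def enc_adv_loss(h, lang):
    """Equation 3-2: the encoder is trained to fool the discriminator,
    i.e. cross-entropy against the flipped language label."""
    logits = disc(h).squeeze(-1)
    return F.binary_cross_entropy_with_logits(
        logits, torch.full_like(logits, float(1 - lang)))

h = torch.randn(7, 256)  # 7 encoded states of one source sentence
print(disc_loss(h.detach(), 0).item(), enc_adv_loss(h, 0).item())
```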

Final Objective

Equation 4: λ_auto, λ_cd, λ_adv are hyper-parameters weighting the denoising, cross-domain, and adversarial losses.
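Written out from the three losses above (reconstructed here in the paper's notation, so treat the exact form as approximate):

```latex
\mathcal{L}(\theta_{enc}, \theta_{dec}, \mathcal{Z}) =
    \lambda_{auto}\big[\mathcal{L}_{auto}(src) + \mathcal{L}_{auto}(tgt)\big]
  + \lambda_{cd}\big[\mathcal{L}_{cd}(src \to tgt) + \mathcal{L}_{cd}(tgt \to src)\big]
  + \lambda_{adv}\,\mathcal{L}_{adv}
```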


Training Algorithm

Algorithm 1
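A high-level sketch of the iterative loop in Algorithm 1; every helper name here is a hypothetical placeholder, not the authors' code:

```python
import copy

def train_unsupervised_nmt(src_corpus, tgt_corpus, n_iterations, n_epochs):
    """Iterative training loop (sketch of Algorithm 1)."""
    # M(1): word-by-word model from an unsupervised dictionary
    # (Conneau et al. 2017).
    model = build_word_by_word_model(src_corpus, tgt_corpus)
    for t in range(n_iterations):
        frozen = copy.deepcopy(model)  # M(t), used only to back-translate
        for _ in range(n_epochs):
            update_discriminator(model, src_corpus, tgt_corpus)  # Eq. 3-1
            # One step on Eq. 4: denoising + cross-domain + adversarial.
            update_encoder_decoder(model, frozen, src_corpus, tgt_corpus)
    return model
```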

Model Selection

Since no parallel data is available for validation, they use a surrogate criterion: Equation 5
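The criterion is, in effect, round-trip BLEU: translate monolingual sentences into the other language and back, then score the reconstructions against the originals, averaged over both directions. A sketch, with `model.translate` and `bleu` as hypothetical placeholders:

```python
def surrogate_score(model, src_sents, tgt_sents, bleu):
    """Equation 5 (sketch): mean round-trip BLEU over both directions."""
    src_round = [model.translate(model.translate(x, "src2tgt"), "tgt2src")
                 for x in src_sents]
    tgt_round = [model.translate(model.translate(y, "tgt2src"), "src2tgt")
                 for y in tgt_sents]
    return 0.5 * (bleu(src_sents, src_round) + bleu(tgt_sents, tgt_round))
```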

The figure below shows the correlation between this measure and the final translation performance evaluated on a parallel test set.

Figure 3


Experiments

Datasets and Preprocessing

Table 1

Baselines

Unsupervised Dictionary Learning

Experimental Details

Experimental Results

Table 2

The right panel of the figure below shows that their unsupervised approach obtains the same performance as a supervised NMT model trained on about 100,000 parallel sentences.

Figure 4

Ablation Study

Table 4


Personal Thoughts

The word-by-word translation and word-reordering baselines are still competitive with the proposed unsupervised method on the WMT datasets, indicating that unsupervised NMT still needs considerable work to improve.


References

- Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2017). Unsupervised Machine Translation Using Monolingual Corpora Only. arXiv:1711.00043.
- Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2017). Word Translation Without Parallel Data. arXiv:1710.04087.
- Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., & Lempitsky, V. (2016). Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research, 17.
- Artetxe, M., Labaka, G., Agirre, E., & Cho, K. (2017). Unsupervised Neural Machine Translation. arXiv:1710.11041.

howardyclo commented 6 years ago

For a comparison with the similar paper Unsupervised Neural Machine Translation by Artetxe et al. 2017, please refer to the slides.

elyarAbad commented 3 years ago

Hi! Sorry! In the published paper, it is mentioned that "We will release the code to the public once the revision process is over". I will be really THANKFUL if you share the link to source code. Thank you very much!