microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Questions on Table 3 of the MASS paper? #149

Open · Epsilon-Lee opened this issue 4 years ago

Epsilon-Lee commented 4 years ago

Hi, dear authors:

I wonder how you conducted the DAE pre-training experiments shown in Table 3. (I only list the en-fr results below.)

Table 3. BLEU score comparisons between MASS and other pre-training methods.

| Method  | en-fr | fr-en |
|---------|-------|-------|
| BERT+LM | 33.4  | 32.3  |
| DAE     | 30.1  | 28.3  |
| MASS    | 37.5  | 34.9  |

As you described in your paper,

"The second baseline is DAE, which simply uses denoising auto-encoder to pretrain the encoder and decoder."

So my questions are:

  1. Do you simply use pairs (c(x), x) from the two languages to pre-train the seq2seq model? That is, the same DAE objective as the DAE loss used alongside back-translation when fine-tuning XLM (for concreteness, a sketch of the noise function c(x) I have in mind follows this list).
  2. Are there any tricks to make it work?
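Here is a minimal sketch of that noise function c(x) (word shuffle within a small window, plus word dropout and blanking, roughly in the style of XLM's DAE); the window size, probabilities, and the `<mask>` token are illustrative assumptions, not the paper's exact settings.

```python
import random

def noise(tokens, k=3, p_drop=0.1, p_blank=0.1, blank="<mask>"):
    """Illustrative c(x): shuffle words within a window of k, then drop or blank some.

    The hyper-parameters here are assumptions for the sketch, not the values
    used in MASS or XLM.
    """
    # Word shuffle: each token is displaced by at most k positions.
    keys = [i + random.uniform(0, k) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda p: p[0])]

    out = []
    for tok in shuffled:
        r = random.random()
        if r < p_drop:              # word dropout
            continue
        if r < p_drop + p_blank:    # word blanking
            out.append(blank)
        else:
            out.append(tok)
    return out

x = "the quick brown fox jumps over the lazy dog".split()
print(noise(x))  # noised input c(x); the clean sentence x is the decoder target
```

The pre-training pair would then simply be (noise(x), x) for monolingual sentences x in each language.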

I have done experiments using DAE to pre-train the seq2seq model, but when I continue training with (only) BT [1], I only get the following BLEU scores, which are far lower than your reported ones, so I wonder whether I have misunderstood some details.

Table. My run

| Method | en-fr | fr-en |
|--------|-------|-------|
| DAE    | 11.20 | 10.68 |

Please help me out here, thanks a lot.

Footnote. [1] One difference between my training setting and yours: during fine-tuning, I only use the BT loss instead of both the DAE and BT losses (a sketch of the BT step I use is below).
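For reference, by "BT loss" I mean standard online back-translation, roughly as in the following sketch. The `generate` and loss-returning calls are hypothetical interfaces standing in for whatever the actual codebase exposes, not the real MASS/XLM API.

```python
import torch

def bt_step(model_s2t, model_t2s, batch_tgt, optimizer):
    """One online back-translation update for the source->target model.

    `model_t2s.generate` and `model_s2t(..., labels=...)` are assumed
    interfaces (generation, and a seq2seq returning a cross-entropy loss).
    """
    # 1. Back-translate: produce synthetic source sentences for real
    #    target sentences, without tracking gradients.
    with torch.no_grad():
        synthetic_src = model_t2s.generate(batch_tgt)

    # 2. Train source->target on (synthetic source, real target).
    loss = model_s2t(synthetic_src, labels=batch_tgt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same step is run symmetrically in the other direction, alternating over the two languages.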

StillKeepTry commented 4 years ago

@Epsilon-Lee We conducted this ablation study using this code. Using DAE in a fully shared model will lead to identity mapping; you can try the older unsupervised NMT framework, which supports leaving some modules unshared (a rough illustration of such partial sharing is sketched below).
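As a rough illustration of "some modules unshared", one possible layout is a shared encoder with a separate decoder per language; this is just a sketch for discussion, not the actual MASS or UnsupervisedMT architecture.

```python
import torch.nn as nn

class SharedEncSeparateDec(nn.Module):
    """Illustrative layout only: shared encoder/embeddings, one decoder per language.

    The layer sizes and the use of nn.Transformer* modules are assumptions
    for the sketch, not the real configuration.
    """
    def __init__(self, vocab_size, d_model=512, nhead=8, nlayers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                 # shared
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            nlayers,
        )                                                              # shared
        self.decoders = nn.ModuleDict({
            lang: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
                nlayers,
            )
            for lang in ("en", "fr")                                   # unshared
        })
        self.proj = nn.Linear(d_model, vocab_size)                     # shared

    def forward(self, src_ids, tgt_ids, tgt_lang):
        memory = self.encoder(self.embed(src_ids))
        out = self.decoders[tgt_lang](self.embed(tgt_ids), memory)
        return self.proj(out)
```

Keeping the decoders language-specific removes the trivial solution where a fully shared encoder-decoder trained only on (c(x), x) learns to copy its input regardless of the target language.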

Epsilon-Lee commented 3 years ago

Thanks a lot for your response. I will definitely give it a try and come back to report the numbers!