microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Questions on Table 3 of the MASS paper? #149

Open · Epsilon-Lee opened this issue 4 years ago

Epsilon-Lee commented 4 years ago

Hi, dear authors:

I wonder how you conducted the DAE pre-training experiments shown in Table 3. (I only list the en-fr results below.)

Table 3. BLEU score comparisons between MASS and other pre-training methods.

| Method  | en-fr | fr-en |
|---------|-------|-------|
| BERT+LM | 33.4  | 32.3  |
| DAE     | 30.1  | 28.3  |
| MASS    | 37.5  | 34.9  |

As you described in your paper,

"The second baseline is DAE, which simply uses denoising auto-encoder to pretrain the encoder and decoder."

So my questions are:

  1. Do you simply use pairs (c(x), x) from the two languages to pre-train the seq2seq model? That is, the same DAE objective as the DAE loss used alongside back-translation when fine-tuning XLM (for concreteness, a sketch of the noise function c(x) I have in mind follows this list).
  2. Are there any tricks to make it work?
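Here is a minimal sketch of that noise function c(x) (word shuffle within a small window, plus word dropout and blanking, roughly in the style of XLM's DAE); the window size, probabilities, and the `<mask>` token are illustrative assumptions, not the paper's exact settings.

```python
import random

def noise(tokens, k=3, p_drop=0.1, p_blank=0.1, blank="<mask>"):
    """Illustrative c(x): shuffle words within a window of k, then drop or blank some.

    The hyper-parameters here are assumptions for the sketch, not the values
    used in MASS or XLM.
    """
    # Word shuffle: each token is displaced by at most k positions.
    keys = [i + random.uniform(0, k) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda p: p[0])]

    out = []
    for tok in shuffled:
        r = random.random()
        if r < p_drop:              # word dropout
            continue
        if r < p_drop + p_blank:    # word blanking
            out.append(blank)
        else:
            out.append(tok)
    return out

x = "the quick brown fox jumps over the lazy dog".split()
print(noise(x))  # noised input c(x); the clean sentence x is the decoder target
```

The pre-training pair would then simply be (noise(x), x) for monolingual sentences x in each language.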

I have done experiments using DAE to pre-train the seq2seq model, but when I continue training with (only) BT [1], I only get the following BLEU scores, which are far lower than your reported ones, so I wonder whether I have misunderstood some details.

Table. My run

| Method | en-fr | fr-en |
|--------|-------|-------|
| DAE    | 11.20 | 10.68 |

Please help me out here, thanks a lot.

Footnote. [1] One difference between my training setting and yours: during fine-tuning, I only use the BT loss instead of both the DAE and BT losses (a sketch of the BT step I use is below).
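For reference, by "BT loss" I mean standard online back-translation, roughly as in the following sketch. The `generate` and loss-returning calls are hypothetical interfaces standing in for whatever the actual codebase exposes, not the real MASS/XLM API.

```python
import torch

def bt_step(model_s2t, model_t2s, batch_tgt, optimizer):
    """One online back-translation update for the source->target model.

    `model_t2s.generate` and `model_s2t(..., labels=...)` are assumed
    interfaces (generation, and a seq2seq returning a cross-entropy loss).
    """
    # 1. Back-translate: produce synthetic source sentences for real
    #    target sentences, without tracking gradients.
    with torch.no_grad():
        synthetic_src = model_t2s.generate(batch_tgt)

    # 2. Train source->target on (synthetic source, real target).
    loss = model_s2t(synthetic_src, labels=batch_tgt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same step is run symmetrically in the other direction, alternating over the two languages.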

StillKeepTry commented 4 years ago

@Epsilon-Lee We conducted this ablation study using this code. Using DAE in a fully shared model will lead to identity mapping; you can try the older unsupervised NMT framework, which supports leaving some modules unshared (a rough illustration of such partial sharing is sketched below).
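As a rough illustration of "some modules unshared", one possible layout is a shared encoder with a separate decoder per language; this is just a sketch for discussion, not the actual MASS or UnsupervisedMT architecture.

```python
import torch.nn as nn

class SharedEncSeparateDec(nn.Module):
    """Illustrative layout only: shared encoder/embeddings, one decoder per language.

    The layer sizes and the use of nn.Transformer* modules are assumptions
    for the sketch, not the real configuration.
    """
    def __init__(self, vocab_size, d_model=512, nhead=8, nlayers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)                 # shared
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            nlayers,
        )                                                              # shared
        self.decoders = nn.ModuleDict({
            lang: nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
                nlayers,
            )
            for lang in ("en", "fr")                                   # unshared
        })
        self.proj = nn.Linear(d_model, vocab_size)                     # shared

    def forward(self, src_ids, tgt_ids, tgt_lang):
        memory = self.encoder(self.embed(src_ids))
        out = self.decoders[tgt_lang](self.embed(tgt_ids), memory)
        return self.proj(out)
```

Keeping the decoders language-specific removes the trivial solution where a fully shared encoder-decoder trained only on (c(x), x) learns to copy its input regardless of the target language.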

Epsilon-Lee commented 3 years ago

Thanks a lot for your response. I will definitely give it a try and come back to report the numbers!