jungokasai / deep-shallow


Importance of distillation #4

Closed · sxjscience closed this 3 years ago

sxjscience commented 3 years ago

Thanks for sharing the source code of the paper "Deep Encoder, Shallow Decoder"! I've recently been trying to reproduce the paper. As a first step, I haven't used any distillation and am directly training the model end-to-end. However, when evaluating the model (enc12-dec1) with sacrebleu, I find that it can hardly exceed a 26.0 BLEU score, while the enc6-dec6 structure can reach 27.0. I later noticed that similar performance has also been observed in Huggingface: https://huggingface.co/allenai/wmt16-en-de-12-1, where the released pretrained model obtained a 25.75 sacrebleu score.
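In case it's useful, this is roughly how I score the released Huggingface checkpoint (a minimal sketch, not my exact script; the file names are placeholders and I assume detokenized source/reference text, one sentence per line):

```python
# Minimal sketch: score the released checkpoint with sacrebleu.
# File paths are placeholders for illustration.
import sacrebleu
from transformers import FSMTForConditionalGeneration, FSMTTokenizer

name = "allenai/wmt16-en-de-12-1"
tokenizer = FSMTTokenizer.from_pretrained(name)
model = FSMTForConditionalGeneration.from_pretrained(name)

src = open("test.en").read().splitlines()  # placeholder paths
ref = open("test.de").read().splitlines()

hyp = []
for line in src:
    batch = tokenizer(line, return_tensors="pt")
    out = model.generate(**batch, num_beams=5)
    hyp.append(tokenizer.decode(out[0], skip_special_tokens=True))

print(sacrebleu.corpus_bleu(hyp, [ref]).score)
```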

Is it normal that we can only obtain a BLEU score below 26 without any distillation for the enc12-dec1 architecture? The dataset I'm using is WMT2014 EN-DE.

jungokasai commented 3 years ago

Table 5 in the paper shows the difference in tokenized BLEU between models trained without and with distillation. We found that enc12-dec1 was about 0.5 BLEU points worse than enc6-dec6 under the same raw-data training condition, so indeed I see some discrepancy here. I believe sacrebleu scores are usually a bit lower than tokenized ones, but can you confirm that your enc6-dec6 result is from training on the raw data?
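To make that metric gap concrete, you can score the same outputs both ways (a sketch assuming you have hypotheses and references in both Moses-tokenized and detokenized form; file names are placeholders):

```python
import sacrebleu

# Detokenized hypotheses/references: standard sacreBLEU (13a tokenizer).
hyp_detok = open("hyp.detok.de").read().splitlines()
ref_detok = open("ref.detok.de").read().splitlines()
print("sacreBLEU:", sacrebleu.corpus_bleu(hyp_detok, [ref_detok]).score)

# Moses-tokenized hypotheses/references scored with no further
# tokenization: roughly what multi-bleu.perl reports on tokenized text.
hyp_tok = open("hyp.tok.de").read().splitlines()
ref_tok = open("ref.tok.de").read().splitlines()
print("tokenized BLEU:",
      sacrebleu.corpus_bleu(hyp_tok, [ref_tok], tokenize="none").score)
```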

sxjscience commented 3 years ago

Yes, I preprocess the raw WMT2014 corpus with the standard pipeline: 1) cleaning plus Moses normalization and tokenization, 2) learning a subword model, 3) training the NMT model. When evaluating the generated outputs, I first remove the subword segmentation and then detokenize with the Moses detokenizer. After that, I call SacreBLEU to evaluate.
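Concretely, the evaluation side of that pipeline looks roughly like this (a sketch assuming BPE with the usual `@@ ` continuation marker and the sacremoses package; file names are placeholders):

```python
import sacrebleu
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="de")

# Undo subword segmentation, then Moses tokenization, on each output line.
hyps = []
with open("model_output.bpe.de") as f:          # placeholder: raw system output
    for line in f:
        tok = line.strip().replace("@@ ", "")   # remove BPE continuation markers
        hyps.append(md.detokenize(tok.split())) # undo Moses tokenization

# Score against the untouched (raw, detokenized) references.
refs = open("newstest2014.raw.de").read().splitlines()
print(sacrebleu.corpus_bleu(hyps, [refs]).score)
```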

sxjscience commented 3 years ago

I reran the experiments and found that the enc12-dec1 model can achieve a 26.2 sacreBLEU score (trained without distillation), while the enc6-dec6 model can achieve 27.0. That's a 0.8 BLEU gap. Maybe the difference is slightly amplified when both models are evaluated with sacreBLEU.

jungokasai commented 3 years ago

Thank you for the update. What was the difference between your original run and this run? You said you weren't able to get 26.0 sacrebleu originally?

sxjscience commented 3 years ago

The major difference is that I clipped the gradient norm to 1.0. In addition, I used lp_alpha=0.0 in the beam search and averaged the last 10 checkpoints. I had been expecting a smaller gap between enc12-dec1 and enc6-dec6.
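For reference, those training-side changes correspond to something like the following PyTorch sketch (not the exact fairseq code; the checkpoint names are placeholders, and I assume fairseq-style checkpoint dicts with a "model" key):

```python
import torch

# 1) Gradient clipping in the training loop: cap the global gradient
#    norm at 1.0 between backward() and the optimizer step, e.g.
#    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#    (lp_alpha=0.0 simply means no length penalty at beam search time.)

# 2) Average the parameters of the last 10 checkpoints. The epoch
#    numbers below are placeholders.
paths = [f"checkpoints/checkpoint{i}.pt" for i in range(91, 101)]
avg = None
for p in paths:
    state = torch.load(p, map_location="cpu")["model"]
    if avg is None:
        avg = {k: v.clone().float() for k, v in state.items()}  # accumulate in fp32
    else:
        for k, v in state.items():
            avg[k] += v.float()
avg = {k: (v / len(paths)).to(state[k].dtype) for k, v in avg.items()}
torch.save({"model": avg}, "checkpoints/averaged.pt")
```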