Table 5 in the paper shows the difference in tokenized BLEU between models trained without/with distillation. We found that enc12-dec1 was about 0.5 BLEU points worse than enc6-dec6 under the same raw data training condition. So indeed, I see some discrepancy here. I believe sacrebleu scores are usually a bit worse than tokenized ones, but can you confirm that your enc6-dec6 result is from training on the raw data?
Yes, I preprocess the raw WMT2014 corpus with the standard pipeline: 1) cleaning + Moses normalization and tokenization, 2) learning a subword model, 3) training the NMT model. When evaluating, I first merge the subword units back into tokens, then detokenize with the Moses detokenizer, and finally call SacreBLEU on the detokenized output.
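Roughly, the evaluation step looks like this (a minimal sketch assuming BPE-style `@@ ` continuation markers and German as the target language; the exact subword scheme and file names are placeholders, not my actual setup):

```python
# Minimal sketch of the evaluation pipeline described above:
# 1) undo subword segmentation, 2) Moses-detokenize, 3) score with SacreBLEU.
# Assumes BPE-style "@@ " continuation markers; file names are hypothetical.
import sacrebleu
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="de")

def postprocess(line):
    tokens = line.strip().replace("@@ ", "").split()  # merge subword pieces back into tokens
    return md.detokenize(tokens)                       # undo Moses tokenization

with open("hypotheses.bpe.txt") as f:
    hyps = [postprocess(line) for line in f]
with open("references.detok.txt") as f:
    refs = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]))  # prints the corpus-level BLEU score
```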
I reran the experiments and found that the enc12-dec1 model can achieve a 26.2 sacreBLEU score (trained without distillation), while the enc6-dec6 model can achieve 27.0, i.e., a 0.8 BLEU gap. Maybe the difference is slightly amplified when both models are evaluated with sacreBLEU rather than tokenized BLEU.
Thank you for the update. What was the difference between your original run and this run? You said you weren't able to get 26.0 sacrebleu originally?
The major difference is that I clipped the gradient norm to 1.0. In addition, I used lp_alpha=0.0 in the beam search and averaged the last 10 checkpoints. I had been expecting a smaller gap between enc12-dec1 and enc6-dec6.
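For anyone trying to reproduce this, below is a rough PyTorch sketch of those two tweaks (gradient-norm clipping at 1.0 and averaging the last 10 checkpoints); the checkpoint file names and layout are assumptions, not the exact script I used:

```python
# Rough PyTorch sketch of the two tweaks mentioned above; paths and checkpoint
# layout (one plain state_dict per file) are assumptions.
import torch

# (1) During training, clip the global gradient norm to 1.0 before each optimizer step:
#     torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# (2) Before evaluation, average the parameters of the last N checkpoints.
def average_checkpoints(paths):
    """Element-wise mean of several saved state_dicts (floating-point tensors only)."""
    states = [torch.load(p, map_location="cpu") for p in paths]
    avg = {}
    for key, value in states[0].items():
        if torch.is_floating_point(value):
            avg[key] = sum(s[key].float() for s in states) / len(states)
        else:
            avg[key] = value  # keep integer buffers (e.g. step counters) from the first checkpoint
    return avg

# Hypothetical file names for the last 10 checkpoints:
# model.load_state_dict(average_checkpoints([f"ckpt_{i}.pt" for i in range(91, 101)]))
```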
Thanks for sharing the source code of the paper "Deep Encoder, Shallow Decoder"! I have recently been trying to reproduce the paper. As a first step, I haven't used any distillation and am directly training the model end-to-end. However, when evaluating the model (enc12-dec1) with sacrebleu, I find that it can hardly exceed a 26.0 BLEU score, while the enc6-dec6 structure can reach 27.0. I later noticed that similar performance has also been observed in Huggingface: https://huggingface.co/allenai/wmt16-en-de-12-1, where the released pretrained model obtained a 25.75 sacrebleu score.
Is it normal that we can only obtain a BLEU score below 26 without any distillation for the enc12-dec1 architecture? The dataset I'm using is the WMT2014 EN-DE dataset.