
Reported results can't be achieved in Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models #67

Open zmf0507 opened 3 years ago

zmf0507 commented 3 years ago

@patrickvonplaten I have been trying to reproduce the BLEU score of 31.7 (as reported in the blog and paper for the WMT en->de evaluation) using the Hugging Face model google/bert2bert_L-24_wmt_en_de, but I could only achieve 23.77 on the newstest2014 test set. I kept the beam search configuration as mentioned in the paper (num_beams = 4, length_penalty = 0.6) and fixed max_length = 128, as was done during training in the paper. I have also used the BLEU script mentioned in the footnotes.
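
For reference, here is a minimal sketch of the evaluation setup described above. It assumes the `transformers` `generate()` API and `sacrebleu` for scoring; the special-token arguments and the use of `AutoModelForSeq2SeqLM` follow my reading of the model card, and the placeholder sentences stand in for the full newstest2014 data:

```python
import torch
import sacrebleu
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/bert2bert_L-24_wmt_en_de"

# Special-token arguments as suggested on the model card; treat them as an
# assumption if your transformers version behaves differently.
tokenizer = AutoTokenizer.from_pretrained(
    model_name, pad_token="<pad>", eos_token="</s>", bos_token="<s>"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).eval()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Placeholder data: in the real evaluation these would be the full
# newstest2014 English sources and German references.
en_sentences = ["Would you like to grab a coffee with me this week?"]
de_references = ["Möchten Sie diese Woche einen Kaffee mit mir trinken?"]

hypotheses = []
for sentence in en_sentences:
    input_ids = tokenizer(
        sentence,
        return_tensors="pt",
        add_special_tokens=False,
        truncation=True,
        max_length=128,  # same max length used during training in the paper
    ).input_ids.to(device)
    output_ids = model.generate(
        input_ids,
        num_beams=4,        # beam search config from the paper
        length_penalty=0.6,
        max_length=128,
    )[0]
    hypotheses.append(tokenizer.decode(output_ids, skip_special_tokens=True))

# sacrebleu is used here for convenience; the BLEU script from the paper's
# footnotes may tokenize differently, which can noticeably shift the score.
print(sacrebleu.corpus_bleu(hypotheses, [de_references]).score)
```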

Can you please tell me what could be missing in this process and how I can achieve similar scores?

patrickvonplaten commented 3 years ago

Hey @zmf0507 - I think this is indeed a bug on our side. I'm working on it at the moment, see: https://github.com/huggingface/transformers/issues/9041

zmf0507 commented 3 years ago

Oh! I hope it gets fixed soon. Anyway, thanks for the information. Please update here when it is fixed.

zmf0507 commented 3 years ago

@patrickvonplaten is there any update?

patrickvonplaten commented 3 years ago

Sorry for the late reply!

The problem is that the original code for those translation models was never published, so debugging isn't really possible. The original GitHub repository can be found here: https://github.com/google-research/google-research/tree/master/bertseq2seq and the pretrained weights here: https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1, in case someone is very motivated to take a deeper look.
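
For anyone motivated to dig in, a minimal sketch for loading and inspecting one of those TF Hub checkpoints (assuming TensorFlow 2.x and `tensorflow_hub` are installed; the exact serving signature isn't documented in this thread, so this only lists what the SavedModel exposes rather than running inference):

```python
import tensorflow_hub as hub

# Load the SavedModel from TF Hub (the issue links the roberta24_bbc checkpoint;
# the other bertseq2seq checkpoints live under the same collection).
model = hub.load("https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1")

# Inspect which serving signatures the checkpoint exposes before trying to
# compare its outputs against the transformers port.
print(list(model.signatures.keys()))
```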