bkj closed this issue 3 years ago
Hey @bkj,
Thanks for the very detailed issue. It would be awesome if you could also share the custom scripts you use to evaluate on the entire dataset. This indeed seems like a problem; I'll look into it.
@patrickvonplaten Thanks for the quick response.
Code to run inference w/ the two models can be found here: https://github.com/bkj/hf_bert2bert_debug
By default, it just runs one batch to save time -- you can run on the whole test dataset by setting QUICKRUN = False in each of the files.
BLEU scores on this batch are ~ 23 for HF and ~ 35 for TF.
Let me know what you think! I'm not super familiar w/ transformers, so it's possible I'm making some pre/post-processing mistake -- it's probably a good idea to double-check my glue code.
Hey @bkj,
I'll try to allocate time to solve this problem. I think it is indeed a fundamental difference between the two implementations - will try to investigate. Thanks for your response!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Unstale
Sorry for the late reply!
The problem is that the original code for these translation models was never published, so debugging isn't really possible. The original GitHub repo is here: https://github.com/google-research/google-research/tree/master/bertseq2seq and the pretrained weights are here: https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1 in case someone is very motivated to take a deeper look.
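For anyone picking this up: loading the published SavedModel from TF Hub is the natural starting point. A minimal sketch (the serving signature isn't documented in this thread, so inspecting the loaded object is the first step):

```python
import tensorflow_hub as hub

# Load the published bertseq2seq SavedModel from TF Hub.
model = hub.load("https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1")

# The serving signature isn't documented here, so list what's available
# before trying to call the model.
print(list(model.signatures.keys()))
```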
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Environment info
transformers version: 4.0.0
Who can help
@patrickvonplaten; maybe @patil-suraj
Information
I'm trying to run the transformers implementation of WMT14 DE->EN translation, using the google/bert2bert_L-24_wmt_de_en checkpoint and instructions. The BLEU score I get using translations from the transformers implementation is substantially lower than the one from the official TensorFlow model -- 24.7 w/ HF vs 34.0 w/ the official implementation.
To reproduce
The following snippet shows qualitative differences in the output of the models:
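For reference, a minimal sketch of the HF inference path, following the usage shown on the model card (the example sentence is just an illustration; the full scripts are in the repo linked above):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Token overrides follow the model card for this checkpoint.
tokenizer = AutoTokenizer.from_pretrained(
    "google/bert2bert_L-24_wmt_de_en",
    pad_token="<pad>", eos_token="</s>", bos_token="<s>",
)
model = AutoModelForSeq2SeqLM.from_pretrained("google/bert2bert_L-24_wmt_de_en")

sentence = "Willst du einen Kaffee trinken gehen mit mir?"  # illustrative input
input_ids = tokenizer(sentence, return_tensors="pt", add_special_tokens=False).input_ids
output_ids = model.generate(input_ids)[0]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```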
I can also share the (custom) scripts I'm using to run inference on the entire dataset and compute BLEU scores. Note I am using the same BLEU code for both implementations.
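A sketch of that BLEU computation, assuming sacrebleu (the exact BLEU code lives in the linked repo and may differ in detail):

```python
import sacrebleu

# hypotheses: one translation per test sentence; references: one (or more)
# reference streams, each with one reference per test sentence.
hypotheses = ["The cat sat on the mat."]
references = [["The cat sat on the mat."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```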
Expected behavior
I would expect the BLEU scores and the quality of the translations to be comparable.
Thanks!