huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

google/bert2bert_L-24_wmt_de_en doesn't match official implementation #9041

Closed bkj closed 3 years ago

bkj commented 3 years ago

Environment info

Who can help

@patrickvonplaten ; maybe @patil-suraj

Information

I'm trying to run the transformers implementation of WMT14 DE->EN translation, using the google/bert2bert_L-24_wmt_de_en checkpoint and the accompanying instructions.

The BLEU score I get using translations from the transformers implementation is substantially lower than the one I get from the official TensorFlow model -- 24.7 w/ HF vs 34.0 w/ the official implementation.

To reproduce

The following snippet shows qualitative differences in the output of the models:

import datasets
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# --
# Load dataset

dataset  = datasets.load_dataset("wmt14", "de-en", split="test")
sentence = dataset[20]['translation']['de']
target   = dataset[20]['translation']['en']

print(target)
# If the street is clear, the pedestrian obtains a green light immediately, if not, there is a delay of around 15 seconds.

# --
# HF model

tokenizer  = AutoTokenizer.from_pretrained("google/bert2bert_L-24_wmt_de_en", pad_token="<pad>", eos_token="</s>", bos_token="<s>")
model      = AutoModelForSeq2SeqLM.from_pretrained("google/bert2bert_L-24_wmt_de_en")

input_ids  = tokenizer(sentence, return_tensors="pt", add_special_tokens=False).input_ids
output_ids = model.generate(input_ids)[0]

output_str = tokenizer.decode(output_ids, skip_special_tokens=True)

print(output_str)
# the road is free, it takes about 15 seconds if not directly for the footganger.
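(One plausible contributor to a gap like this is the decoding configuration: `generate()` falls back to whatever beam size, length penalty, and max length are stored in the model config, which may not match the settings the original TF decoder uses. A toy pure-Python illustration -- hypothetical scores, not the actual model -- of how greedy and beam search can pick different outputs:)

```python
import math

# Toy two-step vocabulary with hypothetical log-probabilities (not the real
# model): the locally best first token ("a") leads to a worse full sequence
# than the alternative ("b").
STEP1 = {"a": math.log(0.6), "b": math.log(0.4)}
STEP2 = {
    "a": {"x": math.log(0.35), "y": math.log(0.25)},
    "b": {"x": math.log(0.9), "y": math.log(0.1)},
}

def greedy_decode():
    # Pick the single best token at each step.
    t1 = max(STEP1, key=STEP1.get)
    t2 = max(STEP2[t1], key=STEP2[t1].get)
    return (t1, t2)

def beam_decode(width=2):
    # Keep the `width` best prefixes, then pick the best completed sequence.
    prefixes = sorted(STEP1.items(), key=lambda kv: kv[1], reverse=True)[:width]
    finished = [((t1, t2), lp1 + lp2)
                for t1, lp1 in prefixes
                for t2, lp2 in STEP2[t1].items()]
    return max(finished, key=lambda kv: kv[1])[0]

print(greedy_decode())  # ('a', 'x') -- total probability 0.6 * 0.35 = 0.21
print(beam_decode())    # ('b', 'x') -- total probability 0.4 * 0.9  = 0.36
```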

# --
# TF model

import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
import tensorflow_text as tf_text

tf.disable_eager_execution()

# Load model
model = hub.Module('https://tfhub.dev/google/bertseq2seq/bert24_de_en/1')

# Setup session
sess = tf.InteractiveSession()
sess.run(tf.tables_initializer())
sess.run(tf.global_variables_initializer())

# Define graph

src       = tf.placeholder(tf.string, shape=[None])
translate = model(src)

# Translate
output_str = sess.run(translate, feed_dict = {
    src : [sentence]
})

print(output_str[0])
# "If the road is clear, there is a green area for the pedestrian, if not it takes about 15 seconds."

I can also share the (custom) scripts I'm using to run inference on the entire dataset and compute BLEU scores. Note that I am using the same BLEU code for both implementations.
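(For context, the BLEU scores above are corpus BLEU: a brevity penalty times the geometric mean of clipped n-gram precisions. A minimal pure-Python sketch -- whitespace tokenization, no smoothing, an illustration only; the actual evaluation should use a standard implementation such as sacrebleu:)

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    # Clipped n-gram match / total counts, accumulated over the corpus.
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches or 0 in totals:
        return 0.0  # unsmoothed BLEU is 0 if any precision is 0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

With no smoothing a single sentence with zero 4-gram matches scores 0, which is one reason corpus-level (rather than averaged sentence-level) BLEU is reported.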

Expected behavior

I would expect the BLEU scores and the quality of the translations to be comparable.

Thanks!

patrickvonplaten commented 3 years ago

Hey @bkj,

Thanks for the very detailed issue. It would be awesome if you could also share your custom scripts here to evaluate on the entire dataset. This does indeed seem like a problem; I'll look into it.

bkj commented 3 years ago

@patrickvonplaten Thanks for the quick response.

Code to run inference w/ the two models can be found here: https://github.com/bkj/hf_bert2bert_debug

By default, it just runs one batch to save time -- you can run on the whole test dataset by setting QUICKRUN = False in each of the files.

BLEU scores on this batch are ~ 23 for HF and ~ 35 for TF.

Let me know what you think! I'm not super familiar w/ transformers, so it's possible I'm making some pre/post-processing mistake -- probably a good idea to double-check my glue code.

patrickvonplaten commented 3 years ago

Hey @bkj,

I'll try to allocate time to solve this problem. I think there is indeed a fundamental difference between the two implementations -- I'll try to investigate. Thanks for your response!

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

patil-suraj commented 3 years ago

Unstale

patrickvonplaten commented 3 years ago

Sorry for the late reply!

The problem is that the original code for these translation models is not published, so debugging isn't really possible. The original GitHub repo can be found here: https://github.com/google-research/google-research/tree/master/bertseq2seq and the pretrained weights here: https://tfhub.dev/google/bertseq2seq/roberta24_bbc/1, in case someone is very motivated to take a deeper look.

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.