I think the script evaluate_wmt.py was never really tested. Note that t5-base is not a fine-tuned model, just a pretrained one, so you would definitely get better results by fine-tuning the model on translation. I'm not 100% sure, but I think the T5 paper shows the "non-finetuned" results for translation somewhere as well. Pinging @sshleifer - this might be interesting for you as well.
What Patrick said is exactly correct. I actually don't understand which checkpoints map to which table entries.
@craffel, is 22.15 a reasonable zero-shot BLEU for t5-base on en-de?
I am looking at Appendix E, the third-to-rightmost column on the last page of the arXiv version, but am not sure which row corresponds to t5-base without finetuning. A machine-readable version of that table would also be super helpful if it is easy to find.
For future reference/readers, evaluate_wmt.py has moved to examples/seq2seq/run_eval.py, and the new command (for en-ro) is:
export DATA_DIR=wmt_en_ro
python run_eval.py t5-base \
$DATA_DIR/val.source t5_val_generations.txt \
--reference_path $DATA_DIR/val.target \
--score_path enro_bleu.json \
--task translation_en_to_ro \
--device cuda \
--fp16 \
--bs 32
# optionally add --n_obs 100 to limit the number of examples
You would need to update the first few args for your paths.
I had some reasonable results finetuning mbart on WMT en-ro: BLEU 24 after finetuning vs. 27 for mbart-large-en-ro (both numbers before preprocessing). I would be very interested in seeing results/bug fixes for finetuning t5 on any language pair!
Hey all,
@anonymous1100
Is it necessary to fine-tune the t5 model to reproduce the results of the paper?
Yes. The pre-trained checkpoints are trained on a multi-task mixture and need further fine-tuning to achieve maximal performance. See paragraph "Multi-Task Pre-training" in Section 3.7:
... In Section 3.5.3, we showed that pre-training on a multi-task mixture of unsupervised and supervised tasks before fine-tuning worked as well as pre-training on the unsupervised task alone. This is the approach advocated by the “MT-DNN” [Liu et al., 2015, 2019b]. It also has the practical benefit of being able to monitor “downstream” performance for the entire duration of training, rather than just during fine-tuning. We therefore used multi-task pre-training in our final set of experiments.
@patrickvonplaten
I'm not 100% sure, but I think the T5 paper shows the "non-finetuned" results for translation somewhere as well.
No, we never reported those numbers, but they are trivial to get by running eval on the pre-trained checkpoints, e.g.
gsutil -m cp -r gs://t5-data/pretrained_models/base/* "${MODEL_DIR}"
t5_mesh_transformer \
--tpu="${TPU_NAME}" \
--gcp_project="${PROJECT}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
--gin_file="eval.gin" \
--gin_file="beam_search.gin" \
--gin_param="MIXTURE_NAME = 'wmt_t2t_ende_v003'" \
--gin_param="run.dataset_split = 'test'" \
--gin_param="eval_checkpoint_step = 'all'" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" # or whatever
@sshleifer
I actually don't understand which checkpoints map to which table entries.
The released checkpoints are (multi-task) pre-trained models which, after fine-tuning, produce the numbers in Table 14. We don't report the results before fine-tuning, and we didn't (and won't) release the fine-tuned checkpoints.
is 22.15 a reasonable zero-shot BLEU for t5-base on en-de?
I ran the above command and got 28.664, so that seems very low. Not familiar with the HF eval script but I can take a look if you need ideas for figuring out what went wrong.
I am looking at Appendix E, the third-to-rightmost column on the last page of the arXiv version, but am not sure which row corresponds to t5-base without finetuning.
None of the rows in that table correspond to any of the T5 models. Those numbers are the results of our giant systematic (ablation) study that we did before training any of the T5 models.
A machine readable version of that table would also be super helpful if it is easy to find.
The LaTeX source on arXiv has the tables in a format that should be easy to parse into whatever machine-readable format you need: https://arxiv.org/e-print/1910.10683
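For example, a rough sketch of pulling rows out of the tabular environments with Python (the e-print is a tar archive you would extract first, and "main.tex" is just a guess at the file name):
import re

def latex_table_rows(tex_source):
    # Find every tabular environment and yield its rows as lists of cell strings.
    for body in re.findall(r"\\begin\{tabular\}.*?\\end\{tabular\}", tex_source, flags=re.S):
        for line in body.splitlines():
            if "&" in line:
                yield [cell.strip().rstrip("\\").strip() for cell in line.split("&")]

rows = list(latex_table_rows(open("main.tex").read()))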
I think we figured out what went wrong. The tokenizer is not adding eos_token="</s>" to the source document. It should be, right?
The inputs should definitely have an EOS before they are fed into the model. If it's the convention in Transformers that the tokenizer takes care of that, then yes! In the T5 codebase, the tokenizer itself does not add an EOS; that's handled by the packing and padding code.
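For illustration, a minimal sketch of handling this on the Transformers side when the installed tokenizer does not append EOS itself (the model name and prompt are just examples; newer tokenizer versions add </s> automatically):
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
ids = tok.encode("translate English to German: The house is wonderful.")
# If the tokenizer did not append EOS, add it manually so the encoder
# sees an end-of-sequence marker (</s> has id 1 for T5).
if ids[-1] != tok.eos_token_id:
    ids = ids + [tok.eos_token_id]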
Awesome! Is there a bos token that goes before the sequence (after the prefix?), like <s> in Roberta/GPT2/Bart?
(Is this the packing/padding code? https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py)
Is there a bos token that goes before the sequence (after the prefix?)
Nope.
Is this the packing/padding code? https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py
No, the packing/padding code is not part of the T5 codebase (T5 just provides tokenized/preprocessed sequences); it's assumed to be handled by whatever the model implementation is. Here it is in the Mesh TF codebase: https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/dataset.py
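As a rough illustration of what packing does conceptually (a plain-Python sketch, not the Mesh TF implementation; examples longer than max_length are not handled):
EOS_ID, PAD_ID = 1, 0  # T5's conventional ids for </s> and padding

def pack_examples(tokenized_examples, max_length):
    # Greedily concatenate EOS-terminated examples into rows of at most
    # max_length tokens, then pad the remainder of each row with PAD_ID.
    rows, current = [], []
    for ids in tokenized_examples:
        ids = ids + [EOS_ID]
        if len(current) + len(ids) > max_length:
            rows.append(current + [PAD_ID] * (max_length - len(current)))
            current = []
        current = current + ids
    if current:
        rows.append(current + [PAD_ID] * (max_length - len(current)))
    return rows

# pack_examples([[5, 6], [7, 8, 9], [10]], max_length=8)
# -> [[5, 6, 1, 7, 8, 9, 1, 0], [10, 1, 0, 0, 0, 0, 0, 0]]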
Adding EOS does not appear to help zero-shot performance in my first experiment, but I'm open to hearing others' results. From a fork of this repo, you can run
git fetch upstream
git checkout t5tok
to get a version of the tokenizer that adds EOS.
When I ran eval on wmt_en_ro, I got:
t5tok (with `</s>`): 27.65
master (no EOS): 27.87
The commands to reproduce are in the PR description. Would love to know results on other datasets!
For what it's worth I'm using T5 for other purposes (style transfer) and have found SotA results. It looks like the master branch has diverged, but among other changes I modified seq2seq.utils.encode_file to this:
lns = [prefix + text + " </s>" for text in lns]
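For context, an encode_file-style helper with that change might look roughly like this (a simplified sketch, not the actual seq2seq.utils implementation; argument names are illustrative):
from transformers import T5Tokenizer

def encode_file(tokenizer, data_path, max_length, prefix=""):
    # Read one example per line, prepend the task prefix, and append EOS manually.
    lns = [line.strip() for line in open(data_path)]
    lns = [prefix + text + " </s>" for text in lns]
    return tokenizer.batch_encode_plus(
        lns, max_length=max_length, pad_to_max_length=True, return_tensors="pt"
    )

batch = encode_file(T5Tokenizer.from_pretrained("t5-base"), "val.source", 512,
                    prefix="translate English to Romanian: ")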
Hey @sshleifer , thanks for getting #5866 in. Does this resolve the discrepancy that originally started this issue, i.e. that non-fine-tuned T5-Base gets 28.664 BLEU on WMT EnDe using MTF whereas the HF version got 22.15?
IDK how OP got 22.15; I somehow just got BLEU 34.513 for en-de on what I thought was wmt_en_de 2019 (I can try to rerun on identical data if given a pointer to it).
For en-ro, I was getting 27.85 before the change, 27.65 after. I am using corpus_bleu across the whole test set.
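For reference, corpus-level BLEU over whole files with sacrebleu looks roughly like this (the file paths are placeholders):
import sacrebleu

hyps = [line.strip() for line in open("t5_val_generations.txt")]
refs = [line.strip() for line in open("wmt_en_ro/val.target")]
bleu = sacrebleu.corpus_bleu(hyps, [refs])  # tokenize="intl" comes up later in the thread
print(bleu.score)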
Test data (about 21 minutes on an NVIDIA RTX). Get the data:
cd examples/seq2seq/
mkdir -p gens
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tgz
export D=${PWD}/wmt_en_de
Eval command:
python run_eval.py t5-base $D/test.source gens/t5_base_ende_test_gens.txt --reference_path $D/test.target --score_path gens/t5_base_ende_test_bleu.json --bs 16 --task translation_en_to_de
Going to leave this open until I am satisfied that I am getting a reasonably close BLEU on reasonably close data.
Hey Sam, the validation set we used was newstest2013. You can get the data here:
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.en
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.de
or if you want to get exactly the same preprocessed inputs and outputs, you can use
python -c 'import t5; ds = t5.data.TaskRegistry.get("wmt_t2t_ende_v003").get_dataset({"inputs": 512, "targets": 512}, "validation")'
python -m t5.scripts.dump_task --task=wmt_t2t_ende_v003 --split=validation
I got {"bleu": 23.8653, "n_obs": 3000, "runtime": 319, "seconds_per_sample": 0.1063} on that data (newstest2013.en), with some sacrebleu warnings about my data not being properly detokenized:
WARNING:root:That's 100 lines that end in a tokenized period ('.')
WARNING:root:It looks like you forgot to detokenize your test data, which may hurt your score.
WARNING:root:If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.
I got {'bleu': 24.2978} by passing tokenize='intl' to sacrebleu. I then ran the t5 command to try to check pre/post-processing, but got:
AssertionError: Sizes do not match: 284246 vs 284253 for /home/shleifer/tensorflow_datasets/downloads/extracted/TAR_GZ.data.stat.org_wmt1_tran-task_trai-para-nc-6LWgxBgzCHdv_LtotNmnXjpCH6OhzkF8D3v10aRrznA.tgz/training-parallel-nc-v13/news-commentary-v13.de-en.de vs /home/shleifer/tensorflow_datasets/downloads/extracted/TAR_GZ.data.stat.org_wmt1_tran-task_trai-para-nc-6LWgxBgzCHdv_LtotNmnXjpCH6OhzkF8D3v10aRrznA.tgz/training-parallel-nc-v13/news-commentary-v13.de-en.en.
I think it is building more than just the val set for 1 language.
I got {"bleu": 23.8653, "n_obs": 3000, "runtime": 319, "seconds_per_sample": 0.1063} on that data (newstest2013.en), with some sacrebleu warnings about my data not being properly detokenized.
We have always run the BLEU on the TFDS versions and I don't ever recall seeing that error. Maybe there is something wrong with the text files I linked? I think sacrebleu can also download the appropriate test sets.
Ran the t5 command now to try to check pre/post processing, but got:
That looks like a TFDS error, not sure how that would be happening. Do you want to open an issue in the TFDS repo and tag @adarob?
Yeah, I can file an issue. Do you have an easy way to share your repo's generations? Mine are here: t5_base_newstest2013.de
Do you mean the predictions from T5 when run via the Mesh TF Transformer? Here are the inputs/targets/predictions that got spit out when I ran https://github.com/huggingface/transformers/issues/5543#issuecomment-656901662:
wmttmp_test_eval_wmt_t2t_ende_v003_targets.txt
wmttmp_test_eval_wmt_t2t_ende_v003_inputs.txt
wmttmp_test_eval_wmt_t2t_ende_v003_999900_predictions.txt
Also, I apologize: I got mixed up in the span of time between when this issue started and now. This thread is about a mismatch of performance on the test set, but since the issue was re-opened last week I was thinking we were discussing the validation set. You should use newstest2014; that is the test set used in the paper, mentioned in https://github.com/huggingface/transformers/issues/5543#issue-651544580, and it is what I ran to get the predictions above and the score in https://github.com/huggingface/transformers/issues/5543#issuecomment-656901662. Here are the corresponding text files from Stanford NLP:
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2014.en
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2014.de
On that data:
evaluation: used calculate_bleu with tokenize='intl'. This improves both scores by 0.7 BLEU.
Cool, that is not too far off. Are you using beam search with the same hparams that we used? If so, we could maybe chalk this up to numerical differences. If not, I bet that beam search would explain a 0.7 BLEU difference.
That was the issue -- now equivalent! I got huggingface: 28.51 by adding --max_length=128 --length_penalty=0.6. (Num beams was already correct in the config.)
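In Transformers terms, those flags map onto generation kwargs roughly like this (a sketch; the input sentence is just an example and num_beams=4 reflects the value already set in the checkpoint's task config):
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tok.encode("translate English to German: The house is wonderful.", return_tensors="pt")
# --max_length=128 and --length_penalty=0.6 are the flags that closed the gap above.
outputs = model.generate(inputs, num_beams=4, max_length=128, length_penalty=0.6, early_stopping=True)
print(tok.decode(outputs[0], skip_special_tokens=True))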
Semi-interestingly, you can get 28.7 by adding "translate English to German: translate English to German:" (twice) to every source example (I was doing this by accident).
Awesome, great sleuthing! Good to hear that there is no disparity here.
I downloaded the "newstest2014.en" and "newstest2014.de" datasets, then used examples/translation/t5/evaluate_wmt.py to evaluate the en-to-de BLEU. The final BLEU was 22.15, which is much lower than in the paper. I used the t5-base model and my transformers version is 2.11.0. Is there something wrong with my setup? Is it necessary to fine-tune the t5 model to reproduce the results of the paper?