I think the script evaluate_wmt.py was never really tested. Note that t5-base is not a fine-tuned model, just a pretrained one, so you would definitely get better results by fine-tuning the model on translation. I'm not 100% sure, but I think the T5 paper shows the "non-finetuned" results for translation somewhere as well. Pinging @sshleifer - this might be interesting for you as well.
What Patrick said is exactly correct. I actually don't understand which checkpoints map to which table entries.
@craffel, is 22.15 a reasonable zero-shot BLEU for t5-base on en-de?
I am looking at Appendix E, the third-to-rightmost column on the last page of the arXiv version, but am not sure which row corresponds to t5-base without finetuning. A machine-readable version of that table would also be super helpful if it is easy to find.
For future reference/readers, evaluate_wmt.py has moved to examples/seq2seq/run_eval.py, and the new command (for en-ro) is:
export DATA_DIR=wmt_en_ro
python run_eval.py t5-base \
$DATA_DIR/val.source t5_val_generations.txt \
--reference_path $DATA_DIR/val.target \
--score_path enro_bleu.json \
--task translation_en_to_ro \
--device cuda \
--fp16 \
--bs 32
# optionally add --n_obs 100 to limit the number of examples
You would need to update the first few args for your paths.
I had some reasonable results finetuning mbart on WMT en-ro: BLEU 24 after finetuning vs. 27 for mbart-large-en-ro (both numbers before preprocessing). I would be very interested in seeing results/bug fixes for finetuning t5 on any language pair!
Hey all,
@anonymous1100
Is it necessary to fine-tune the t5 model to reproduce the results of the paper?
Yes. The pre-trained checkpoints are trained on a multi-task mixture and need further fine-tuning to achieve maximal performance. See paragraph "Multi-Task Pre-training" in Section 3.7:
... In Section 3.5.3, we showed that pre-training on a multi-task mixture of unsupervised and supervised tasks before fine-tuning worked as well as pre-training on the unsupervised task alone. This is the approach advocated by the “MT-DNN” [Liu et al., 2015, 2019b]. It also has the practical benefit of being able to monitor “downstream” performance for the entire duration of training, rather than just during fine-tuning. We therefore used multi-task pre-training in our final set of experiments.
@patrickvonplaten
I'm not 100% sure, but I think the T5 paper shows the "non-finetuned" results for translation somewhere as well.
No, we never reported those numbers, but they are trivial to get by running eval on the pre-trained checkpoints, e.g.
gsutil -m cp -r gs://t5-data/pretrained_models/base/* "${MODEL_DIR}"
t5_mesh_transformer \
--tpu="${TPU_NAME}" \
--gcp_project="${PROJECT}" \
--tpu_zone="${ZONE}" \
--model_dir="${MODEL_DIR}" \
--gin_file="gs://t5-data/pretrained_models/base/operative_config.gin" \
--gin_file="eval.gin" \
--gin_file="beam_search.gin" \
--gin_param="MIXTURE_NAME = 'wmt_t2t_ende_v003'" \
--gin_param="run.dataset_split = 'test'" \
--gin_param="eval_checkpoint_step = 'all'" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" # or whatever
@sshleifer
I actually don't understand which checkpoints map to which table entries.
The released checkpoints are (multi-task) pre-trained models which, after fine-tuning, produce the numbers in Table 14. We don't report the results before fine-tuning, and we didn't (and won't) release the fine-tuned checkpoints.
is 22.15 a reasonable zero-shot BLEU for t5-base on en-de?
I ran the above command and got 28.664, so that seems very low. Not familiar with the HF eval script but I can take a look if you need ideas for figuring out what went wrong.
I am looking at Appendix E, the third-to-rightmost column on the last page of the arXiv version, but am not sure which row corresponds to t5-base without finetuning.
None of the rows in that table correspond to any of the T5 models. Those numbers are the results of our giant systematic (ablation) study that we did before training any of the T5 models.
A machine readable version of that table would also be super helpful if it is easy to find.
The LaTeX source on arXiv has the tables in a format that should be easy to parse into whatever machine-readable format you need: https://arxiv.org/e-print/1910.10683
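For example, a rough sketch of pulling rows out of the tabular environments with Python (the e-print is a tar archive you would extract first, and "main.tex" is just a guess at the file name):
import re

def latex_table_rows(tex_source):
    # Find every tabular environment and yield its rows as lists of cell strings.
    for body in re.findall(r"\\begin\{tabular\}.*?\\end\{tabular\}", tex_source, flags=re.S):
        for line in body.splitlines():
            if "&" in line:
                yield [cell.strip().rstrip("\\").strip() for cell in line.split("&")]

rows = list(latex_table_rows(open("main.tex").read()))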
I think we figured out what went wrong. The tokenizer is not adding eos_token="</s>" to the source document. It should be, right?
The inputs should definitely have an EOS before they are fed into the model. If it's the convention in Transformers that the tokenizer takes care of that, then yes! In the T5 codebase, the tokenizer itself does not add an EOS; that's handled by the packing and padding code.
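For illustration, a minimal sketch of handling this on the Transformers side when the installed tokenizer does not append EOS itself (the model name and prompt are just examples; newer tokenizer versions add </s> automatically):
from transformers import T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
ids = tok.encode("translate English to German: The house is wonderful.")
# If the tokenizer did not append EOS, add it manually so the encoder
# sees an end-of-sequence marker (</s> has id 1 for T5).
if ids[-1] != tok.eos_token_id:
    ids = ids + [tok.eos_token_id]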
Awesome! Is there a bos token that goes before the sequence (after the prefix?), like <s> in Roberta/GPT2/Bart?
(Is this the packing/padding code? https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py)
Is there a bos token that goes before the sequence (after the prefix?)
Nope.
Is this the packing/padding code? https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/data/preprocessors.py
No, the packing/padding code is not part of the T5 codebase (T5 just provides tokenized/preprocessed sequences); it's assumed to be handled by whatever the model implementation is. Here it is in the Mesh TF codebase: https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/dataset.py
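As a rough illustration of what packing does conceptually (a plain-Python sketch, not the Mesh TF implementation; examples longer than max_length are not handled):
EOS_ID, PAD_ID = 1, 0  # T5's conventional ids for </s> and padding

def pack_examples(tokenized_examples, max_length):
    # Greedily concatenate EOS-terminated examples into rows of at most
    # max_length tokens, then pad the remainder of each row with PAD_ID.
    rows, current = [], []
    for ids in tokenized_examples:
        ids = ids + [EOS_ID]
        if len(current) + len(ids) > max_length:
            rows.append(current + [PAD_ID] * (max_length - len(current)))
            current = []
        current = current + ids
    if current:
        rows.append(current + [PAD_ID] * (max_length - len(current)))
    return rows

# pack_examples([[5, 6], [7, 8, 9], [10]], max_length=8)
# -> [[5, 6, 1, 7, 8, 9, 1, 0], [10, 1, 0, 0, 0, 0, 0, 0]]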
Adding EOS does not appear to help zero-shot performance in my first experiment, but I'm open to hearing others' results. From a fork of this repo, you can run
git fetch upstream
git checkout t5tok
to get a version of the tokenizer that adds EOS.
When I ran eval on wmt_en_ro, I got:
t5tok (with `</s>`): 27.65
master (no EOS): 27.87
The commands to reproduce are in the PR description. Would love to know results on other datasets!
For what it's worth I'm using T5 for other purposes (style transfer) and have found SotA results. It looks like the master branch has diverged, but among other changes I modified seq2seq.utils.encode_file to this:
lns = [prefix + text + " </s>" for text in lns]
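For context, an encode_file-style helper with that change might look roughly like this (a simplified sketch, not the actual seq2seq.utils implementation; argument names are illustrative):
from transformers import T5Tokenizer

def encode_file(tokenizer, data_path, max_length, prefix=""):
    # Read one example per line, prepend the task prefix, and append EOS manually.
    lns = [line.strip() for line in open(data_path)]
    lns = [prefix + text + " </s>" for text in lns]
    return tokenizer.batch_encode_plus(
        lns, max_length=max_length, pad_to_max_length=True, return_tensors="pt"
    )

batch = encode_file(T5Tokenizer.from_pretrained("t5-base"), "val.source", 512,
                    prefix="translate English to Romanian: ")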
Hey @sshleifer , thanks for getting #5866 in. Does this resolve the discrepancy that originally started this issue, i.e. that non-fine-tuned T5-Base gets 28.664 BLEU on WMT EnDe using MTF whereas the HF version got 22.15?
IDK how OP got 22.15; I somehow just got BLEU 34.513 for en-de on what I thought was wmt_en_de 2019 (I can try to rerun on identical data if given a pointer to it).
For en-ro, I was getting 27.85 before the change, 27.65 after. I am using corpus_bleu across the whole test set.
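For reference, corpus-level BLEU over whole files with sacrebleu looks roughly like this (the file paths are placeholders):
import sacrebleu

hyps = [line.strip() for line in open("t5_val_generations.txt")]
refs = [line.strip() for line in open("wmt_en_ro/val.target")]
bleu = sacrebleu.corpus_bleu(hyps, [refs])  # tokenize="intl" comes up later in the thread
print(bleu.score)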
Test data (about 21 minutes on an NVIDIA RTX). Get the data:
cd examples/seq2seq/
mkdir -p gens
wget https://s3.amazonaws.com/datasets.huggingface.co/translation/wmt_en_de.tgz
tar -xzvf wmt_en_de.tgz
export D=${PWD}/wmt_en_de
Eval command:
python run_eval.py t5-base $D/test.source gens/t5_base_ende_test_gens.txt --reference_path $D/test.target --score_path gens/t5_base_ende_test_bleu.json --bs 16 --task translation_en_to_de
Going to leave this open until I am satisfied that I am getting a reasonably close BLEU on reasonably close data.
Hey Sam, the validation set we used was newstest2013. You can get the data here:
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.en
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2013.de
or if you want to get exactly the same preprocessed inputs and outputs, you can use
python -c 'import t5; ds = t5.data.TaskRegistry.get("wmt_t2t_ende_v003").get_dataset({"inputs": 512, "targets": 512}, "validation")'
python -m t5.scripts.dump_task --task=wmt_t2t_ende_v003 --split=validation
I got {"bleu": 23.8653, "n_obs": 3000, "runtime": 319, "seconds_per_sample": 0.1063} on that data (newstest2013.en), with some sacrebleu warnings about my data not being properly detokenized:
WARNING:root:That's 100 lines that end in a tokenized period ('.')
WARNING:root:It looks like you forgot to detokenize your test data, which may hurt your score.
WARNING:root:If you insist your data is detokenized, or don't care, you can suppress this message with '--force'.
I got {'bleu': 24.2978} by passing tokenize='intl' to sacrebleu. I then ran the t5 command to try to check pre/post-processing, but got:
AssertionError: Sizes do not match: 284246 vs 284253 for /home/shleifer/tensorflow_datasets/downloads/extracted/TAR_GZ.data.stat.org_wmt1_tran-task_trai-para-nc-6LWgxBgzCHdv_LtotNmnXjpCH6OhzkF8D3v10aRrznA.tgz/training-parallel-nc-v13/news-commentary-v13.de-en.de vs /home/shleifer/tensorflow_datasets/downloads/extracted/TAR_GZ.data.stat.org_wmt1_tran-task_trai-para-nc-6LWgxBgzCHdv_LtotNmnXjpCH6OhzkF8D3v10aRrznA.tgz/training-parallel-nc-v13/news-commentary-v13.de-en.en.
I think it is building more than just the val set for 1 language.
I got {"bleu": 23.8653, "n_obs": 3000, "runtime": 319, "seconds_per_sample": 0.1063} on that data (newstest2013.en), with some sacrebleu warnings about my data not being properly detokenized.
We have always run the BLEU on the TFDS versions and I don't ever recall seeing that error. Maybe there is something wrong with the text files I linked? I think sacrebleu can also download the appropriate test sets.
Ran the t5 command now to try to check pre/post processing, but got:
That looks like a TFDS error, not sure how that would be happening. Do you want to open an issue in the TFDS repo and tag @adarob?
Yeah, I can file an issue. Do you have an easy way to share your repo's generations? Mine are here: t5_base_newstest2013.de
Do you mean the predictions from T5 when run via the Mesh TF Transformer? Here are the inputs/targets/predictions that got spit out when I ran https://github.com/huggingface/transformers/issues/5543#issuecomment-656901662:
wmttmp_test_eval_wmt_t2t_ende_v003_targets.txt
wmttmp_test_eval_wmt_t2t_ende_v003_inputs.txt
wmttmp_test_eval_wmt_t2t_ende_v003_999900_predictions.txt
Also, I apologize: I got mixed up in the span of time between when this issue started and now. This thread is about a mismatch of performance on the test set, but since the issue was re-opened last week I was thinking we were discussing the validation set. You should use newstest2014; that is the test set used in the paper, mentioned in https://github.com/huggingface/transformers/issues/5543#issue-651544580, and it is what I ran to get the predictions above and the score in https://github.com/huggingface/transformers/issues/5543#issuecomment-656901662. Here are the corresponding text files from Stanford NLP:
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2014.en
https://nlp.stanford.edu/projects/nmt/data/wmt14.en-de/newstest2014.de
On that data:
evaluation: used calculate_bleu with tokenize='intl'. This improves both scores by 0.7 BLEU.
Cool, that is not too far off. Are you using beam search with the same hparams that we used? If so, we could maybe chalk this up to numerical differences. If not, I bet that beam search would explain a 0.7 BLEU difference.
That was the issue -- now equivalent! I got huggingface: 28.51 by adding --max_length=128 --length_penalty=0.6. (Num beams was already correct in the config.)
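In Transformers terms, those flags map onto generation kwargs roughly like this (a sketch; the input sentence is just an example and num_beams=4 reflects the value already set in the checkpoint's task config):
from transformers import T5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

inputs = tok.encode("translate English to German: The house is wonderful.", return_tensors="pt")
# --max_length=128 and --length_penalty=0.6 are the flags that closed the gap above.
outputs = model.generate(inputs, num_beams=4, max_length=128, length_penalty=0.6, early_stopping=True)
print(tok.decode(outputs[0], skip_special_tokens=True))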
Semi-interestingly, you can get 28.7 by adding "translate English to German: translate English to German:" (twice) to every source example (I was doing this by accident).
Awesome, great sleuthing! Good to hear that there is no disparity here.
I downloaded the "newstest2014.en" and "newstest2014.de" datasets, then used examples/translation/t5/evaluate_wmt.py to evaluate the en-to-de BLEU. The final BLEU was 22.15, which is much lower than in the paper. I used the t5-base model and my transformers version is 2.11.0. Is there something wrong with my setup? Is it necessary to fine-tune the t5 model to reproduce the results of the paper?