google-research / task-oriented-dialogue


Difficulty replicating BLEU scores #1

Closed dharakyu closed 3 years ago

dharakyu commented 3 years ago

Hello,

My group is attempting to replicate your experimental results but have been getting significantly different BLEU scores than reported in your paper. Our steps were as follows:

  1. Prepared SGD dataset using prepare_dataset.py
  2. Fine-tuned the T5-small model using the T2G2 train dataset.

When evaluated on the T2G2 test dataset, we recorded a higher BLEU score than the one reported in your paper. To diagnose why we were getting such a high score, we ran the copy experiment described in Table 4 of the paper (computing the BLEU score between the trivial input and the gold standard) with the exact same parameters as described in the T5 repository. Here is the script we used to evaluate on the T2G2 test dataset. We recorded a BLEU score of 23.1 (compared to 18.8 in the paper).
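For reference, here is a minimal sketch of what our copy-experiment evaluation does, using sacrebleu's `corpus_bleu` with its default settings. The file names and the one-sentence-per-line layout are placeholders for illustration, not the actual script:

```python
import sacrebleu

# Trivial "predictions": the inputs themselves, one per line (hypothetical file names).
with open("t2g2_test_inputs.txt") as f:
    copies = [line.strip() for line in f]

# Gold reference utterances, aligned line by line with the inputs.
with open("t2g2_test_references.txt") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
score = sacrebleu.corpus_bleu(copies, [references])
print(score.score)
```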

We are wondering if you have an idea why there would be a discrepancy between the numbers we are getting and what you reported. Is there a particular way you are formatting the data that could account for the difference? I have been examining the T5 codebase, but so far I have been unable to find anything in the implementation significant enough to account for this delta.

Thank you and looking forward to hearing your thoughts.

Best, Dhara

mihirkale815 commented 3 years ago

Hi Dhara! Apologies for the late response. Could you share a sample of the data that was generated using prepare_dataset.py ?

dharakyu commented 3 years ago

Hi Mihir,

Thanks for your response! I've attached a sample of the training data. We reformatted the data for our model pipeline, so that the SYSTEM entry was the input and "utterance" was the output. Our BLEU scores were about 3 points higher than you reported, so we were wondering whether the difference could have arisen from how the BLEU scores are actually calculated?
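For illustration, a minimal sketch of that reformatting step, assuming each prepared example is a JSON-lines record with "SYSTEM" and "utterance" fields. The field names and layout are assumptions about prepare_dataset.py's output, not its documented format:

```python
import json

pairs = []
with open("t2g2_10_shot.txt") as f:
    for line in f:
        example = json.loads(line)
        # SYSTEM entry becomes the model input, "utterance" the target.
        pairs.append((example["SYSTEM"], example["utterance"]))

# Write out tab-separated (input, target) pairs for our training pipeline.
with open("train.tsv", "w") as f:
    for source, target in pairs:
        f.write(f"{source}\t{target}\n")
```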

Thanks, Dhara

t2g2_10_shot.txt

mihirkale815 commented 3 years ago

Thanks for sharing the data, Dhara! The data itself looks okay to me. We reported the BLEU scores generated by the T5 framework itself when an experiment is run, as specified in t5_tasks.py. The code for the same can be found here. The script you are using also calls corpus_bleu from sacrebleu, but it looks like T5 passes some specific flags that your script does not. Maybe that accounts for the difference?
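For illustration, a minimal sketch of how the sacrebleu flags can shift the score. The "T5-style" flag values below are assumptions based on common T5 evaluation settings (notably the international tokenizer), so check the T5 metrics code linked above for the authoritative call:

```python
import sacrebleu

# Tiny toy example: the tokenizer choice alone changes how "7pm." is split.
hypotheses = ["the system booked a table for 2 at 7 pm ."]
references = [["The system booked a table for two at 7pm."]]

# Default sacrebleu settings.
default_score = sacrebleu.corpus_bleu(hypotheses, references)

# Settings resembling what the T5 framework passes (assumed values).
t5_style_score = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    smooth_method="exp",
    lowercase=False,
    tokenize="intl",            # international tokenizer instead of the default
    use_effective_order=False,
)

print(default_score.score, t5_style_score.score)
```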

dharakyu commented 3 years ago

That was it. Thanks Mihir!