Reproducing table 2 results

Hello and thank you very much for your contribution to the field and open-sourcing the code.

I am trying to reproduce the table 2 results from the paper using the code specified here. I had to add a value for -max_seq_length since the command wouldn't run otherwise. I also train for 40k steps instead of 20k steps, as is specified in the paper. Otherwise, I am running the exact same command.

The results I obtain are different from the ones shown in Table 2 in the paper. Here's what I obtain

{"cs-en": {"kendall": 0.45062611806797853, "pearson": 0.6431185796263721, "spearman": 0.6229406024891663, "wmt_da_rr_kendall": -1.0, "sys-kendall": 1.0, "sys-pearson": 0.9755128178401778, "sys-spearman": 1.0}, "de-en": {"kendall": 0.4543906669799155, "pearson": 0.6222954800372593, "spearman": 0.6351588328764061, "wmt_da_rr_kendall": null, "sys-kendall": 0.5636363636363636, "sys-pearson": 0.8306197237216054, "sys-spearman": 0.8000000000000002}, "fi-en": {"kendall": 0.5518527983644262, "pearson": 0.7438350752272211, "spearman": 0.7497050145476959, "wmt_da_rr_kendall": null, "sys-kendall": 0.9999999999999999, "sys-pearson": 0.9914245385710291, "sys-spearman": 1.0}, "lv-en": {"kendall": 0.5359953999488883, "pearson": 0.7415974274659147, "spearman": 0.7285339831167466, "wmt_da_rr_kendall": null, "sys-kendall": 0.9444444444444445, "sys-pearson": 0.9648751710079252, "sys-spearman": 0.9833333333333333}, "ru-en": {"kendall": 0.5051750575006388, "pearson": 0.7192891660812555, "spearman": 0.6896893120559332, "wmt_da_rr_kendall": null, "sys-kendall": 0.8333333333333334, "sys-pearson": 0.9496574815839979, "sys-spearman": 0.9166666666666666}, "tr-en": {"kendall": 0.48177868642984917, "pearson": 0.6692174622720356, "spearman": 0.6613286166637742, "wmt_da_rr_kendall": null, "sys-kendall": 0.7333333333333333, "sys-pearson": 0.8750333198839647, "sys-spearman": 0.8545454545454544}, "zh-en": {"kendall": 0.47126245847176074, "pearson": 0.6771181689880665, "spearman": 0.6452188030847402, "wmt_da_rr_kendall": null, "sys-kendall": 0.7, "sys-pearson": 0.8358359469478294, "sys-spearman": 0.8617647058823529}}

Do you have any guidance as to what I might be doing wrong? Could it be that I'm not using the correct initial BLEURT checkpoint?

Thanks a lot

google-research / bleurt

Reproducing table 2 results #33