Hi, I'm trying to replicate the No Modification evaluation result as described in your paper.
I've installed sacrebleu==1.4.14 and adapted the evaluation code as follows:
and then running:
eval_bleu_moses(ref_file='data/labelled/test.for', sys_file='data/labelled/test.inf')
However, I'm noticing a discrepancy in the BLEU score: the paper reports 35.32, while my run produces 32.43 (65.3/42.0/28.7/20.3, BP = 0.912, ratio = 0.916, hyp_len = 5398, ref_len = 5894).
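As a sanity check on the aggregation itself (the standard BLEU formula that sacrebleu implements, not anything specific to your code), the printed breakdown recombines by hand: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions, with BP = exp(1 - ref_len/hyp_len) when the hypothesis is shorter than the reference.

```python
# Recombine the sacrebleu breakdown from my run by hand (standard BLEU
# formula; the numbers below are the ones printed above).
import math

precisions = [65.3, 42.0, 28.7, 20.3]  # n-gram precisions, in percent
hyp_len, ref_len = 5398, 5894

# Brevity penalty: exp(1 - ref/hyp) when the hypothesis is shorter.
bp = math.exp(1 - ref_len / hyp_len) if hyp_len < ref_len else 1.0

# BLEU = BP * geometric mean of the n-gram precisions.
geo_mean = math.exp(sum(math.log(p / 100) for p in precisions) / len(precisions))
bleu = 100 * bp * geo_mean

print(f"BP = {bp:.3f}, BLEU = {bleu:.2f}")  # BP = 0.912, BLEU = 32.43
```

Since this reproduces the 32.43 exactly, the gap to 35.32 must come from the precisions and length ratio themselves (e.g. tokenization, the reference files, or a sacrebleu version/signature difference), not from how the score is aggregated.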
Could you confirm whether there is a specific reason for this discrepancy, or whether there is something I might be missing? Any advice or guidance would be greatly appreciated.
Thank you.