haryoa / stif-indonesia

Implementation of "Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation".

Discrepancy in BLEU Score During No Modification Evaluation #19

Open lan666as opened 1 year ago

lan666as commented 1 year ago

Hi, I'm trying to replicate the No Modification evaluation result described in your paper, i.e. scoring the informal source sentences directly against the formal references.

I've installed sacrebleu==1.4.14 and adapted the evaluation code as follows:

import os
import subprocess

import sacrebleu

# MOSES_DETOKENIZER is assumed to be defined elsewhere in the repository as the
# path to the Moses detokenizer.perl script.

def eval_bleu_moses(ref_file: str, sys_file: str, evaluation_dir: str = "eval"):
    os.makedirs(evaluation_dir, exist_ok=True)
    # Detokenize the references and system outputs with the Moses detokenizer.
    subprocess.run(f"cat {ref_file} | {MOSES_DETOKENIZER} -l en > {evaluation_dir}/ref.txt", shell=True)
    subprocess.run(f"cat {sys_file} | {MOSES_DETOKENIZER} -l en > {evaluation_dir}/sys.txt", shell=True)
    with open(f"{evaluation_dir}/ref.txt", "r") as file:
        refs = [file.read().split("\n")]  # list of reference streams, as sacrebleu expects
    with open(f"{evaluation_dir}/sys.txt", "r") as file:
        sys = file.read().split("\n")
    bleu = sacrebleu.corpus_bleu(sys, refs)
    return bleu

and then running

eval_bleu_moses(ref_file='data/labelled/test.for', sys_file='data/labelled/test.inf')

However, I'm noticing a discrepancy in the BLEU score: the paper reports 35.32, but my implementation produces 32.43 (65.3/42.0/28.7/20.3 (BP = 0.912 ratio = 0.916 hyp_len = 5398 ref_len = 5894)).
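
One variable I'm not sure about is sacrebleu's tokenizer setting, since the paper doesn't say which one was used. Here is a small diagnostic sketch of my own (not from the repo) that reuses the detokenized files written by eval_bleu_moses above and compares a few of the tokenizers available in sacrebleu 1.4.x:

import sacrebleu

# Diagnostic sketch; assumes eval/ref.txt and eval/sys.txt were already written
# by eval_bleu_moses above. sacrebleu 1.4.x defaults to the '13a' tokenizer;
# 'intl' and 'none' can yield noticeably different scores on the same files.
with open("eval/ref.txt") as f:
    refs = [f.read().split("\n")]
with open("eval/sys.txt") as f:
    hyps = f.read().split("\n")

for tok in ("13a", "intl", "none"):
    bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok)
    print(f"{tok}: {bleu.score:.2f}")

If the paper's score was computed with a different tokenizer, or on tokenized rather than detokenized text, that alone might account for a gap of this size.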

Could you confirm whether there is a specific reason for this discrepancy, or whether there is something I might be missing? Any advice or guidance would be greatly appreciated.

Thank you.