mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons

BLEU and CHRF report wrong scores when any hypothesis is empty #239

Open SantiagoEG opened 12 months ago

SantiagoEG commented 12 months ago

Hello,

Thank you for your work on this library. I am experiencing a problem computing BLEU and CHRF when some hypotheses are empty strings. The code to reproduce the problem is the following:

```python
import sacrebleu as s

print("Version:", s.__version__)

bleu = s.BLEU()
chrf = s.CHRF()

hypothesis_1 = ['A B C', 'B C D', 'C D E']
hypothesis_2 = ['', 'B C D', 'C D E']
hypothesis_3 = ['A B C', '', 'C D E']
hypothesis_4 = ['A B C', '', '']

refs = [['A B C'], ['B C D'], ['C D E']]

print()
print("hypothesis_1 CHRF:", chrf.corpus_score(hypothesis_1, refs).score)
print("hypothesis_1 BLEU:", chrf.corpus_score(hypothesis_1, refs).score)
print()
print("hypothesis_2 CHRF:", chrf.corpus_score(hypothesis_2, refs).score)
print("hypothesis_2 BLEU:", chrf.corpus_score(hypothesis_2, refs).score)
print()
print("hypothesis_3 CHRF:", chrf.corpus_score(hypothesis_3, refs).score)
print("hypothesis_3 BLEU:", chrf.corpus_score(hypothesis_3, refs).score)
print()
print("hypothesis_4 CHRF:", chrf.corpus_score(hypothesis_4, refs).score)
print("hypothesis_4 BLEU:", chrf.corpus_score(hypothesis_4, refs).score)
```

This code produces the following outputs:

```
Version: 2.3.1

hypothesis_1 CHRF: 100.0
hypothesis_1 BLEU: 100.0

hypothesis_2 CHRF: 0.0
hypothesis_2 BLEU: 0.0

hypothesis_3 CHRF: 100.0
hypothesis_3 BLEU: 100.0

hypothesis_4 CHRF: 100.0
hypothesis_4 BLEU: 100.0
```

I have not seen this problem with TER. Would you recommend computing the metrics at the sentence level and taking the mean instead? (Roughly what I have in mind is sketched below.)
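For clarity, this is a minimal sketch of the sentence-level averaging I am asking about. The per-segment pairing of hypotheses and references here is my own assumption and is not taken from the corpus example above:

```python
# Minimal sketch (my assumption): one reference string per segment, paired by position.
import sacrebleu as s

chrf = s.CHRF()
bleu = s.BLEU(effective_order=True)  # effective_order is recommended for sentence-level BLEU

hyps = ['A B C', '', 'C D E']
refs_per_segment = ['A B C', 'B C D', 'C D E']

# Score each segment on its own, then average the segment scores.
chrf_scores = [chrf.sentence_score(h, [r]).score for h, r in zip(hyps, refs_per_segment)]
bleu_scores = [bleu.sentence_score(h, [r]).score for h, r in zip(hyps, refs_per_segment)]

print("mean sentence chrF:", sum(chrf_scores) / len(chrf_scores))
print("mean sentence BLEU:", sum(bleu_scores) / len(bleu_scores))
```

I understand the mean of sentence-level scores is not the same quantity as the corpus-level score, which is part of why I am asking whether this is the recommended workaround.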

Best