mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons

BLEU and CHRF report wrong scores when any hypothesis is empty #239

Open SantiagoEG opened 12 months ago

SantiagoEG commented 12 months ago

Hello,

Thank you for your work on this library. I am experiencing a problem computing BLEU and CHRF when some hypotheses are empty strings. The code to reproduce the problem is the following:

```python
import sacrebleu as s

print("Version:", s.__version__)

bleu = s.BLEU()
chrf = s.CHRF()

hypothesis_1 = ['A B C', 'B C D', 'C D E']
hypothesis_2 = ['', 'B C D', 'C D E']
hypothesis_3 = ['A B C', '', 'C D E']
hypothesis_4 = ['A B C', '', '']

refs = [['A B C'], ['B C D'], ['C D E']]

print()
print("hypothesis_1 CHRF:", chrf.corpus_score(hypothesis_1, refs).score)
print("hypothesis_1 BLEU:", chrf.corpus_score(hypothesis_1, refs).score)
print()
print("hypothesis_2 CHRF:", chrf.corpus_score(hypothesis_2, refs).score)
print("hypothesis_2 BLEU:", chrf.corpus_score(hypothesis_2, refs).score)
print()
print("hypothesis_3 CHRF:", chrf.corpus_score(hypothesis_3, refs).score)
print("hypothesis_3 BLEU:", chrf.corpus_score(hypothesis_3, refs).score)
print()
print("hypothesis_4 CHRF:", chrf.corpus_score(hypothesis_4, refs).score)
print("hypothesis_4 BLEU:", chrf.corpus_score(hypothesis_4, refs).score)
```

This code produces the following outputs:

```
Version: 2.3.1

hypothesis_1 CHRF: 100.0
hypothesis_1 BLEU: 100.0

hypothesis_2 CHRF: 0.0
hypothesis_2 BLEU: 0.0

hypothesis_3 CHRF: 100.0
hypothesis_3 BLEU: 100.0

hypothesis_4 CHRF: 100.0
hypothesis_4 BLEU: 100.0
```

I have not seen this problem with TER. Would you recommend computing the metrics at the sentence level and taking the mean instead? (Roughly what I have in mind is sketched below.)
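For clarity, this is a minimal sketch of the sentence-level averaging I am asking about. The per-segment pairing of hypotheses and references here is my own assumption and is not taken from the corpus example above:

```python
# Minimal sketch (my assumption): one reference string per segment, paired by position.
import sacrebleu as s

chrf = s.CHRF()
bleu = s.BLEU(effective_order=True)  # effective_order is recommended for sentence-level BLEU

hyps = ['A B C', '', 'C D E']
refs_per_segment = ['A B C', 'B C D', 'C D E']

# Score each segment on its own, then average the segment scores.
chrf_scores = [chrf.sentence_score(h, [r]).score for h, r in zip(hyps, refs_per_segment)]
bleu_scores = [bleu.sentence_score(h, [r]).score for h, r in zip(hyps, refs_per_segment)]

print("mean sentence chrF:", sum(chrf_scores) / len(chrf_scores))
print("mean sentence BLEU:", sum(bleu_scores) / len(bleu_scores))
```

I understand the mean of sentence-level scores is not the same quantity as the corpus-level score, which is part of why I am asking whether this is the recommended workaround.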

Best