mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Silent failure with incorrect reference format in Python API #220

Open mdarcy220 opened 1 year ago

mdarcy220 commented 1 year ago

The docstring for sacrebleu.metrics.BLEU.corpus_score says this:

    :param references: A sequence of reference documents with document being
    defined as a sequence of reference strings. If `None`, cached references
    will be used.

This suggests that for a corpus with N documents and K annotators, references should be a list of N lists of K strings. In practice the function expects the transpose: K lists of N strings, one list per annotator, each aligned with the hypothesis list.
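To make the two layouts concrete, here is a minimal sketch of converting the "N lists of K strings" shape into the "K lists of N strings" shape. The sentences and variable names are made up for illustration, and the final call is left as a comment because it assumes sacrebleu is installed.

```python
# Hypothetical toy corpus: N = 3 documents, K = 2 annotators.
hypotheses = ["the cat sat", "hello world", "good morning"]

# Layout the docstring seems to suggest: one inner list per document.
refs_by_document = [
    ["the cat sat", "a cat was sitting"],
    ["hello world", "hi world"],
    ["good morning", "morning"],
]

# Transpose into one inner list per annotator (K lists of N strings).
refs_by_annotator = [list(stream) for stream in zip(*refs_by_document)]

# Each reference stream should now line up with the hypotheses.
assert all(len(stream) == len(hypotheses) for stream in refs_by_annotator)

# With sacrebleu installed, the transposed list is what would be passed on:
#   from sacrebleu.metrics import BLEU
#   BLEU().corpus_score(hypotheses, refs_by_annotator)
```

Note that `zip(*...)` itself truncates to the shortest inner list, so this transpose is only safe when every document has the same number of references.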

If you do feed N lists of K strings, the function computes BLEU for the first K documents (albeit with some mismatched reference strings) and silently throws away the rest.
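The truncation is what you would see if hypotheses and reference lists were paired with something like Python's `zip`, which stops at the shortest input. That is an assumption about the internals, not something confirmed here; the sketch below only reproduces the symptom in plain Python with placeholder strings.

```python
# Placeholder data: N = 4 documents, K = 2 annotators.
hypotheses = ["h0", "h1", "h2", "h3"]

# Wrong layout: N lists of K strings (one inner list per document).
refs_wrong = [
    ["r0a", "r0b"],
    ["r1a", "r1b"],
    ["r2a", "r2b"],
    ["r3a", "r3b"],
]

# ASSUMPTION: pairing behaves like zip(hypotheses, *references).
paired = list(zip(hypotheses, *refs_wrong))

# Only K = 2 hypotheses survive; "h2" and "h3" are dropped without any error.
# Each surviving hypothesis is grouped with references from other documents.
for hyp, *refs in paired:
    print(hyp, refs)
```

Under that assumption, only the first K hypotheses are scored, and most of the strings they are compared against belong to other documents.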

To prevent such misuse, I think it would be good to raise an exception or warning if the lengths of the inner reference lists don't match the length of the hypothesis list.
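A rough sketch of the kind of check I have in mind follows. `check_reference_shape` is a hypothetical helper name, not an existing sacrebleu function, and whether it should raise or merely warn is open for discussion.

```python
from typing import Sequence


def check_reference_shape(
    hypotheses: Sequence[str],
    references: Sequence[Sequence[str]],
) -> None:
    """Raise ValueError if any reference stream is not aligned with the hypotheses.

    Hypothetical helper for illustration; not part of sacrebleu.
    """
    n_hyps = len(hypotheses)
    for idx, stream in enumerate(references):
        if len(stream) != n_hyps:
            raise ValueError(
                f"Reference stream {idx} has {len(stream)} segments, "
                f"but there are {n_hyps} hypotheses. Expected one list per "
                f"annotator, each as long as the hypothesis list."
            )


# Usage with placeholder data: the transposed (wrong) layout is rejected.
hyps = ["h0", "h1", "h2"]
good_refs = [["r0a", "r1a", "r2a"], ["r0b", "r1b", "r2b"]]
bad_refs = [["r0a", "r0b"], ["r1a", "r1b"], ["r2a", "r2b"]]

check_reference_shape(hyps, good_refs)  # passes silently

try:
    check_reference_shape(hyps, bad_refs)
except ValueError as err:
    print(f"rejected: {err}")
```

One limitation: when N happens to equal K, both layouts have the same shape, so a length check like this cannot tell them apart.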