Silent failure with incorrect reference format in Python API

The docstring for sacrebleu.metrics.BLEU.corpus_score says this:

    :param references: A sequence of reference documents with document being
    defined as a sequence of reference strings. If `None`, cached references
    will be used.

This suggests that for a corpus with N documents and K annotators, `references` should be a list of N lists of K strings. In reality, the function expects the transpose of that: K lists of N strings, one list per annotator.
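To make the two layouts concrete, here is a small sketch with N = 3 documents and K = 2 annotators. The sentences and variable names are made up for illustration; only the shapes matter. It uses plain `zip(*...)` to turn the per-document layout into the per-annotator layout.

```python
# Per-document layout (what the docstring seems to describe):
# N = 3 lists, each holding K = 2 reference strings.
refs_by_document = [
    ["the cat sat", "a cat was sitting"],
    ["hello world", "hi world"],
    ["good morning", "morning"],
]

# Per-annotator layout (what the function expects, per this report):
# K = 2 lists, each holding N = 3 reference strings.
refs_by_annotator = [list(stream) for stream in zip(*refs_by_document)]

# Outer length is K, inner length is N.
print(len(refs_by_annotator), len(refs_by_annotator[0]))
```

`refs_by_annotator` has the shape that should be passed as `references`.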

If you do pass N lists of K strings, the function computes BLEU for only the first K documents (with mismatched reference strings at that) and silently discards the rest.

To prevent such misuse, I think it would be good to raise an exception or warning if the lengths of the inner reference lists don't match the length of the hypothesis list.
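A minimal sketch of the kind of check I have in mind. `check_reference_shape` is a hypothetical helper written for this issue, not an existing sacrebleu function, and where it would be called inside the library is left open.

```python
from typing import Sequence


def check_reference_shape(hypotheses: Sequence[str],
                          references: Sequence[Sequence[str]]) -> None:
    """Hypothetical helper: raise ValueError if any inner reference list
    has a different length than the hypothesis list."""
    for i, stream in enumerate(references):
        if len(stream) != len(hypotheses):
            raise ValueError(
                f"reference list {i} has {len(stream)} segments, "
                f"but there are {len(hypotheses)} hypotheses"
            )
```

Note that a check like this cannot catch the mistake when N happens to equal K, since both layouts then have the same shape.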

mjpost / sacrebleu, issue #220