mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

How do I use sacreBLEU with the Syriac language #182

Closed mt-empty closed 2 years ago

mt-empty commented 2 years ago

I'm building my own English-to-Syriac translation model using the Hugging Face libraries and a SentencePiece tokenizer, and I'd like to use sacreBLEU as my evaluation metric.

So I tried this:

import datasets

metric = datasets.load_metric("sacrebleu")
predictions = ['ܒܸܬ ܦܵܝܫܵܐ ܩܒܸܠܬܵܐ']
references = [["ܒܸܬ ܦܵܝܫܵܐ ܩܒܸܠܬܵܐ"]]
metric.compute(predictions=predictions, references=references)

And it results in:

{'bp': 1.0,
 'counts': [3, 2, 1, 0],
 'precisions': [100.0, 100.0, 100.0, 0.0],
 'ref_len': 3,
 'score': 0.0,
 'sys_len': 3,
 'totals': [3, 2, 1, 0]}

I'm guessing that, just like BLEU, sacreBLEU is language independent, but I think it's filtering out the Syriac characters, which is why I'm getting a score of 0. I couldn't find a way to disable that filtering. Also, regarding the --language-pair option: Syriac doesn't have an ISO 639-1 code, only an ISO 639-2 code. Does that mean languages with only an ISO 639-2 code aren't supported at all?

Any solutions or alternatives?

Thank you

martinpopel commented 2 years ago

SacreBLEU does not filter out Syriac characters, but the score is 0 by definition if there is no matching 4-gram. In your example, the whole test set consists of a single sentence with only three words, so it is impossible to get a non-zero BLEU with any translation (prediction). BLEU was designed as a corpus-level metric, expecting hundreds or thousands of sentences (of usual lengths, i.e. > 4 words) in the test set. If you need a sentence-level metric, try e.g. chrF (also implemented in sacrebleu and also wrapped in Hugging Face). If you really need sentence-level BLEU, you must configure smooth_value (and smooth_method), but if there is no matching 4-gram, most smoothing methods still result in a zero score.
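To make that suggestion concrete, here is a minimal sketch using the sacrebleu library directly, assuming a recent 2.x release where the metric classes live in sacrebleu.metrics; the same smooth_method / smooth_value keyword arguments can typically also be passed through metric.compute in the datasets wrapper.

from sacrebleu.metrics import BLEU, CHRF

hypothesis = 'ܒܸܬ ܦܵܝܫܵܐ ܩܒܸܠܬܵܐ'
references = ['ܒܸܬ ܦܵܝܫܵܐ ܩܒܸܠܬܵܐ']

# chrF is character-based, so it stays informative on a single short sentence.
chrf = CHRF()
print(chrf.sentence_score(hypothesis, references))

# Sentence-level BLEU needs smoothing; "floor" adds smooth_value to zero n-gram
# counts, and effective_order drops orders that have no n-grams at all
# (e.g. 4-grams for a 3-word sentence).
bleu = BLEU(smooth_method='floor', smooth_value=0.1, effective_order=True)
print(bleu.sentence_score(hypothesis, references))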

mt-empty commented 2 years ago

Thank you, I did not read the whole paper.

"The maximum n-gram length is virtually always set to four, and since BLEU is corpus level, it is rare that there are any zero counts."
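For reference, that is exactly why the example above scores 0: BLEU is the brevity penalty times the geometric mean of the n-gram precisions up to order four, and a 3-word sentence has no 4-grams, so the 4-gram precision is 0 and the geometric mean collapses. A quick illustration with the numbers reported in the output above:

import math

bp = 1.0
precisions = [100.0, 100.0, 100.0, 0.0]  # taken from the metric output above

# Any zero precision zeroes the geometric mean, and with it the whole score.
if any(p == 0.0 for p in precisions):
    score = 0.0
else:
    score = bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))
print(score)  # 0.0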