mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

Accelerating sentence scores #254

Closed AmitMY closed 8 months ago

AmitMY commented 8 months ago

I am working on visualizing the distribution of BLEU and chrF for my data. Specifically, I have a training set of size 8000 sentences, and a test set of size 2000 sentences, all from the same domain. I would like to compute all of the BLEU and chrF scores between the training and test sentences, and plot a distribution.

Right now, I do:

[[chrf.sentence_score(h, [r]) for r in references] for h in hypotheses]

But I am wondering if there might be a way to accelerate the inner loop, [chrf.sentence_score(h, [r]) for r in references]. It feels like this should be possible, since _extract_corpus_statistics will be called over and over on the same sentences.
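For context, here is a self-contained version of what I currently run, plus a variant that simply spreads the same pairwise loop over worker processes (not an algorithmic speedup; the short hypotheses/references lists are placeholders for my actual 2000/8000-sentence sets):

import itertools
from multiprocessing import Pool

from sacrebleu.metrics import CHRF

chrf = CHRF()

# Placeholders for the real test set (hypotheses) and training set (references).
hypotheses = ["the cat sat on the mat", "a dog barked loudly"]
references = ["the cat is on the mat", "the dog barked", "birds can fly"]

def chrf_for_pair(pair):
    # Each (hypothesis, reference) pair is scored independently of all others.
    h, r = pair
    return chrf.sentence_score(h, [r]).score

if __name__ == "__main__":
    # What I run now: a plain nested loop over the Cartesian product.
    scores = [[chrf.sentence_score(h, [r]).score for r in references] for h in hypotheses]

    # The same computation distributed over a process pool.
    with Pool() as pool:
        flat = pool.map(chrf_for_pair, itertools.product(hypotheses, references))
    parallel_scores = [flat[i:i + len(references)] for i in range(0, len(flat), len(references))]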

jvamvas commented 8 months ago

@AmitMY I wrote something similar last week that could fit your use case: https://github.com/jvamvas/fastChrF

AmitMY commented 8 months ago

Thanks @jvamvas! Do you happen to know of a similar solution for BLEU?

jvamvas commented 8 months ago

@AmitMY Yes, there seems to be one: https://github.com/Danial-Alh/fast-bleu by @Danial-Alh

import numpy as np
from fast_bleu import BLEU

hyps = ["The cat sat on the mat .", "The cat sat on the hat ."]
refs = ["The cat sat on the mat .", "The fat cat sat on the mat .", "A cat sat on a mat ."]

# fast-bleu works on pre-tokenized sentences (lists of tokens).
hyps = [hyp.split() for hyp in hyps]
refs = [ref.split() for ref in refs]

# One BLEU object per reference; [4] selects the default 4-gram weights.
scores = [BLEU([ref]).get_score(hyps)[4] for ref in refs]

# Transpose so that rows correspond to hypotheses and columns to references.
np.array(scores).T
# array([[1.        , 0.72895452, 0.20556681],
#        [0.64345888, 0.39442436, 0.17567205]])

I did not test it in detail, though.

martinpopel commented 8 months ago

Just note that BLEU was designed as a document-level metric, and applying it to single sentences results in scores which do not correlate with translation quality, even when smoothing is used. chrF is a better choice in this respect.
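For reference, both sentence-level scores are available through the standard sacreBLEU API; a minimal illustration (sentence_bleu applies smoothing by default), shown only to demonstrate the calls, not to endorse sentence-level BLEU:

import sacrebleu

hyp = "the cat sat on the mat"
ref = "the cat is on the mat"

# Sentence-level chrF and (smoothed) sentence-level BLEU for a single pair.
chrf_score = sacrebleu.sentence_chrf(hyp, [ref])
bleu_score = sacrebleu.sentence_bleu(hyp, [ref])

print(chrf_score.score, bleu_score.score)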

SacreBLEU includes an optimization for significance tests that uses pre-computed statistics, but it still assumes that the hypotheses and the reference(s) are sentence-aligned. The SacreBLEU API does not provide any method that would speed up computing sentence-level scores for the Cartesian product of two sets of sentences (the train set and the test set), and I am not sure what the purpose would be.

If we are not interested in the full distribution, but only in its rightmost part (the highest BLEU scores), a large speedup could be achieved by comparing only sentence pairs whose unigram overlap is higher than a given threshold; this overlap could be computed quickly, e.g. with Bloom filters or another technique for near-duplicate detection.
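To make the last idea more concrete, here is a rough sketch of such a prefilter. It is not part of sacreBLEU; it uses exact unigram sets rather than Bloom filters, and the 0.3 threshold is arbitrary:

from sacrebleu.metrics import CHRF

chrf = CHRF()

def unigram_overlap(a_tokens, b_tokens):
    # Jaccard overlap of the two unigram sets.
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def right_tail_scores(hypotheses, references, threshold=0.3):
    # Run the expensive metric only on pairs whose cheap unigram overlap passes the threshold.
    results = []
    for h in hypotheses:
        h_tok = h.split()
        for r in references:
            if unigram_overlap(h_tok, r.split()) < threshold:
                continue  # unlikely to end up in the right tail of the distribution
            results.append((h, r, chrf.sentence_score(h, [r]).score))
    return results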