mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

TER between two empty strings is 100 #228

Closed BramVanroy closed 1 year ago

BramVanroy commented 1 year ago

I am not sure whether this is expected behavior, but I just found out that calculating TER between two empty strings yields a TER score of 100. I would expect 0.

The use case is subtitle correction. If the subtitle system does not generate a subtitle (an empty string), and some post-editors correct that by adding a subtitle while others do not, I would expect the edited versions to have a higher TER (insertions), whereas those that left the empty string unchanged would yield 0.

Is this expected behavior in sacrebleu/the original TERCOM implementation? If so, what is the reasoning behind it? Perhaps that "typically" a reference must always be a non-empty string? I understand that the final TER score is normalized by the number of reference tokens, but it is not intuitive that the score is 100 as a result of dividing by 0.

from sacrebleu import TER

ter_metric = TER() 

ref = ""
hyp = ""
score = ter_metric.sentence_score(hypothesis=hyp, references=[ref]).score

print(score)
# 100
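
For illustration, here is a toy sketch of the normalization in question (my own simplification, not sacrebleu's or TERCOM's actual implementation): the edit count is divided by the number of reference tokens, so an empty reference makes the denominator zero and the empty/empty case becomes a 0/0 that has to be settled by convention.

# Toy sketch only -- a plain token-level edit distance, ignoring the
# shift operations of real TER. It exists just to show where dividing
# by the reference length breaks down for empty references.
def toy_ter(hypothesis: str, reference: str) -> float:
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    # Standard Levenshtein distance over tokens.
    dist = [[0] * (len(ref_tokens) + 1) for _ in range(len(hyp_tokens) + 1)]
    for i in range(len(hyp_tokens) + 1):
        dist[i][0] = i
    for j in range(len(ref_tokens) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp_tokens) + 1):
        for j in range(1, len(ref_tokens) + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    edits = dist[-1][-1]
    if not ref_tokens:
        # 0/0 when both sides are empty: sacrebleu currently reports 100;
        # returning 0 here is the behavior this issue asks about.
        return 0.0 if edits == 0 else 100.0
    return 100.0 * edits / len(ref_tokens)

print(toy_ter("", ""))               # 0.0 under the convention expected here
print(toy_ter("some subtitle", ""))  # 100.0: edits against an empty reference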
ozancaglayan commented 1 year ago

Any ideas, @ales-t?

ales-t commented 1 year ago

Good question. I've done a quick check of tercom outputs:

Since we're not consistent with tercom here anyway, maybe it would be worthwhile to change sacrebleu's behavior so that when the hypothesis matches the reference (including the empty/empty case), we also return TER=0? (Seems like a good choice to me, at least at first glance.)
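
Until such a change lands, a guard around the public sentence_score call would already give that behavior on the user's side; the wrapper below is only a sketch of the idea, not sacrebleu code.

from sacrebleu import TER

ter_metric = TER()

def sentence_ter(hypothesis: str, reference: str) -> float:
    # Hypothetical wrapper: treat the empty-vs-empty case as a perfect
    # match (TER = 0) instead of the 100 that sacrebleu currently returns.
    if not hypothesis.split() and not reference.split():
        return 0.0
    return ter_metric.sentence_score(hypothesis=hypothesis,
                                     references=[reference]).score

print(sentence_ter("", ""))  # 0.0 instead of 100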

BramVanroy commented 1 year ago

Thanks @ales-t, that looks like a good solution indeed!