mjpost / sacrebleu

Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
Apache License 2.0

TER between two empty strings is 100 #228

Closed BramVanroy closed 1 year ago

BramVanroy commented 1 year ago

I am not sure whether this is expected behavior, but I just found out that calculating TER between two empty strings yields a TER score of 100. I would expect 0.

The use case is subtitle correction. If the subtitle system does not generate a subtitle (an empty string), and some post-editors correct that by adding a subtitle while others do not, I would expect the edited versions to have a higher TER (insertions), whereas those that left the empty string unchanged would yield 0.

Is this expected behavior in sacrebleu/the original TERCOM implementation? If so, what is the reasoning behind it? Perhaps that "typically" a reference must always be a non-empty string? I understand that the final TER score is normalized by the number of reference tokens, but it is not intuitive that the score is 100 as a result of dividing by 0.

from sacrebleu import TER

ter_metric = TER() 

ref = ""
hyp = ""
score = ter_metric.sentence_score(hypothesis=hyp, references=[ref]).score

print(score)
# 100
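
For illustration, here is a toy sketch of the normalization in question (my own simplification, not sacrebleu's or TERCOM's actual implementation): the edit count is divided by the number of reference tokens, so an empty reference makes the denominator zero and the empty/empty case becomes a 0/0 that has to be settled by convention.

# Toy sketch only -- a plain token-level edit distance, ignoring the
# shift operations of real TER. It exists just to show where dividing
# by the reference length breaks down for empty references.
def toy_ter(hypothesis: str, reference: str) -> float:
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    # Standard Levenshtein distance over tokens.
    dist = [[0] * (len(ref_tokens) + 1) for _ in range(len(hyp_tokens) + 1)]
    for i in range(len(hyp_tokens) + 1):
        dist[i][0] = i
    for j in range(len(ref_tokens) + 1):
        dist[0][j] = j
    for i in range(1, len(hyp_tokens) + 1):
        for j in range(1, len(ref_tokens) + 1):
            cost = 0 if hyp_tokens[i - 1] == ref_tokens[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    edits = dist[-1][-1]
    if not ref_tokens:
        # 0/0 when both sides are empty: sacrebleu currently reports 100;
        # returning 0 here is the behavior this issue asks about.
        return 0.0 if edits == 0 else 100.0
    return 100.0 * edits / len(ref_tokens)

print(toy_ter("", ""))               # 0.0 under the convention expected here
print(toy_ter("some subtitle", ""))  # 100.0: edits against an empty reference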
ozancaglayan commented 1 year ago

Any ideas, @ales-t?

ales-t commented 1 year ago

Good question. I've done a quick check of tercom outputs:

Since we're not consistent with tercom here anyway, maybe it would be worthwhile to change sacrebleu's behavior so that when the hypothesis matches the reference (including the empty/empty case), we also return TER=0? (Seems like a good choice to me, at least at first glance.)
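
Until such a change lands, a guard around the public sentence_score call would already give that behavior on the user's side; the wrapper below is only a sketch of the idea, not sacrebleu code.

from sacrebleu import TER

ter_metric = TER()

def sentence_ter(hypothesis: str, reference: str) -> float:
    # Hypothetical wrapper: treat the empty-vs-empty case as a perfect
    # match (TER = 0) instead of the 100 that sacrebleu currently returns.
    if not hypothesis.split() and not reference.split():
        return 0.0
    return ter_metric.sentence_score(hypothesis=hypothesis,
                                     references=[reference]).score

print(sentence_ter("", ""))  # 0.0 instead of 100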

BramVanroy commented 1 year ago

Thanks @ales-t, that looks like a good solution indeed!