impactcentre / ocrevalUAtion

OCR evaluation brought to you by University of Alicante
Apache License 2.0

CER calculation: divide distance by length of alignment path #21

Open bertsky opened 3 years ago

bertsky commented 3 years ago

In the current implementation, the denominator of the CER calculation is the reference (first) text's length:

https://github.com/impactcentre/ocrevalUAtion/blob/84f15b894eccd05365a9116f03b5b8b97c1b74b6/src/main/java/eu/digitisation/output/ErrorMeasure.java#L53

But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.

AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.
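
To make the difference concrete, here is a minimal sketch (in Python rather than the project's Java, assuming the python-Levenshtein package's distance(); the helper names cer_reference and cer_maxlen are made up for illustration):

from Levenshtein import distance

def cer_reference(ref, out):
    # current behaviour: normalize by the reference length (can exceed 1)
    return distance(ref, out) / len(ref)

def cer_maxlen(ref, out):
    # symmetric alternative: normalize by the longer of the two strings
    return distance(ref, out) / max(len(ref), len(out))

print(cer_reference('foo', 'foobarbaz'))  # 6 / 3 = 2.0, i.e. a 200% "error rate"
print(cer_maxlen('foo', 'foobarbaz'))     # 6 / 9 ≈ 0.67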

bertsky commented 3 years ago

Same for

https://github.com/impactcentre/ocrevalUAtion/blob/84f15b894eccd05365a9116f03b5b8b97c1b74b6/src/main/java/eu/digitisation/output/ErrorMeasure.java#L74

and

https://github.com/impactcentre/ocrevalUAtion/blob/84f15b894eccd05365a9116f03b5b8b97c1b74b6/src/main/java/eu/digitisation/output/ErrorMeasure.java#L95

and probably also

https://github.com/impactcentre/ocrevalUAtion/blob/84f15b894eccd05365a9116f03b5b8b97c1b74b6/src/main/java/eu/digitisation/output/CharStatTable.java#L174

rccarrasco commented 3 years ago

In the current implementation, the denominator of the CER calculation is the reference (first) text's length:

https://github.com/impactcentre/ocrevalUAtion/blob/84f15b894eccd05365a9116f03b5b8b97c1b74b6/src/main/java/eu/digitisation/output/ErrorMeasure.java#L53

But if you divide by the reference length instead of the maximum length, you can get CER contributions > 1.

AFAIK this is not correct. The Levenshtein edit distance is symmetrical, and so should be the rate calculation based on it.

Levenshtein distance is indeed symmetric, but the texts here are not on equal footing: one is the ground-truth while the other is the output. By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...". Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

bertsky commented 3 years ago

By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...".

In the limit, yes. But that figure is the correct one – or would you rather have 100% then?

EDIT sorry, I actually meant 0% accuracy (as being correct in the limit). So, I must contradict you: normalizing to the max will get a 100% CER in the limit.

Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

Empty output will also become 0% for the max-length denominator, not infinite.

EDIT Here I also meant accuracy, not CER. Empty output will also be 100% CER.

And the asymmetrical denominator can yield larger than 100% rates.
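
A quick numerical check of this (again assuming python-Levenshtein's distance()):

from Levenshtein import distance

ref = 'AB'
for out in ('ABBBB', 'ABBBBBBBB', 'AB' + 'B' * 100):
    cer = distance(ref, out) / max(len(ref), len(out))
    # with the max-length denominator, CER tends to 1 (100%) as more B's are appended;
    # accuracy = 1 - CER tends to 0
    print(len(out), round(cer, 3), round(1 - cer, 3))

print(distance('AB', '') / max(len('AB'), len('')))  # empty output: 2 / 2 = 1.0, i.e. 100% CER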

kba commented 3 years ago

By normalizing to the max one will get a 0% CER if the reference text is 'AB' and the output is "ABBBBBBB...".

In the limit, yes. But that figure is the correct one – or would you rather have 100% then?

They might have the first two tokens in common, but since the longer one has lots of additions, in my intuition this should register as 100% incorrect. Such "more errors than tokens" cases, where CER > 100%, I've always interpreted as CER == 100% (but still noted them down to investigate).

Normalizing to the ground-truth makes more sense here (an empty output will be a 100% CER, as expected, instead of infinite).

Empty output will also become 0% for the max-length denominator, not infinite.

And the asymmetrical denominator can yield larger than 100% rates.

How about CER = levenshtein(a, b) / max(len(a), len(b))? (disregard, that would again lead to AB vs ABBBBB being less than 100% :( )

bertsky commented 3 years ago

How about CER = levenshtein(a, b) / max(len(a), len(b))

That's what I was arguing for (as being the only correct implementation).

disregard, that would again lead to AB vs ABBBBB being less than 100% :(

What do you mean, 100% accuracy or 100% error rate? (My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

kba commented 3 years ago

What do you mean, 100% accuracy or 100% error rate?

I mean error rate.

(My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

I think it's counter-intuitive that, however many spurious extra tokens are in the detection, CER will always be less than 100%. Also, it converges towards 1, not 0, doesn't it? Because (n + Inf) / Inf -> 1?

In fact, what about CER = min(1, distance(a,b) / min(len(a), len(b)))?

cer('AB', 'ABBBB') = 1
cer('AB', 'ABBBBBBBB') = 1

seems reasonable but then again

cer('ABBBB', 'A') = 1
cer('AB', 'A') = 1

perhaps not.
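
For reference, a sketch that plugs those four cases into the candidate formula (again assuming python-Levenshtein's distance(); cer_min is a made-up name) reproduces exactly these values:

from Levenshtein import distance

def cer_min(a, b):
    # candidate formula: normalize by the shorter string, clamp at 1
    return min(1, distance(a, b) / min(len(a), len(b)))

print(cer_min('AB', 'ABBBB'))      # min(1, 3/2) = 1
print(cer_min('AB', 'ABBBBBBBB'))  # min(1, 7/2) = 1
print(cer_min('ABBBB', 'A'))       # min(1, 4/1) = 1
print(cer_min('AB', 'A'))          # min(1, 1/1) = 1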

M3ssman commented 3 years ago

Actually, I calculate the _correctionratio (inverted CER) between the ground truth gt and the evaluation candidate like this:

dist = distance(gt, candidate)

if dist >= len(gt):
    return 0

return (len(gt) - dist) / len(gt)

bertsky commented 3 years ago

(My stance is that the more Bs that output has, the lower CER must become – down to zero in the limit.)

I think it's counter-intuitive,

Yes, sorry, I was too sloppy with my writing; I was really thinking of accuracy, not CER (see edits above).

So we are actually in agreement :-)

cases where CER > 100% I've always interpreted as CER == 100%

That's not the same, though. You cannot just arbitrarily saturate at 100%. The asymmetrical denominator is already biased; clipping does not unbias it.

In fact, what about CER = min(1, distance(a,b) / min(len(a), len(b)))?

Same as above (biased: this would make values close to one much more likely than they really are).

bertsky commented 3 years ago

I calculate the _correctionratio (inverted CER) between the ground truth gt and the evaluation candidate like this:

dist = distance(gt, candidate)

if dist >= len(gt):
    return 0

return (len(gt) - dist) / len(gt)

That's nothing other than accuracy = 1 - CER, though – in the (clipped) asymmetrical-denominator implementation: (len(gt) - dist) / len(gt) = 1 - dist / len(gt), and the dist >= len(gt) guard is exactly the clipping of the CER at 1.

bertsky commented 3 years ago

I am wondering, though, whether the correct denominator really is not the max-length but the length of the alignment path (i.e. the number of insertions plus deletions plus substitutions plus identities in a minimal alignment).

(This seems to be the implementation chosen in ISRI tools.)
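
A rough sketch of that variant (assuming the python-Levenshtein package, whose editops() yields one minimal alignment; the function name cer_alignment_path is made up here):

from Levenshtein import editops

def cer_alignment_path(ref, out):
    # count the operations on one minimal alignment between reference and output
    ops = editops(ref, out)
    subs = sum(op == 'replace' for op, _, _ in ops)
    dels = sum(op == 'delete' for op, _, _ in ops)
    ins = sum(op == 'insert' for op, _, _ in ops)
    matches = len(ref) - subs - dels       # reference characters left unchanged
    path = matches + subs + dels + ins     # length of the alignment path
    return (subs + dels + ins) / path if path else 0.0

print(cer_alignment_path('AB', 'ABBBBBBB'))  # 6 / 8 = 0.75, and never above 1
print(cer_alignment_path('AB', ''))          # 2 / 2 = 1.0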

mikegerber commented 3 years ago

Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes.

Why would the error rate be symmetrical? It's only similar to the symmetrical distance, but the compared texts have clear roles: the GT reference and the compared text.

bertsky commented 3 years ago

Rice's dissertation (https://www.stephenvrice.com/images/rice-dissertation.pdf) argues that the accuracy can be negative. In the same manner, I would argue that the CER can be >1 or even infinite. I would however clamp it to 1 for practical purposes.

Sure you can define it that way, but that would not be interpretable and thus not useful anymore. (Clipping does not make it more interpretable, only more biased.)

It now seems clear to me that this definition was simply a misconception (on Rice's part, and possibly on the part of his followers). The relevant section does not even discuss the alternatives of using the maximum sequence length or simply the alignment path length. The latter is the natural definition here, as it directly maps to the [0, 1] range.

Why would the error rate be symmetrical? It's only similar to the symmetrical distance, but the compared texts have clear roles: the GT reference and the compared text.

Because that rate is meant to approximate the probability of an average character being misrecognized, and that can (assuming ergodicity) be expressed in terms of the number of distinct errors – but since the latter is (in the Levenshtein case) strictly symmetric, the former must be, too.

M3ssman commented 1 year ago

@mikegerber Thank you for pointing back to the original dissertation from Rice!

His accuracy formula on p. 25 matches exactly what I tried to express above. To me it seems completely natural to put it this way.

Regarding the possibility of getting negative accuracy: I've seen this in some cases, hence the restriction for when one encounters more errors than the GT has characters at all, which is interpreted as

the extreme situation in which the entire correct string can be entered from scratch using fewer keystrokes than are needed to correct the generated string.

Though one must consider that Rice himself, in the OCR system benchmarks (cf. The Fifth Annual Test of OCR Accuracy), removed any results/systems that fell below a certain accuracy limit of 90% (!) and didn't try to make sense of corner cases like the above.

bertsky commented 1 year ago

This is not about corner cases, though. CER is meant/intended as an empirical estimate of the probability of misrecognising a random character, assuming ergodicity. Using a biased denominator makes for a biased estimate, period. You usually aggregate CER as an arithmetic average over many pairs of strings (typically lines), and that means even a single line where the OCR text is longer than the GT text (i.e. a single CER beyond 1 – possibly many times larger) will distort the whole average arbitrarily.

Later formulations rightfully deviate from Rice by defining the denominator as $\sum(i + s + d + c)$ (where $c$ is the number of correctly recognised characters) – which is equivalent to saying the denominator is the length of the alignment path. (This has been called the normalized CER, too.)
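
As a toy illustration of that aggregation effect (made-up line pairs, again assuming python-Levenshtein's distance()):

from Levenshtein import distance

lines = [('foo', 'foo'), ('bar', 'bar'), ('AB', 'ABBBBBBB')]  # two perfect lines, one with long spurious output

ref_norm = [distance(r, o) / len(r) for r, o in lines]
# max-length denominator as a stand-in; here it coincides with the alignment-path
# denominator, since the noisy line only contains insertions
path_like = [distance(r, o) / max(len(r), len(o)) for r, o in lines]

print(sum(ref_norm) / len(ref_norm))    # (0 + 0 + 3.0) / 3 = 1.0 -> a "100% CER" despite two perfect lines
print(sum(path_like) / len(path_like))  # (0 + 0 + 0.75) / 3 = 0.25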