roy-ht / editdistance

Fast implementation of the edit distance (Levenshtein distance)
MIT License

Feature: Convert edit distance to ratio/similarity #28

Open odannyc opened 5 years ago

odannyc commented 5 years ago

I would like a way to get the similarity of two strings instead of just the distance. For example, with python-Levenshtein I can do:

>>> Levenshtein.ratio("ab", "abc")
0.8
maxbachmann commented 3 years ago

It is worth noting that Levenshtein.ratio does not return a similarity based on the normalized Levenshtein distance, which would be calculated as:

1 - lev_dist / max(len1, len2)

but a similarity based on the normalized InDel distance (like the Levenshtein distance, but without substitutions), which is normalized as:

1 - InDel_dist / (len1 + len2)

I provide a similar metric in my library RapidFuzz (https://maxbachmann.github.io/RapidFuzz/string_metric.html#normalized-levenshtein). However, with editdistance it should be simple to calculate the normalized version yourself.
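
For example, a minimal sketch of that calculation on top of this library, using editdistance.eval and the max-length normalization above (the function name is just illustrative):

```python
import editdistance

def normalized_levenshtein_similarity(a, b):
    """1 - lev_dist / max(len(a), len(b)); defined as 1.0 for two empty strings."""
    if not a and not b:
        return 1.0
    return 1 - editdistance.eval(a, b) / max(len(a), len(b))

# Compare with Levenshtein.ratio("ab", "abc") == 0.8 from above:
print(normalized_levenshtein_similarity("ab", "abc"))  # 0.666..., a different normalization
```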

bertsky commented 3 years ago

a similarity based on the normalized Levenshtein distance, which would be calculated as:

1 - lev_dist / max(len1, len2)

[...] However, with editdistance it should be simple to calculate the normalized version yourself.

I think this is also not quite correct, though: the denominator should be the length of the alignment path (which can be slightly longer than the longest sequence). See this comment

maxbachmann commented 3 years ago

I think this is also not quite correct, though: the denominator should be the length of the alignment path (which can be slightly longer than the longest sequence).

I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I built my own ;) Do you happen to have a reference or an explanation of why this is needed?

E.g. the WER is commonly implemented as

WER = lev_dist / len(reference) = (S+D+I)/(S+D+C)
WAcc = 1 - WER

This has the disadvantage that WER can be over 1. The implementation I mentioned is a modification that uses the length of the longer string/list, so the error rate is always between 0 and 1. It also turned out to be a simple way to map the Levenshtein distance, with arbitrary weights, to a similarity (here the normalization I use in RapidFuzz, for a range of 0-100).

Dividing the edit distance by the maximum possible edit distance made sense to me.
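
As a toy illustration of the difference (hypothetical example strings, distances via editdistance.eval):

```python
import editdistance

ref, hyp = "ab", "xyz"                    # hypothetical reference/hypothesis pair
dist = editdistance.eval(ref, hyp)        # 3: two substitutions + one insertion
wer = dist / len(ref)                     # 1.5 -> normalizing by the reference can exceed 1
bounded = dist / max(len(ref), len(hyp))  # 1.0 -> normalizing by the longer string stays in [0, 1]
print(wer, bounded)
```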

bertsky commented 3 years ago

I am a complete noob on these OCR topics. I simply needed a fast implementation and found none, so I built my own ;)

Looks promising, will have a look.

Do you happen to have a reference or an explanation of why this is needed?

Not sure about the right reference (most papers do not even treat this detail), but I found this: https://www.yorku.ca/mack/nordichi2002-shortpaper.html

Equation 3 is recommended because it always yields the same error rate as that obtained by a character-by-character analysis of the optimal alignments, as just described. In general, Equation 3 yields a slightly lower error rate than Equation 1 since len(alignment) >= max(|A|, |B|).

E.g. the WER is commonly implemented as

WER = lev_dist / len(reference) = (S+D+I)/(S+D+C)
WAcc = 1 - WER

This has the disadvantage that WER can be over 1

Yes, exactly. Same for CER. Unfortunately, even fairly standard tools (used a lot for published results) make that mistake (see here). Others get it right, though.

The implementation I mentioned is a modification that uses the length of the longer string/list, so the error rate is always between 0 and 1. Dividing the edit distance by the maximum possible edit distance made sense to me.

Yes, using the maximum distance is already much better than using the reference string length. My point was that even max-dist is biased: the actual alignment path can be longer. (For example, aligning abcde with bcdef would yield [(a,·) (b,b) (c,c) (d,d) (e,e) (·,f)] which has length 6, not 5.)
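
A rough sketch of that alignment-based normalization, using a plain DP with backtracking to recover the length of one optimal alignment path (different optimal paths can differ slightly in length; this is only an illustration, not part of this library's API):

```python
def levenshtein_with_alignment_length(a, b):
    """Return (edit distance, length of one optimal alignment path)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Walk back along one optimal path and count its steps.
    i, j, path_len = n, m, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1):
            i, j = i - 1, j - 1   # match / substitution
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                # deletion
        else:
            j -= 1                # insertion
        path_len += 1
    return dp[n][m], path_len

dist, path_len = levenshtein_with_alignment_length("abcde", "bcdef")
print(dist / max(5, 5))   # 0.4     (normalized by the longer string)
print(dist / path_len)    # ~0.333  (normalized by the alignment length, which is 6)
```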