rockymadden / stringmetric

:dart: String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).
https://rockymadden.com/stringmetric/

Calculation of overlap coefficient may be incorrect #22

Closed mrkkrp closed 8 years ago

mrkkrp commented 9 years ago

According to your tests, the overlap coefficient of "context" and "contact" is 0.7142857142857143, i.e. 5/7. This means that you count the character 't' twice, so the intersection of these words is taken to be "contt".

Now, if you read the definition of the overlap coefficient, you will find that it is defined in terms of sets. A set cannot contain two identical elements, and a set has no order. You can only say whether an element is in the set or not. The intersection of "context" and "contact" should therefore be "cont".

Moreover, if we treat the arguments as sets, the denominator of the ratio should be 5, not 7, because the set for "contact" is "conta".

So the result should be 4/5 = 0.8.
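To make the two readings concrete, here is a minimal Scala sketch that computes the coefficient both ways, as the intersection size divided by the size of the smaller operand. It is only an illustration: `OverlapSketch`, `setOverlap` and `bagOverlap` are hypothetical names, not part of stringmetric, and the sketch assumes single characters as elements and returns 0.0 for empty input.

```scala
object OverlapSketch {
  // Set reading: duplicates collapse and order is ignored.
  def setOverlap(a: String, b: String): Double = {
    val sa = a.toSet
    val sb = b.toSet
    if (sa.isEmpty || sb.isEmpty) 0.0 // assumption: empty input yields 0.0
    else (sa intersect sb).size.toDouble / math.min(sa.size, sb.size)
  }

  // Multiset (bag) reading: each occurrence of a character counts.
  // List.intersect keeps an element once per matching occurrence.
  def bagOverlap(a: String, b: String): Double = {
    if (a.isEmpty || b.isEmpty) 0.0 // assumption: empty input yields 0.0
    else (a.toList intersect b.toList).size.toDouble / math.min(a.length, b.length)
  }
}
```

For "context" and "contact", `setOverlap` gives 4/5 (intersection "cont", smaller set "conta"), while `bagOverlap` gives 5/7 (intersection "contt", both words of length 7), which matches the value in the tests.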

In an old issue, you gave this link, http://www.planetcalc.com/1664/, as a demonstration of 'loose' intersection. Note that this calculator works just as I've described:

C, O, N, T, E, X, T and C, O, N, T, A, C, T = C, N, O, T

I'm not sure about all this stuff. Maybe you know some special rules for string-to-set conversion? Maybe two distinct characters should be considered as distinct elements in a set?