jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

Some doubts regarding jaro_winkler_similarity and jaro_similarity results #183

Closed passcombo closed 1 year ago

passcombo commented 1 year ago

V 11.0. Tested on simple a/ab combinations; the results do not seem intuitive:

jaro_winkler_similarity('a', 'ab') = 0.85

jaro_similarity('a', 'ab') = 0.83

According to the wiki, 1/3 of the sum should be about 0.5?

https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

jaro_winkler_similarity('a', 'ba') = 0.0

jaro_similarity('a', 'ba') = 0.0

Same here, shouldn't it be about 0.5?

Or maybe I confused something?

jamesturk commented 1 year ago

Hi,

I'm not sure I understand your question or what you're doing to get your results where the similarity equals zero. Or why it would equal 0.5.

As for the first part, the equation on Wikipedia shows that

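$$sim_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$$

(for m > 0, where m is the number of matching characters and t is the number of transpositions)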

This means 1/3*(1/2+1+1), which is 0.83333 -- the Winkler modification accounts for the boost to 0.85.
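To make the arithmetic above concrete, here is a minimal Python sketch that plugs the counts for 'a' vs 'ab' into the formulas from the Wikipedia article. The helper names `jaro_from_counts` and `winkler_boost` are hypothetical and not part of jellyfish, and the scaling factor p = 0.1 and the prefix cap of 4 are the standard values from the Wikipedia description, assumed here to match what the library uses.

```python
def jaro_from_counts(m, t, len1, len2):
    """Jaro similarity from match count m and transposition count t."""
    if m == 0:
        return 0.0
    return (m / len1 + m / len2 + (m - t) / m) / 3


def winkler_boost(sim_j, prefix_len, p=0.1):
    """Winkler adjustment: reward a shared prefix (capped at 4 characters)."""
    prefix_len = min(prefix_len, 4)
    return sim_j + prefix_len * p * (1 - sim_j)


# 'a' vs 'ab': one matching character, no transpositions,
# and a shared prefix of length 1.
sim_j = jaro_from_counts(m=1, t=0, len1=1, len2=2)
sim_w = winkler_boost(sim_j, prefix_len=1)
print(round(sim_j, 4), round(sim_w, 4))  # 0.8333 0.85
```

The boost is 1 * 0.1 * (1 - 0.8333), roughly 0.0167, which is the gap between the two values reported in the issue.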

passcombo commented 1 year ago

Thanks, I counted the transpositions wrongly. In any case, the other example with reversed order, "a" vs "ba", gives zero, which indicates that jaro_winkler_similarity has some limitations; the other method, damerau_levenshtein_distance, seems to work better for me.
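The zero for "a" vs "ba" is consistent with the matching window in the Jaro definition on the Wikipedia page linked above: two equal characters only count as a match if their positions differ by at most floor(max(|s1|, |s2|) / 2) - 1. For strings of length 1 and 2 that window is 0, so 'a' at index 0 cannot match 'a' at index 1. The sketch below is a plain reference-style implementation written for illustration, not jellyfish's actual code; the function name `jaro_similarity_sketch` and the clamping of a negative window to 0 are assumptions.

```python
def jaro_similarity_sketch(s1, s2):
    """Illustrative Jaro similarity following the Wikipedia definition."""
    if not s1 or not s2:
        return 0.0
    # Equal characters match only if their indices differ by <= window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Matched characters that line up in a different order, halved.
    matched1 = [c for c, u in zip(s1, used1) if u]
    matched2 = [c for c, u in zip(s2, used2) if u]
    t = sum(a != b for a, b in zip(matched1, matched2)) // 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


# 'a' sits at index 0 in both strings, so it matches.
print(jaro_similarity_sketch("a", "ab"))
# 'a' is at index 0 vs index 1, outside the window of 0: no matches.
print(jaro_similarity_sketch("a", "ba"))
```

With no matching characters the Jaro score is 0 by definition, and the Winkler adjustment adds nothing because there is no shared prefix either. An edit distance such as damerau_levenshtein_distance has no such window, which is why it still registers the two strings as one edit apart.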