jamesturk / jellyfish

🪼 a python library for doing approximate and phonetic matching of strings.
https://jamesturk.github.io/jellyfish/
MIT License

Computation of half transpositions for Jaro metric #190

Closed vitalie-cracan closed 1 year ago

vitalie-cracan commented 1 year ago

The computation of half transpositions seems to be implemented differently from the original paper (I could not find a free copy of Jaro's paper, but here is one from Winkler that is free: https://www.researchgate.net/publication/245534659_Advanced_Methods_For_Record_Linkage). It is certainly implemented differently in Java/Apache Commons: https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/JaroWinklerSimilarity.java#L163

The difference is that the halving there is a floating-point division, not an integer one. So 3 half-transpositions count as 1.5 transpositions, not 1.
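To make the difference concrete, here is a minimal sketch of the Jaro metric in pure Python. It is not jellyfish's actual code; the function name `jaro` and the `integer_halving` flag are hypothetical, introduced only to switch between the two halving conventions discussed here.

```python
def jaro(s1: str, s2: str, integer_halving: bool = True) -> float:
    """Jaro similarity sketch. `integer_halving` is a hypothetical switch,
    not a parameter of jellyfish or Apache Commons Text."""
    if not s1 or not s2:
        return 0.0

    # Characters match only if they are equal and at most `window` positions apart.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    flags1 = [False] * len(s1)
    flags2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not flags2[j] and s2[j] == ch:
                flags1[i] = flags2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Half-transpositions: matched characters that line up out of order.
    matched1 = [c for c, f in zip(s1, flags1) if f]
    matched2 = [c for c, f in zip(s2, flags2) if f]
    half_transpositions = sum(a != b for a, b in zip(matched1, matched2))

    # The point of disagreement: floor division versus true division.
    if integer_halving:
        t = half_transpositions // 2   # 3 -> 1
    else:
        t = half_transpositions / 2    # 3 -> 1.5

    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


# "abcdef" vs "bcadef": 6 matches, 3 half-transpositions.
print(jaro("abcdef", "bcadef", integer_halving=True))   # ≈ 0.944
print(jaro("abcdef", "bcadef", integer_halving=False))  # ≈ 0.917
```

The two conventions only disagree when the number of half-transpositions is odd; with an even count both give the same score.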

jamesturk commented 1 year ago

Rounding down to 1 is pretty common among implementations, and it is the route we chose. It is by design.

vitalie-cracan commented 1 year ago

If you have time, can you mention the reasons why this route was chosen? Thanks.

jamesturk commented 1 year ago

I'm not sure I recall the entire reasoning at the time; presumably it was to match an existing implementation or two. It's also nicer working with integers than floats :)

Someone a while ago suggested allowing setting custom weights, which I'd be glad to accept a PR for so long as it didn't break backwards compatibility.
