larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Jaro-Winkler Comparator bugs #247

Closed by dipplestix 6 years ago

dipplestix commented 6 years ago

The Jaro-Winkler comparator as it stands has two bugs.

1) The comparison is order-dependent: comparing (a, b) returns a different result than comparing (b, a).
2) Matches can be double-counted when searching through s2.
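To make the two problems concrete, here is a minimal sketch of a Jaro-Winkler similarity that avoids both. It is not Duke's code and not necessarily the fix I have written; the class name `JaroWinklerSketch`, the method names, the 0.1 prefix scale, and the 4-character prefix cap are assumptions taken from the common textbook formulation. It forces symmetry by putting the two strings into a canonical order before matching, and it prevents double counting by flagging each character of the second string as soon as it has been matched.

```java
// Illustrative sketch only: hypothetical class, not part of Duke.
final class JaroWinklerSketch {
    private JaroWinklerSketch() {}

    /** Jaro similarity in [0, 1]. Symmetric because the inputs are put in a canonical order. */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        // Canonical order: shorter string first, ties broken lexicographically,
        // so (a, b) and (b, a) run exactly the same computation.
        String a = s1, b = s2;
        if (a.length() > b.length()
                || (a.length() == b.length() && a.compareTo(b) > 0)) {
            a = s2;
            b = s1;
        }
        if (a.isEmpty()) return 0.0;

        // Characters match only if they are within this distance of each other.
        int window = Math.max(0, b.length() / 2 - 1);
        boolean[] matchedA = new boolean[a.length()];
        boolean[] matchedB = new boolean[b.length()];

        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(b.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                // Skipping already-matched positions in b prevents double counting.
                if (!matchedB[j] && a.charAt(i) == b.charAt(j)) {
                    matchedA[i] = true;
                    matchedB[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Walk the matched characters of both strings in order and count mismatches.
        int halfTranspositions = 0;
        int j = 0;
        for (int i = 0; i < a.length(); i++) {
            if (!matchedA[i]) continue;
            while (!matchedB[j]) j++;
            if (a.charAt(i) != b.charAt(j)) halfTranspositions++;
            j++;
        }

        double m = matches;
        double t = halfTranspositions / 2.0;
        return (m / a.length() + m / b.length() + (m - t) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters. */
    static double similarity(String s1, String s2) {
        double jaro = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < maxPrefix && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

With this sketch, `similarity("MARTHA", "MARHTA")` and `similarity("MARHTA", "MARTHA")` return the same value, and a repeated character such as the second `a` in `"aa"` can never be matched against the single `a` in `"a"` twice.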

I have a fix written, but I need permission to create a branch.

michalkurka commented 6 years ago

@dipplestix You should probably fork this project and open a PR against this repository from your fork. I can help you with that.

larsga commented 6 years ago

Yes, please fork and make a PR and we'll look at your fix. I haven't looked into it, but other people have identified the same two bugs in the Jaro-Winkler code, so I'm sure you're right.

dipplestix commented 6 years ago

Fixed in https://github.com/larsga/Duke/pull/248