larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0
614 stars 194 forks source link

j-w bug #254

Closed dipplestix closed 2 years ago

dipplestix commented 6 years ago

Cause: After examining the implementation of Jaro-Winkler and comparing it to another implementation, I found the order dependence is caused by the following bug in the Duke Implementation: In the Jaro-Winkler algrorithm, for each character in a given position in the first string, there is a 'hunting range' of character positions in the second string where the algorithm looks for an identical character. Their implementation looks one position too far to the left (compared to other implentations), but not one position too far to the right.

So the character 'D' in 'RANDALL' matches the character 'D' in 'DAMARIS', but the character 'D' in 'DAMARIS' does not match against the character 'D' in 'RANDALL'. There is asymmetry in how far the algorithm hunts for matches.

This bug is always present but can only cause asymmetry issues when the two strings are identical in length, because otherwise, string arguments are sorted by length before entering the algorithm.

This bug can be fixed by changing line 52 of https://github.com/larsga/Duke/blob/master/duke-core/src/main/java/no/priv/garshol/duke/comparators/JaroWinkler.java to: for (int ix2 = Math.max(0, ix - maxdist+1);

Note.. I am adding “1” to the max expression above, which, results in the for-loop looking at one less position.

-Carolyn