larsga / Duke

Duke is a fast and flexible deduplication engine written in Java
Apache License 2.0

Levenshtein comparator: compactDistance and distance methods produce completely different results #239

Open knetkachou opened 7 years ago

knetkachou commented 7 years ago

I downloaded the sources and saw that "compactDistance" is the default method used in the Levenshtein comparator. The "distance" method is described as the original, naive implementation, using the Wagner & Fischer algorithm from 1974.

Comparing two strings "emma" and "ema".

With the default "compactDistance" in the compare method, it returns 0 (dist=0) and the comparator returns 1. So it treats the two strings as exactly the same, but they are not.

With the "distance" method, it returns 1 and the comparator returns 0.6666666666666667, which looks correct to me.

So does the "compactDistance" method work incorrectly?
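
To make the expected values easy to reproduce, here is a minimal stand-alone sketch of the textbook Wagner & Fischer computation. It is not Duke's code: the class `LevenshteinSketch` and its method names are hypothetical. The normalization in `similarity` (dividing by the length of the shorter string) is an assumption inferred from the reported value, since 1 - 1/3 matches 0.6666666666666667 for "emma" and "ema".

```java
// Hypothetical stand-alone class for illustration; not part of Duke.
final class LevenshteinSketch {

    // Textbook Wagner & Fischer dynamic programming over the full matrix.
    static int distance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        int[][] d = new int[n + 1][m + 1];

        // Turning a prefix into the empty string costs one deletion per char.
        for (int i = 0; i <= n; i++) d[i][0] = i;
        for (int j = 0; j <= m; j++) d[0][j] = j;

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                int deletion = d[i - 1][j] + 1;
                int insertion = d[i][j - 1] + 1;
                int substitution = d[i - 1][j - 1] + cost;
                d[i][j] = Math.min(Math.min(deletion, insertion), substitution);
            }
        }
        return d[n][m];
    }

    // Assumption: similarity = 1 - dist / min(len1, len2), with dist capped
    // at the shorter length so the result stays within [0, 1].
    static double similarity(String s1, String s2) {
        int len = Math.min(s1.length(), s2.length());
        if (len == 0) return s1.equals(s2) ? 1.0 : 0.0;
        int dist = Math.min(distance(s1, s2), len);
        return 1.0 - ((double) dist / len);
    }

    public static void main(String[] args) {
        System.out.println(distance("emma", "ema"));   // prints 1
        System.out.println(similarity("emma", "ema")); // roughly 0.667
    }
}
```

Under these assumptions, any correct edit-distance routine should return 1 for this pair (one "m" deleted), so a distance of 0 for two different strings would indicate a bug.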