tdebatty / java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...
Other
2.69k stars 410 forks source link

String similarity techniques issue #31

Closed nischay21 closed 7 years ago

nischay21 commented 7 years ago

We are working on Record linkage project. We are observing a strange behavior from all of the standard technique like Jaro Winkler, Levenshtein, N-Gram, Damerau-Levenshtein, Jaccard index, Sorensen-Dice which we used from your repository.

Say, String 1= MINI GRINDER KIT String 2= Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders String 3= Milwaukee Video Borescope, Rotating Inspection Scope, Series: M-SPECTOR 360, 2.7 in 640 x 480 pixels High-Resolution LCD, Plastic, Black/Red

In the above case string 1 and string 2 are related the score of all the methods as shown below. Jaro Winkler -> 0.391666651 Levenshtein -> 75 N-Gram, -> 0.9375 Damerau -> 75 Jaccard index -> 0 Sorensen-Dice -> 0 Cosine -> 0

But string 1 and string 3 are not at all related, but distance method are giving very high score. Jaro Winkler -> 0.435714275 Levenshtein -> 133 N-Gram, -> 0.953571439 Damerau -> 133 Jaccard index -> 1 Sorensen-Dice -> 0 Cosine -> 0

Any thoughts .?

tdebatty commented 7 years ago

Hi,

I created a small example to reproduce your case: https://github.com/tdebatty/java-string-similarity/blob/master/src/main/java/info/debatty/java/stringsimilarity/examples/nischay21.java

The result of running this example is:

S1 vs S2
JaroWinkler : 0.6083333492279053
Levenshtein : 75.0
NGram : 0.9375
Damerau : 75.0
Jaccard : 1.0
SorensenDice : 1.0
Cosine : 1.0

S1 vs S3
JaroWinkler : 0.5642857253551483
Levenshtein : 133.0
NGram : 0.9535714387893677
Damerau : 133.0
Jaccard : 1.0
SorensenDice : 1.0
Cosine : 1.0

With .toLower()
S1 vs S2
JaroWinkler : 0.3916666507720947
Levenshtein : 64.0
NGram : 0.8062499761581421
Damerau : 64.0
Jaccard : 0.8194444444444444
SorensenDice : 0.6941176470588235
Cosine : 0.4427217874246471

S1 vs S3
JaroWinkler : 0.42023807764053345
Levenshtein : 128.0
NGram : 0.925000011920929
Damerau : 128.0
Jaccard : 1.0
SorensenDice : 1.0
Cosine : 1.0

My first idea is that all those algorithms are case sensitive. From your description, I guess this is not the case for your application, so you should use .toLowerCase before calling the method distance(s1, s2), as shown in the example.

When using .toLowerCase, the distance method with s1 and s3 will produce a higher score then comparing s1 and s2 with the same algorithm.

nischay21 commented 7 years ago

@tdebatty Thank you, your solution was very helpful and worked perfectly after performing .toLowerCase.