tdebatty / java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

NGram exact match varying results #27

Closed dodgy99 closed 7 years ago

dodgy99 commented 7 years ago

I am using this library in an Apache Spark application (using scala).

I am seeing inconsistent results from the NGram algorithm: exact matches return either "0.0" or "1.0". Below are some examples.

`QGram dig = new QGram(2);
dig.distance("S","S"); //result = 1.0
dig.distance("Kirk","Kirk"); //result = 0.0
dig.distance("07426796542","07426796542"); //result = 0.0`

Should all these examples not result in a score of 1.0 as they are exactly the same?

tdebatty commented 7 years ago

Hi,

Thank you!

This happens because the strings "S" are too short (fewer than 2 characters). I will correct this and publish a new release...
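To illustrate the kind of guard described above, here is a minimal sketch of a fallback for inputs that are too short to form a single n-gram. It is not the library's actual source: the class `ShortStringFallback` and the methods `needsFallback` and `positionalDistance` are hypothetical names chosen for this example. The assumption is that a distance of 0.0 means "identical", so the fallback counts mismatching positions rather than matching ones.

```java
// Hypothetical sketch; names and logic are illustrative, not taken from the library.
final class ShortStringFallback {

    // True when at least one input is shorter than n, so no n-gram can be built from it.
    static boolean needsFallback(String s0, String s1, int n) {
        return s0.length() < n || s1.length() < n;
    }

    // Position-by-position distance in [0, 1]: 0.0 for identical strings,
    // 1.0 when no position matches.
    static double positionalDistance(String s0, String s1) {
        int longest = Math.max(s0.length(), s1.length());
        if (longest == 0) {
            return 0.0; // two empty strings are identical
        }
        int shortest = Math.min(s0.length(), s1.length());
        // Characters beyond the end of the shorter string count as mismatches.
        int mismatches = longest - shortest;
        for (int i = 0; i < shortest; i++) {
            if (s0.charAt(i) != s1.charAt(i)) {
                mismatches++;
            }
        }
        return (double) mismatches / longest;
    }

    public static void main(String[] args) {
        System.out.println(needsFallback("S", "S", 2));   // true
        System.out.println(positionalDistance("S", "S")); // 0.0
        System.out.println(positionalDistance("S", "T")); // 1.0
    }
}
```

With a guard like this, `"S"` compared with `"S"` yields 0.0, which agrees with the 0.0 reported for the longer exact matches in the question.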

tdebatty commented 7 years ago

Fixed in release 0.21