tdebatty / java-string-similarity

Implementation of various string similarity and distance algorithms: Levenshtein, Jaro-Winkler, n-Gram, Q-Gram, Jaccard index, Longest Common Subsequence edit distance, cosine similarity ...

Jaro Winkler similarity on short strings #50

Open fabriziofortino opened 5 years ago

fabriziofortino commented 5 years ago

I am trying to use Jaro-Winkler similarity to check color strings coming from a user-submitted form against a fixed palette of colors.

Using Jaro-Winkler similarity, I get these kinds of results for very short strings:

Is it correct to get similarity = 0 in the first case?

saschaszott commented 5 years ago

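The Jaro similarity of "ed" and "red" is 0, since the number of matching characters (parameter m) is 0: the match window is floor(max(|s1|, |s2|) / 2) - 1 = 0, so only characters at the same position can match, and e/r and d/e differ. Furthermore, the length of the common prefix of s1 and s2 (parameter l) is 0. This results in a Jaro-Winkler similarity of 0, since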

sim_jw = sim_j + l * 0.1 * (1 - sim_j) = 0 + 0 * 0.1 * 1 = 0

Jaro-Winkler gives more favorable ratings to strings that match from the beginning.
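To make the calculation above concrete, here is a minimal sketch of the textbook Jaro and Jaro-Winkler definitions in Java. It is not this library's code: the class and method names (`JaroWinklerSketch`, `jaro`, `jaroWinkler`) are hypothetical, and it assumes the standard match window, a scaling factor of 0.1, a prefix capped at 4 characters, and no boost threshold.

```java
class JaroWinklerSketch {

    /** Textbook Jaro similarity (assumed definition, not the library's implementation). */
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) return 1.0;
        int len1 = s1.length(), len2 = s2.length();
        if (len1 == 0 || len2 == 0) return 0.0;

        // Characters only match if their positions differ by at most `window`.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int m = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) return 0.0; // no matching characters, as for "ed" vs "red"

        // Transpositions: matched characters that appear in a different order, halved.
        int k = 0, outOfOrder = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) continue;
            while (!matched2[k]) k++;
            if (s1.charAt(i) != s2.charAt(k)) outOfOrder++;
            k++;
        }
        double t = outOfOrder / 2.0;
        return ((double) m / len1 + (double) m / len2 + (m - t) / m) / 3.0;
    }

    /** sim_jw = sim_j + l * 0.1 * (1 - sim_j), with the common prefix l capped at 4. */
    static double jaroWinkler(String s1, String s2) {
        double simJ = jaro(s1, s2);
        int maxPrefix = Math.min(4, Math.min(s1.length(), s2.length()));
        int l = 0;
        while (l < maxPrefix && s1.charAt(l) == s2.charAt(l)) l++;
        return simJ + l * 0.1 * (1 - simJ);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("ed", "red"));  // prints 0.0
        System.out.println(jaroWinkler("red", "red")); // prints 1.0
    }
}
```

With a match window of 0, "ed" and "red" share no character at the same position, so both m and l are 0 and the score is 0. For very short inputs, a different measure or some normalization of the input may suit the use case better.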

manshulgoel commented 5 years ago

When I compare the two strings "abcdefghij" and "aaaaaaaaa" with Jaro-Winkler, my output comes to around 0.4023.....

When I check the same strings on https://asecuritysite.com/forensics/simstring, it gives me 0.46. Kindly help in this regard.
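For reference, a worked calculation with the textbook formula lands close to the website's 0.46. It assumes the standard match window floor(max(|s1|, |s2|) / 2) - 1 = 4, a scaling factor of 0.1, and a prefix bonus applied unconditionally. Only the leading a matches, so m = 1, t = 0 and l = 1:

```latex
\begin{aligned}
\mathrm{sim}_j &= \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)
  = \frac{1}{3}\left(\frac{1}{10} + \frac{1}{9} + \frac{1}{1}\right) \approx 0.4037 \\
\mathrm{sim}_{jw} &= \mathrm{sim}_j + l \cdot 0.1 \cdot (1 - \mathrm{sim}_j)
  \approx 0.4037 + 1 \cdot 0.1 \cdot 0.5963 \approx 0.4633
\end{aligned}
```

The unboosted Jaro value, about 0.4037, is near but not equal to the 0.4023..... quoted above. Some Jaro-Winkler variants apply the prefix bonus only when sim_j exceeds a boost threshold (0.7 in Winkler's formulation). Whether that accounts for the difference here is an assumption, not something this thread confirms.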