vickumar1981 / stringdistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.
https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html

Bug in method CommonStringDistanceAlgo.getCommonChars #39

Closed brainliu81 closed 5 years ago

brainliu81 commented 5 years ago

    def getCommonChars(s1: String, s2: String, halfLen: Int): String = {
      val commonChars = new StringBuilder()
      val strCopy = new StringBuilder(s2)
      var n = s1.length
      val m = s2.length
      s1.zipWithIndex.foreach{ case (ch, chIndex) => {
        var foundIt = false
        var j = math.max(0, chIndex - halfLen)
        while (!foundIt && j <= Math.min(chIndex + halfLen, m - 1)) {
          if (strCopy(j) == ch) {
            foundIt = true
            commonChars.append(ch)
            strCopy.setCharAt(j, '\0')
          }
          j += 1
        }
      }}
      commonChars.toString
    }

vickumar1981 commented 5 years ago

@brainliu81 do you have a test case that I can use to debug/fix this? Thanks, I will take a look into this. Apologies for the late response.

This function is used in the Jaro and Jaro-Winkler implementations.

https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/impl/JaroImpl.scala#L16

Are those implementations not providing you a correct score for a pair of known values?

For example, given the two strings "MARTHA" and "MARHTA", the Jaro score should be 0.944 and the Jaro-Winkler score ought to be 0.961. I double-checked most of my test cases using this site: https://asecuritysite.com/forensics/simstring
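
To make that check reproducible, here is a minimal standalone sketch that reuses the matching logic of getCommonChars quoted above and computes both scores for that pair. It is not the library's code: the object name JaroCheck, the match window (half of the longer length, minus one), the prefix cap of 4 and the scaling factor 0.1 are assumptions taken from the textbook definitions of Jaro and Jaro-Winkler.

```scala
object JaroCheck {
  // Same matching logic as the getCommonChars quoted in this issue.
  // '\u0000' is used as the "already matched" marker.
  def getCommonChars(s1: String, s2: String, halfLen: Int): String = {
    val commonChars = new StringBuilder()
    val strCopy = new StringBuilder(s2)
    val m = s2.length
    s1.zipWithIndex.foreach { case (ch, chIndex) =>
      var foundIt = false
      var j = math.max(0, chIndex - halfLen)
      while (!foundIt && j <= math.min(chIndex + halfLen, m - 1)) {
        if (strCopy(j) == ch) {
          foundIt = true
          commonChars.append(ch)
          strCopy.setCharAt(j, '\u0000')
        }
        j += 1
      }
    }
    commonChars.toString
  }

  // Textbook Jaro score (assumed window: max length / 2 - 1).
  def jaro(s1: String, s2: String): Double = {
    val halfLen = math.max(0, math.max(s1.length, s2.length) / 2 - 1)
    val c1 = getCommonChars(s1, s2, halfLen)
    val c2 = getCommonChars(s2, s1, halfLen)
    if (c1.isEmpty || c1.length != c2.length) 0.0
    else {
      // Transpositions: half the number of positions where the
      // two common-character sequences disagree.
      val t = c1.zip(c2).count { case (a, b) => a != b } / 2
      val matches = c1.length.toDouble
      (matches / s1.length + matches / s2.length + (matches - t) / matches) / 3.0
    }
  }

  // Textbook Jaro-Winkler (assumed: prefix capped at 4, p = 0.1).
  def jaroWinkler(s1: String, s2: String, p: Double = 0.1): Double = {
    val j = jaro(s1, s2)
    val prefix = math.min(4, s1.zip(s2).takeWhile { case (a, b) => a == b }.length)
    j + prefix * p * (1.0 - j)
  }

  def main(args: Array[String]): Unit = {
    println(jaro("MARTHA", "MARHTA"))
    println(jaroWinkler("MARTHA", "MARHTA"))
  }
}
```

Under these assumptions, "MARTHA" and "MARHTA" have 6 common characters and 1 transposition, so Jaro is (6/6 + 6/6 + 5/6) / 3 ≈ 0.944. With the shared prefix "MAR", Jaro-Winkler is 0.944 + 3 × 0.1 × (1 − 0.944) ≈ 0.961, which matches the values above.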