vickumar1981 / stringdistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..
https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html
Other
78 stars 15 forks source link

longestCommonSeq implementation has exponential complexity #64

Closed anschlechter closed 3 years ago

anschlechter commented 3 years ago

longestCommonSeq is not efficient, it takes a very long time if the strings to be compared have a large distance.

The implemented algorithm has exponential complexity O(2^(n+m)).

Expected Result

predictable performance. algorithm should have O(n*m) complexity as is possible.

c.f. https://www.techiedelight.com/longest-common-subsequence/

--> translated to SCALA:

def LCSLength(X: String, Y: String): Int = { val m = X.length val n = Y.length // lookup table stores solution to already computed subproblems, // i.e., T[i][j] stores the length of LCS of substring // X[0…i-1] and Y[0…j-1] val T = Array.ofDim[Int](m + 1, n + 1) // fill the lookup table in a bottom-up manner for (i <- 1 to m) { for (j <- 1 to n) { if (X.charAt(i - 1) == Y.charAt(j - 1)) { // if the current character of X and Y matches T(i)(j) = T(i - 1)(j - 1) + 1 } else { // otherwise, if the current character of X and Y don't match T(i)(j) = Integer.max(T(i - 1)(j), T(i)(j - 1)) } } } // LCS will be the last entry in the lookup table T(m)(n) }

Actual Result

"unpredictable" runtime

Reproduction Steps

Compare "RUE NELSON MANDELA" with "RUE ALBERT UNGEHEUER" -> it just takes way to long to find out that these are not the same.

vickumar1981 commented 3 years ago

@anschlechter thanks for reporting the issue.

~~the implementation used in the library for longestCommonSeq is here:
https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/impl/LongestCommonSeqImpl.scala#L4~~ (fixed)

Will add that test case and compare implementations. :+1:

vickumar1981 commented 3 years ago

@anschlechter published a 1.2.5-SNAPSHOT with your changes included here: https://github.com/vickumar1981/stringdistance/pull/65

Sonyatype snapshot repo: https://oss.sonatype.org/content/repositories/snapshots/com/github/vickumar1981/

Testing on my end, it looks to be much faster :). Thanks for the great test cases and the implementation fix!

If you could/would possibly pull down or use the 1.2.5-SNAPSHOT and confirm that this fixes the issue, then I can merge these changes and include them in the 1.2.5 release on maven.

Thanks again! cheers.

anschlechter commented 3 years ago

Hi, I tried the 1.2.5-SNAPSHOT and it looks good to me. Thank you.

vickumar1981 commented 3 years ago

Addressed by: https://github.com/vickumar1981/stringdistance/pull/66

Fixed in version 1.2.5. Published v1.2.5 artifact to sonatype and should be available on maven central shortly, once it syncs.