mitsuhiko / similar

A high level diffing library for rust based on diffs
https://insta.rs/similar
Apache License 2.0

LCS isn't Hunt-McIlroy... and might not be LCS either? #44

Open tef opened 1 year ago

tef commented 1 year ago

From my barest understanding of the code[^3], the current algorithm tries to do the classical approach to computing a longest common subsequence: fill in a matrix of LCS lengths with dynamic programming, then backtrack through it to recover the edit script.
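For reference, a minimal sketch of that classical approach (my own illustration of the textbook method, not the crate's code): fill in a table of LCS lengths, then walk back from the far corner.

    // Textbook dynamic-programming LCS: O(n*m) time and space.
    // An illustration of the classical approach, not similar's code.
    fn classical_lcs<T: PartialEq + Clone>(old: &[T], new: &[T]) -> Vec<T> {
        let (n, m) = (old.len(), new.len());
        // table[i][j] = length of the LCS of old[..i] and new[..j].
        let mut table = vec![vec![0usize; m + 1]; n + 1];
        for i in 0..n {
            for j in 0..m {
                table[i + 1][j + 1] = if old[i] == new[j] {
                    table[i][j] + 1
                } else {
                    table[i][j + 1].max(table[i + 1][j])
                };
            }
        }
        // Backtrack from the far corner to recover one LCS.
        let (mut i, mut j) = (n, m);
        let mut out = Vec::new();
        while i > 0 && j > 0 {
            if old[i - 1] == new[j - 1] {
                out.push(old[i - 1].clone());
                i -= 1;
                j -= 1;
            } else if table[i - 1][j] >= table[i][j - 1] {
                i -= 1;
            } else {
                j -= 1;
            }
        }
        out.reverse();
        out
    }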

Well, almost. The code in the original commit[^1] does it backwards then forwards: the matrix is filled in from the ends of the sequences, and the edit script is then read out from the front.

Right now? Well, after [^2], the code doesn't seem to do that at all. The lines `let i = new_len - i - 1;` and `let j = old_len - j - 1;` got dropped.

This means the matrix is now filled in forwards, but the code still looks ahead in the table for results, as if it had been filled in backwards.

This will produce a result, but it won't always produce the correct result:

For example: If the two strings are entirely different, the code will seem to work: It will pick delete in the loop, and then realise it bailed out early, and add in the missing inserts.

The test case also produces the right answer, but again for the wrong reason.

    let a: &[usize] = &[0, 1, 3, 4, 5];
    let b: &[usize] = &[0, 1, 4, 5, 8, 9];

This seems to work because:

I haven't directly tested it, but I would assume that swapping the arguments `0, 1, 4, 5, 8, 9` and `0, 1, 3, 4, 5` around produces the wrong output.
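For anyone who wants to check, a repro should only take a few lines; this sketch assumes the crate's `capture_diff_slices` helper and its `Algorithm::Lcs` variant:

    use similar::{capture_diff_slices, Algorithm};

    fn main() {
        let a: &[usize] = &[0, 1, 3, 4, 5];
        let b: &[usize] = &[0, 1, 4, 5, 8, 9];
        // If the LCS backend is right, both directions should come back
        // with a minimal edit script.
        println!("{:?}", capture_diff_slices(Algorithm::Lcs, a, b));
        println!("{:?}", capture_diff_slices(Algorithm::Lcs, b, a));
    }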

The other problem I think I've seen is that the LCS algorithm[^3] opens up with "Hunt–McIlroy / Hunt–Szymanski LCS diff algorithm", but it doesn't really look like the hunt/mcilroy/szymanski LCS algorithm.

As mentioned above, the code takes the classical solution—building a matrix, backtracking—to find an LCS. The thing is, hunt/mcilroy[^4] and hunt/szymanski[^5] do not use a matrix of values.

Instead, both algorithms do the following: for each element of one file, they build a list (the MATCHLIST) of the positions where it occurs in the other file, then sweep through those matches while maintaining a threshold array, the smallest position at which a common subsequence of each length can end, updating it with a binary search. (There's a sketch of this after the aside below.)

Aside: The only difference between Hunt/Szymanski and Hunt/McIlroy is that H/S considers the matches in MATCHLIST in descending order (which makes for simpler code), and Hunt/McIlroy considers the matches in ascending order (which makes for trickier but faster code). The Kuo/Cross algorithm[^6] explains it better than either original paper.
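To make the shape of that concrete, here is a rough sketch of the Hunt/Szymanski variant, reconstructed from the papers rather than from any existing implementation (length only; the real algorithms also keep back-links so the subsequence itself can be recovered):

    use std::collections::HashMap;
    use std::hash::Hash;

    // Sketch of Hunt/Szymanski: no n-by-m matrix, just the match lists
    // plus a threshold array updated by binary search.
    fn hs_lcs_len<T: Eq + Hash>(a: &[T], b: &[T]) -> usize {
        // MATCHLIST: for every element value, where it occurs in `b`.
        let mut matchlist: HashMap<&T, Vec<usize>> = HashMap::new();
        for (j, x) in b.iter().enumerate() {
            matchlist.entry(x).or_default().push(j);
        }
        // thresh[k] = smallest position in `b` at which a common
        // subsequence of length k + 1 can currently end.
        let mut thresh: Vec<usize> = Vec::new();
        for x in a {
            if let Some(matches) = matchlist.get(x) {
                // Walk this row's matches in descending order of position
                // (the H/S choice), so two matches in the same row can
                // never extend one another.
                for &j in matches.iter().rev() {
                    match thresh.binary_search(&j) {
                        // j is already a threshold: nothing to do.
                        Ok(_) => {}
                        // j extends the longest subsequence found so far.
                        Err(k) if k == thresh.len() => thresh.push(j),
                        // j lowers the ending position for length k + 1.
                        Err(k) => thresh[k] = j,
                    }
                }
            }
        }
        thresh.len()
    }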

It's not much of a bug, by comparison, but it's what led me to reading the source code in the first place, so I thought I'd mention it too.

[^1] Original commit https://github.com/mitsuhiko/similar/commit/bee5d88b0296ba0b2811d001fcec0c146fb439b8

[^2] "Performance Improvements" https://github.com/mitsuhiko/similar/commit/45bcb3943caaf02ebc82ccfa332b800765b503e3

[^3] https://github.com/mitsuhiko/similar/blob/main/src/algorithms/lcs.rs#L1

[^4] "An Algorithm for Differential File Comparison" Hunt/McIlroy https://www.cs.dartmouth.edu/~doug/diff.pdf

[^5] "A Fast Algorithm for Computing Longest Common Subsequences" Hunt/Szymanski https://dl.acm.org/doi/pdf/10.1145/359581.359603

[^6] "An improved algorithm to find the length of the longest common subsequence" Kuo/Cross https://dl.acm.org/doi/pdf/10.1145/74697.74702

mitsuhiko commented 1 year ago

Thank you for taking the time to report this issue. I apologize for any confusion caused by the misnamed diffing algorithm. I initially implemented the algorithm because Myers has really terrible performance in some edge cases, and I tried to do some experiments about what can be done about it. I have since added support for bailing after a deadline for all implemented algorithms. I tried finding a reference for the actual name of the LCS algorithm and was under the assumption that this was the correct name.

So not sure what the actual name for that implementation is. For now I will remove the name of the algorithm and just describe it as LCS. Generally speaking I would like to see some improvements in the actual implementation of the algorithms. In particular #15 is a huge omission from this library but I did not have the time and rigor to actually implement those.

Pull requests are absolutely appreciated, whether for improvements to the implementation or for other algorithms.

tef commented 1 year ago

> So not sure what the actual name for that implementation is.

It's one of those folk algorithms that appears in many forms.

https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm - probably the most popular name.

https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm - lists a whole bunch of very similar algorithms.

It also appears in Hirschberg's "A linear space algorithm for computing maximal common subsequences" as Algorithm A. It's so common that even papers from the dawn of time don't bother to give it a name.

(Of note, Myers diff was predated by Ukkonen's near-identical algorithm for diff, but the Myers name stuck.)

> Myers has really terrible performance in some edge cases

Sadly, quadratic behaviour is the norm. Hunt-McIlroy diff chugs along when the resulting diff is small, much like Myers does when the resulting diff is large. There are some algorithms with better all-round performance, though.

Claus Rick's "New Algorithms for the Longest Common Subsequence Problem" offers two algorithms with better performance overall (and they can be combined), but they aren't lightweight by any means. They won't be as fast as Myers diff on very small edit sequences, but they are faster than the more traditional approaches.

In theory, you could use a more rounded algorithm to calculate the edit script distance, but I'd expect it to be faster to run Myers diff and just give up if the edit cost is too high.
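The deadline support mentioned above is already one way to spell that "give up" condition. A sketch, assuming the crate's `TextDiff::configure()` builder and its `timeout` knob:

    use std::time::Duration;
    use similar::{Algorithm, ChangeTag, TextDiff};

    fn main() {
        let old = "a\nb\nc\n";
        let new = "x\ny\nz\n";
        // Run Myers, but stop searching for a minimal script once the
        // deadline passes and settle for a cruder diff instead.
        let diff = TextDiff::configure()
            .algorithm(Algorithm::Myers)
            .timeout(Duration::from_millis(100))
            .diff_lines(old, new);
        for change in diff.iter_all_changes() {
            let sign = match change.tag() {
                ChangeTag::Delete => "-",
                ChangeTag::Insert => "+",
                ChangeTag::Equal => " ",
            };
            print!("{}{}", sign, change);
        }
    }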

The only other approach to handling edge cases that comes to mind is patience diff. As part of its search for unique elements, you could count the number of distinct elements in each file; over a certain threshold, you could avoid doing Myers diff to begin with.
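That pre-check is cheap to sketch. Here's one way it might look; the `pick_algorithm` helper and the 0.5 cut-off are made up for illustration:

    use std::collections::HashSet;
    use std::hash::Hash;
    use similar::Algorithm;

    // Hypothetical pre-pass: if most elements are unique, patience
    // diff's unique-element anchoring has plenty to work with;
    // otherwise fall back to Myers. The threshold is illustrative only.
    fn pick_algorithm<T: Eq + Hash>(old: &[T], new: &[T]) -> Algorithm {
        let distinct = |xs: &[T]| xs.iter().collect::<HashSet<_>>().len();
        let total = (old.len() + new.len()).max(1);
        let uniqueness = (distinct(old) + distinct(new)) as f64 / total as f64;
        if uniqueness > 0.5 {
            Algorithm::Patience
        } else {
            Algorithm::Myers
        }
    }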

I don't expect this to be helpful, beyond maybe throwing a few interesting links in your direction. Either way, thanks for taking the time to reply, too.

I'm still not sure if I read your code correctly about "the matrix is filled in forwards but looks ahead for results", but once I get a working Rust environment I can confirm for myself.

tef commented 1 year ago

Well, I missed the `rev()` in the code, but also: if you run lcs diff on `0, 1, 4, 5, 8, 9` as old and `0, 1, 3, 4, 5` as new, you get

 0
 1
-4
+3
-5
-8
-9
+4
+5

So, something's up.