mitsuhiko / similar

A high level diffing library for rust based on diffs
https://insta.rs/similar

LCS isn't Hunt-McIlroy... and might not be LCS either? #44

Open tef opened 1 year ago

tef commented 1 year ago

From my barest understanding of the code[^3], the current algorithm tries to take the classical approach to computing a longest common subsequence:
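Roughly, the classical textbook version looks like this (a from-scratch sketch for reference, not the crate's actual code): fill a table of LCS lengths over prefixes, then backtrack from the far corner to recover a diff.

    fn lcs_diff<T: PartialEq + std::fmt::Debug>(old: &[T], new: &[T]) {
        let (n, m) = (old.len(), new.len());
        // table[i][j] = LCS length of old[..i] and new[..j]
        let mut table = vec![vec![0usize; m + 1]; n + 1];
        for i in 0..n {
            for j in 0..m {
                table[i + 1][j + 1] = if old[i] == new[j] {
                    table[i][j] + 1
                } else {
                    table[i + 1][j].max(table[i][j + 1])
                };
            }
        }
        // Backtrack from (n, m), collecting diff ops in reverse order.
        let (mut i, mut j) = (n, m);
        let mut ops = Vec::new();
        while i > 0 || j > 0 {
            if i > 0 && j > 0 && old[i - 1] == new[j - 1] {
                ops.push(format!(" {:?}", old[i - 1]));
                i -= 1;
                j -= 1;
            } else if i > 0 && (j == 0 || table[i - 1][j] >= table[i][j - 1]) {
                ops.push(format!("-{:?}", old[i - 1]));
                i -= 1;
            } else {
                ops.push(format!("+{:?}", new[j - 1]));
                j -= 1;
            }
        }
        for op in ops.iter().rev() {
            println!("{}", op);
        }
    }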

Well, almost. The code in the original commit[^1] does it backwards then forwards: the table is filled from the ends of the sequences, and the walk that emits the diff runs from the front.

Right now? Well, after the performance improvements commit[^2], the code doesn't seem to do that at all. The lines `let i = new_len - i - 1;` and `let j = old_len - j - 1;` got dropped.

This means the matrix is now filled in forwards, but it still looks ahead, into entries that haven't been computed yet, for results.

This will produce a result, but it won't always be the correct one:

For example, if the two strings are entirely different, the code will seem to work: it will pick delete in the loop, then realise it bailed out early and add in the missing inserts.

The test case also produces the right answer, but again for the wrong reason:

    let a: &[usize] = &[0, 1, 3, 4, 5];
    let b: &[usize] = &[0, 1, 4, 5, 8, 9];

This seems to work only because of how these particular inputs happen to line up.

I haven't directly tested it, but I would assume that swapping the arguments `0, 1, 4, 5, 8, 9` and `0, 1, 3, 4, 5` around produces the wrong output.
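For anyone who wants to check, something like this should do it with the crate's slice API (`capture_diff_slices` with `Algorithm::Lcs`; printing the raw ops is enough to see the shape of the answer):

    use similar::{capture_diff_slices, Algorithm};

    fn main() {
        let old: &[usize] = &[0, 1, 4, 5, 8, 9];
        let new: &[usize] = &[0, 1, 3, 4, 5];
        // Each DiffOp says which ranges of old/new are equal, deleted, etc.
        for op in capture_diff_slices(Algorithm::Lcs, old, new) {
            println!("{:?}", op);
        }
    }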

The other problem I think I've seen is that the LCS module[^3] opens with "Hunt–McIlroy / Hunt–Szymanski LCS diff algorithm", but it doesn't really look like the Hunt/McIlroy or Hunt/Szymanski LCS algorithm.

As mentioned above, the code takes the classical solution (build a matrix, then backtrack) to find an LCS. The thing is, Hunt/McIlroy[^4] and Hunt/Szymanski[^5] do not use a matrix of values.

Instead, both algorithms build a MATCHLIST (for each element of one sequence, the positions where it matches in the other), then thread those matches through an array of thresholds, one per candidate LCS length, from which the answer is recovered at the end (sketched below).

Aside: the only difference between Hunt/Szymanski and Hunt/McIlroy is that H/S considers the matches in MATCHLIST in descending order (which has simpler code), and Hunt/McIlroy considers the matches in ascending order (which has trickier but faster code). The Kuo/Cross paper[^6] explains it better than either original paper.
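For a feel of the shape, here's a from-scratch sketch of the Hunt/Szymanski idea (length only, nothing from the crate):

    use std::collections::HashMap;
    use std::hash::Hash;

    fn hs_lcs_len<T: Eq + Hash + Copy>(a: &[T], b: &[T]) -> usize {
        // MATCHLIST: for each value, the positions where it occurs in `b`.
        let mut matchlist: HashMap<T, Vec<usize>> = HashMap::new();
        for (j, x) in b.iter().enumerate() {
            matchlist.entry(*x).or_default().push(j);
        }
        // thresh[k] = smallest position in `b` that ends a common
        // subsequence of length k + 1; it stays sorted, so we can binary search.
        let mut thresh: Vec<usize> = Vec::new();
        for x in a {
            if let Some(js) = matchlist.get(x) {
                // Take this row's matches in descending order (the H/S trick),
                // so one update never feeds a later match in the same row.
                for &j in js.iter().rev() {
                    match thresh.binary_search(&j) {
                        Ok(_) => {} // j is already a threshold
                        Err(k) if k == thresh.len() => thresh.push(j),
                        Err(k) => thresh[k] = j,
                    }
                }
            }
        }
        thresh.len()
    }

No matrix anywhere: the work is proportional to the number of matches, not n × m.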

It's not much of a bug by comparison, but it's what led me to reading the source code in the first place, so I thought I'd mention it too.

[^1]: Original commit: https://github.com/mitsuhiko/similar/commit/bee5d88b0296ba0b2811d001fcec0c146fb439b8

[^2]: "Performance Improvements": https://github.com/mitsuhiko/similar/commit/45bcb3943caaf02ebc82ccfa332b800765b503e3

[^3]: https://github.com/mitsuhiko/similar/blob/main/src/algorithms/lcs.rs#L1

[^4]: "An Algorithm for Differential File Comparison", Hunt/McIlroy: https://www.cs.dartmouth.edu/~doug/diff.pdf

[^5]: "A Fast Algorithm for Computing Longest Common Subsequences", Hunt/Szymanski: https://dl.acm.org/doi/pdf/10.1145/359581.359603

[^6]: "An improved algorithm to find the length of the longest common subsequence", Kuo/Cross: https://dl.acm.org/doi/pdf/10.1145/74697.74702

mitsuhiko commented 1 year ago

Thank you for taking the time to report this issue. I apologize for any confusion caused by the misnamed diffing algorithm. I initially implemented the algorithm because Myers has really terrible performance in some edge cases, and I tried some experiments on what could be done about that. I have since added support for bailing after a deadline to all implemented algorithms. I tried finding a reference for the actual name of the LCS algorithm and was under the assumption that this was the correct name.

So I'm not sure what the actual name for that implementation is. For now I will remove the name of the algorithm and just describe it as LCS. Generally speaking I would like to see some improvements in the actual implementation of the algorithms. In particular, #15 is a huge omission from this library, but I did not have the time and rigor to actually implement those.

Pull requests for improvements to the implementation, or for other algorithms, are absolutely appreciated.

tef commented 1 year ago

> So I'm not sure what the actual name for that implementation is.

It's one of those folk algorithms that appears in many forms.

- https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm is probably the most popular name.
- https://en.wikipedia.org/wiki/Wagner%E2%80%93Fischer_algorithm lists a whole bunch of very similar algorithms.

It also appears in Hirschberg's "A linear space algorithm for computing maximal common subsequences" as Algorithm A. It's so common that even papers from the dawn of time don't bother to give it a name.

(Of note, Myers diff was predated by Ukkonen's near-identical algorithm for diff, but the Myers name stuck.)

> Myers has really terrible performance in some edge cases

Sadly, quadratic behaviour is the norm. Hunt-McIlroy diff chugs along when the resulting diff is small, much like Myers does when the resulting diff is large. There are some algorithms with better all-round performance, though.

Claus Rick's "New Algorithms for the Longest Common Subsequence Problem" offers two algorithms (which can be combined) with better performance overall, but they aren't lightweight by any means. They won't be as fast as Myers diff on very small edit scripts, but they are faster than more traditional approaches.

In theory, you could use a more rounded algorithm to calculate the edit script distance, but I'd expect it to be faster to run Myers diff and just give up if the edit cost is too high.

The only other approach to handling edge cases that comes to mind is patience diff. In its search for unique elements, you could count the number of distinct elements in each file; over a certain threshold, you could avoid doing Myers diff to begin with (a rough sketch follows).
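One reading of that counting idea (the names and the threshold decision are made up here):

    use std::collections::HashMap;
    use std::hash::Hash;

    /// Count the values that occur exactly once on each side; patience diff
    /// can only anchor on these, so too few of them means it won't help much.
    fn unique_anchor_count<T: Eq + Hash>(old: &[T], new: &[T]) -> usize {
        fn occurrences<T: Eq + Hash>(xs: &[T]) -> HashMap<&T, usize> {
            let mut counts = HashMap::new();
            for x in xs {
                *counts.entry(x).or_insert(0usize) += 1;
            }
            counts
        }
        let (old_counts, new_counts) = (occurrences(old), occurrences(new));
        old_counts
            .iter()
            .filter(|&(x, &n)| n == 1 && new_counts.get(x) == Some(&1))
            .count()
    }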

I don't expect this to be helpful, beyond maybe throwing a few interesting links in your direction. Either way, thanks for taking the time to reply, too.

I'm still not sure if I read your code correctly regarding "the matrix is filled in forwards but looks ahead for results", but once I get a working Rust environment I can confirm for myself.

tef commented 1 year ago

Well, I missed the `rev()` in the code, but also: if you run the LCS diff with `0, 1, 4, 5, 8, 9` as old and `0, 1, 3, 4, 5` as new, you get

 0
 1
-4
+3
-5
-8
-9
+4
+5

So, something's up.