gliese1337 / fast-myers-diff

MIT License
47 stars 8 forks source link

Non optimal diff #17

Open caub opened 9 months ago

caub commented 9 months ago

Using the diff wrapper in https://github.com/gliese1337/fast-myers-diff/pull/16

const s1 = 'Itaque earum rerum consequatur';
const s2 = 'Itaque alias consequatur';

console.log(diff(s1, s2));
// [ [ 0, 'Itaque ' ], [ -1, 'e' ], [ 0, 'a' ], [ -1, 'rum rerum' ], [ 1, 'lias' ], [ 0, ' consequatur' ] ]

All good,

But just adding a final dot in one, gives something not optimal (optimal would just be an additional insertion)

const s1 = 'Itaque earum rerum consequatur';
const s2 = 'Itaque alias consequatur.';

console.log(diff(s1, s2));
// [ [ 0, 'Itaque ' ], [ -1, 'e' ], [ 0, 'a' ], [ -1, 'rum' ], [ 1, 'lias' ], [ 0, ' ' ], [ -1, 'rerum ' ], [ 0, 'consequatur' ], [ 1, '.' ] ]

That makes 9 items instead of 7 (Both fast-array-diff and fast-diff return 7 items)

gliese1337 commented 9 months ago

Thanks for the report. I'll look into this.

gliese1337 commented 9 months ago

We get this with the lcs:

lcs('Itaque earum rerum consequatur','Itaque alias consequatur') ->
[ 0, 0, 7 ] [ 8, 7, 1 ] [ 18, 12, 12 ]
[ 'Itaque ', 'a', ' consequatur' ]

lcs('Itaque earum rerum consequatur','Itaque alias consequatur.') ->
[ 0, 0, 7 ] [ 8, 7, 1 ] [ 12, 12, 1 ] [ 19, 13, 11 ]
[ 'Itaque ', 'a', ' ', 'consequatur' ]

And the underlying diff output is:

lcs('Itaque earum rerum consequatur','Itaque alias consequatur') ->
[ 7, 8, 7, 7 ] [ 9, 18, 8, 12 ] 
[ 'e', '' ], [ 'rum rerum', 'lias' ] 

lcs('Itaque earum rerum consequatur','Itaque alias consequatur.') ->
[ 7, 8, 7, 7 ] [ 9, 12, 8, 12 ] [ 13, 19, 13, 13 ] [ 30, 30, 24, 25 ]
[ 'e', '' ], [ 'rum', 'lias' ], [ 'rerum ', '' ], [ '', '.' ]

It works just fine when adding 2 periods. So something's got to be going wrong with the "divide" part of the "divide and conquer" that's causing it to trip up right on that boundary between 24 and 25 characters. 3 periods breaks it again, so it's probably an off-by-one parity thing.

gliese1337 commented 9 months ago

Looks like this bug has been around for a while; at least as far back as https://github.com/gliese1337/fast-myers-diff/commit/beeb1837e58e3e08df6e93b94988e90418292f7d

gliese1337 commented 9 months ago

I figured out what's going on; it's not actually a "bug" per se...

The issue is that, when passing in a string, you get a character-level diff. The length of the optimal diff is thus measured in terms of the number of characters inserted or deleted. Adjacent insertions and deletions just get grouped together for the sake of efficiency.

However, the optimal diff is not unique; there are multiple possible edit scripts that have the same minimal length. In this case, the difference is in which space character gets deleted--they are indistinguishable, so you have to get rid of one, but it doesn't matter which one. However, the different possible diffs are not equivalent in how convenient they are for grouping adjacent edits together. In this case, one choice results in retaining a space which splits runs of deletions apart, and the other choice retains a space which keeps the runs of deletions vs. retentions better collated. That's what makes the difference in the number of collected items in the output sequence. In this case, adding a single character alters the split point for the divide-and-conquer step of the algorithm in such a way that it flips from arbitrarily finding one of those possible equivalent character-level diffs to the other one. If, however, you count up the characters in the deletions and retentions, you'll find that they are equal--and the count of characters in the insertions goes up by one.

Now, it would indeed be a significant improvement in convenience if the algorithm could always pick the diff which, out of all optimal options, maximizes "fusibility", but I'm not at all sure how to do that. I might be able to tweak it to make different choices in this particular case, but that'll probably just "break" some other test case. This'll take some research.

make-github-pseudonymous-again commented 9 months ago

@caub What output would you expect for diff("aabbaabbaabbaa", "abababa")?

caub commented 9 months ago
> require('fast-diff')("aabbaabbaabbaa", "abababa")
[
  [ 0, 'a' ],  [ -1, 'ab' ],
  [ 0, 'ba' ], [ -1, 'ab' ],
  [ 0, 'ba' ], [ -1, 'a' ],
  [ 0, 'b' ],  [ -1, 'ba' ],
  [ 0, 'a' ]
]
make-github-pseudonymous-again commented 8 months ago

Why not:

[
  [ 0, 'a' ],  [ -1, 'ab' ],
  [ 0, 'ba' ], [ -1, 'ab' ],
  [ 0, 'ba' ], [ -1, 'ab' ],
  [ 0, 'ba' ],  [ -1, 'a' ]
]

?

caub commented 8 months ago

Oh yes defintitely, the less diff chunks the better!

note: it's also important to keep it efficient for larger strings

gliese1337 commented 8 months ago

As far as I can tell based on my understanding of various diff algorithms, it is not possible to always get the smallest number of chunks and remain efficient for larger inputs. The only way to do it is to check every possible shortest edit script, which requires quadratic time and memory. I might have missed something in my research, but the fact that Git doesn't bother to optimize diffs gives me some confidence in my conclusion.

We might be able to do slightly better specifically in cases like this, where the differing edit scripts correspond to the same LCS, by calculating the LCS as a string and then deriving maximum-length runs of deletions and insertions from that. But I'm not sure how fast that can actually be, and it still wouldn't address the general case.

make-github-pseudonymous-again commented 8 months ago

@caub So fast-diff does not satisfy your requirements either. Is there an implementation out there that does?

make-github-pseudonymous-again commented 8 months ago

The only way to do it is to check every possible shortest edit script, which requires quadratic time and memory.

@gliese1337 Doesn't Myers algo require quadratic time/space in the worst case anyway? Do cases where there are many minimum edit scripts candidates also correspond to some form of worst case (i.e. the example I gave above).

gliese1337 commented 8 months ago

@make-github-pseudonymous-again That depends on which version of Myers you implement; the paper has a quadratic space algorithm similar to Hirschberg, but then goes on to describe the linear space refinement. This implementation is based on the linear space refinement, and is O(min(N,M) + D) in space and O(min(N,M) * D) in time, where N and M are the input lengths and D is the number of differences found. Having many different possible edit scripts is not a factor--the algorithm only identifies one, and remains unaware of the existence of others.

caub commented 8 months ago

Ok interesting, so each lib may perform better on different inputs

Feel free to close so, thanks for having a detailed look at this!

make-github-pseudonymous-again commented 8 months ago

@gliese1337 Thanks for the clarification on space vs time. My question was whether the relationship between number of sets of edit rectangles with minimum D and space/time of enumerating those had been studied.

gliese1337 commented 8 months ago

@make-github-pseudonymous-again I did some looking, but the only examples of enumerating all possible LCSs/SESs involve constructing the entire NxM table and then reading out all of the paths, which requires O(NM) space and O(NM+DC) time (where D is the number of differences output, and C is the count of alternate edit scripts). I can imagine a variation on the Myers algorithm that would only require linear space, but only at the cost of a lot of repeated calculation, blowing up the run time. My guess would be that you could probably manage O(min(N,M) * DC) time complexity, but I'd have to work the algorithm out in detail to be sure, and that would be quite an effort....

rizrmd commented 8 months ago

Ok interesting, so each lib may perform better on different inputs

Feel free to close so, thanks for having a detailed look at this!

I'm wondering if we can unify all of those lib. Apply some heuristic to choose the best diff algorithm for the input.

Or just do it naively, put a timeout for each algorithm. The one that does not timeout is the winner.

caub commented 8 months ago

Just describing my use case, which is more visual diffs, we need to switch between 2 modes:

For now we just use 1st mode for short values, but ideally we would do something better, I've looked at similarity indexes, and tried also to make mine, but it can fail in some cases. Anyway, it's ok as-is for us, just wanted to let you know

make-github-pseudonymous-again commented 8 months ago

we need to switch between 2 modes

Myers' algorithm is ideal for this as you can decide on the cutoff value in advance and avoid having the algorithm degenerate to quadratic in case the inputs are too different.

caub commented 8 months ago

Interesting, but I haven't seen any of those libs (fast-diff, fast-myers-diff, fast-array-diff, ..) allow to specify such cutoff value

gliese1337 commented 8 months ago

I'm not sure about other libraries, but for this one there is no need to provide a cutoff parameter--since results incrementally calculated and provided via generator, just stop reading the results when you hit whatever limit you want.

I suppose it might be possible to achieve better efficiency if the intended cutoff were available internally, but that approach ought to suffice for most cases.

make-github-pseudonymous-again commented 8 months ago

@gliese1337 If the inputs do not have any tokens in common nothing will be output until each pair of input tokens is tested. Unless you skip merging adjacent rectangles after the recursion bottoms up I guess.