Closed dtolnay closed 4 years ago
Another example that comes out wrong:
Again, it would be better to prioritize the common prefix and highlight the 5 trailing ^ characters as added.
Hi @dtolnay
Thanks for the suggestion and the examples. I will take a look soon.
Greedily expanding the snakes as much as possible is never a pessimization, but it can still miss the global optimum.
In other words, given the result yielded by diffr-lib:
-[ note: ]AAA
+[ note:] BBB[ ]CCC
the second snake can be extended one byte to the left (reducing the count of snakes by one) and this is always a win:
+[ note: ]BBB CCC
An exact solution to the problem of minimizing the number of snakes that project to a given LCS would be better, of course. But it does not help in the second case (in the given example, as in your proposal, the number of snakes is 1).
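The leftward extension described above can be sketched as a small postprocessing pass. This is not diffr's actual code, only a minimal illustration of the idea, assuming inserted bytes are represented as sorted, disjoint, half-open ranges into the added line (here I slide runs rightward, which is the mirror image of extending the following snake leftward):

```rust
// Hypothetical sketch (not diffr's implementation): slide each inserted run
// toward the next one where the bytes allow it, then merge runs that become
// adjacent. `runs` are sorted, disjoint, half-open byte ranges into
// `new_text`; bytes outside the runs are shared with the old line.
fn slide_and_merge(new_text: &[u8], runs: &[(usize, usize)]) -> Vec<(usize, usize)> {
    let mut out: Vec<(usize, usize)> = Vec::new();
    for (i, &(s, e)) in runs.iter().enumerate() {
        // Never slide past the start of the next inserted run.
        let limit = runs.get(i + 1).map_or(new_text.len(), |r| r.0);
        let (mut start, mut end) = (s, e);
        // Sliding one byte right is legal whenever the byte entering the run
        // on the right equals the byte leaving it on the left: the inserted
        // string is unchanged, only its position moves.
        while end < limit && new_text[start] == new_text[end] {
            start += 1;
            end += 1;
        }
        // Merge with the previous run if the slide made them adjacent.
        match out.last_mut() {
            Some(last) if last.1 == start => last.1 = end,
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    // " note: BBB CCC" with inserted runs " BBB" and "CCC" (the diffr-lib
    // result above): sliding the first run one byte right merges the two
    // runs into a single inserted "BBB CCC", leaving " note: " shared.
    let new_text = b" note: BBB CCC";
    let merged = slide_and_merge(new_text, &[(6, 10), (11, 14)]);
    assert_eq!(merged, vec![(7, 14)]);
    assert_eq!(&new_text[7..14], b"BBB CCC");
}
```

On the " note:" example this turns the two highlighted runs into one, matching the "+[ note: ]BBB CCC" output suggested above.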
Hi @dtolnay,
I am working on an improvement; after computing the longest common subsequence, I need to figure out the "best" partition of both parts of the diff (i.e., the one minimizing the number of distinct segments). I'll let you know here when I have some code that you can test.
@dtolnay I wrote some code that corresponds to the spec I have in mind; please let me know if the result corresponds to something like what you are looking for.
Thanks @mookid, I tested it in https://github.com/dtolnay/trybuild/pull/27 against some projects and it is definitely a lot better. I think it is still possible to do better, like in my example of diff -u <(echo '^^^^^^^^^^') <(echo '^^^^^^^^^^^^^^^')
above I believe most people would perceive this as characters added to the end of the line rather than characters inserted at the beginning, i.e. treating the common prefix as shared makes more sense. But at this point I would use this if you released it.
You wrote in https://github.com/mookid/diffr/pull/37#issue-345431751 that you are concerned about the performance. Can you give me an idea of how bad it is? What size of input are you testing on and how long does it take to compute the diff?
The worst case scenario I have seen is the following diff: https://github.com/git/git/commit/786dabecd4766bfda0083491ef543e3b5d1d9f39
for which the runtime goes from ~13s to ~55s on my dev machine, which is pretty bad. In the worst case, the two postprocessing steps together take roughly as long as the Myers algorithm itself.
I can still improve the postprocessing algorithm, but I think there will still be bad scenarios with that design. The alternative would be to tweak the Myers algorithm to yield the best split at the same time, but that is not easy.
About prioritizing the prefix or the suffix, it should be easy to do.
Thanks, that sounds like it wouldn't be a problem for my use case, since the diffs I deal with are going to be ~40 lines max. I wasn't sure if you had some kind of quintic loop going on where a 40-line diff could take multiple seconds. But obviously if you can improve the performance, that would be great!
Thanks for your work on this!
Ok, good to know! I don't think it will be a problem for that use case.
I published diffr-lib 0.1.3. Please let me know how it works for you!
In this diff, diffr aligns the space in the removed line with the space in the middle of the added line, rather than the space in the common prefix of both strings.
Is this a limitation of the algorithm or something that can be fixed on your end? It would be nicer to prioritize maximally long common prefixes so that we see "AAA" as removed and "BBB CCC" as added.
Another way to frame the problem would be to minimize the number of snakes. The above diff can be described with 1 snake but diffr_lib currently produces 2.
I haven't looked at the algorithm but it seems like a longest common subsequence tie being broken the wrong way, where maybe a simple heuristic for ties could produce much cleaner diffs.