Closed dtolnay closed 4 years ago
Another example that comes out wrong:
Again, it would be better to prioritize the common prefix and highlight the 5 trailing ^ characters as added.
Hi @dtolnay
Thanks for the suggestion and the examples. I will take a look soon.
Greedily expanding the snakes as much as possible is never a pessimization, but it can still miss the global optimum.
In other words, given the result yielded by diffr-lib:
-[ note: ]AAA
+[ note:] BBB[ ]CCC
the second snake can be extended one byte to the left (reducing the count of snakes by one) and this is always a win:
+[ note: ]BBB CCC
An exact solution to the problem of minimizing the number of snakes that project to a given LCS would be better, of course. But it does not help in the second case (in the given example, as in your proposal, the number of snakes is 1).
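The leftward extension described above can be sketched as a small postprocessing pass. This is not diffr's actual code, only a minimal illustration of the idea, assuming inserted bytes are represented as sorted, disjoint, half-open ranges into the added line (here I slide runs rightward, which is the mirror image of extending the following snake leftward):

```rust
// Hypothetical sketch (not diffr's implementation): slide each inserted run
// toward the next one where the bytes allow it, then merge runs that become
// adjacent. `runs` are sorted, disjoint, half-open byte ranges into
// `new_text`; bytes outside the runs are shared with the old line.
fn slide_and_merge(new_text: &[u8], runs: &[(usize, usize)]) -> Vec<(usize, usize)> {
    let mut out: Vec<(usize, usize)> = Vec::new();
    for (i, &(s, e)) in runs.iter().enumerate() {
        // Never slide past the start of the next inserted run.
        let limit = runs.get(i + 1).map_or(new_text.len(), |r| r.0);
        let (mut start, mut end) = (s, e);
        // Sliding one byte right is legal whenever the byte entering the run
        // on the right equals the byte leaving it on the left: the inserted
        // string is unchanged, only its position moves.
        while end < limit && new_text[start] == new_text[end] {
            start += 1;
            end += 1;
        }
        // Merge with the previous run if the slide made them adjacent.
        match out.last_mut() {
            Some(last) if last.1 == start => last.1 = end,
            _ => out.push((start, end)),
        }
    }
    out
}

fn main() {
    // " note: BBB CCC" with inserted runs " BBB" and "CCC" (the diffr-lib
    // result above): sliding the first run one byte right merges the two
    // runs into a single inserted "BBB CCC", leaving " note: " shared.
    let new_text = b" note: BBB CCC";
    let merged = slide_and_merge(new_text, &[(6, 10), (11, 14)]);
    assert_eq!(merged, vec![(7, 14)]);
    assert_eq!(&new_text[7..14], b"BBB CCC");
}
```

On the " note:" example this turns the two highlighted runs into one, matching the "+[ note: ]BBB CCC" output suggested above.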
Hi @dtolnay,
I am working on an improvement; after computing the longest common subsequence, I need to figure out the "best" partition of both parts of the diff (i.e., the one minimizing the number of distinct segments). I'll let you know here when I have some code that you can test.
@dtolnay I wrote some code that corresponds to the spec I have in mind; please let me know if the result corresponds to something like what you are looking for.
Thanks @mookid, I tested it in https://github.com/dtolnay/trybuild/pull/27 against some projects and it is definitely a lot better. I think it is still possible to do better, like in my example of diff -u <(echo '^^^^^^^^^^') <(echo '^^^^^^^^^^^^^^^')
above I believe most people would perceive this as characters added to the end of the line rather than characters inserted at the beginning, i.e. treating the common prefix as shared makes more sense. But at this point I would use this if you released it.
You wrote in https://github.com/mookid/diffr/pull/37#issue-345431751 that you are concerned about the performance. Can you give me an idea of how bad it is? What size of input are you testing on and how long does it take to compute the diff?
The worst case scenario I have seen is the following diff: https://github.com/git/git/commit/786dabecd4766bfda0083491ef543e3b5d1d9f39
for which the runtime goes from ~13s to ~55s on my dev machine, which is pretty bad. In the worst case, the two postprocessing steps together take roughly as long as the Myers algorithm itself.
I can still improve the postprocessing algorithm, but I think there will still be bad scenarios with that design. The alternative would be to tweak the Myers algorithm to yield the best split at the same time, but that is not easy.
About prioritizing the prefix or the suffix, it should be easy to do.
Thanks, that sounds like it wouldn't be a problem for my use case, since the diffs I deal with are going to be ~40 lines max. I wasn't sure if you had some kind of quintic loop going on where a 40-line diff could take multiple seconds. But obviously if you can improve the performance, that would be great!
Thanks for your work on this!
Ok, good to know! I don't think it will be a problem for that use case.
I published diffr-lib 0.1.3. Please let me know how it works for you!
In this diff, diffr aligns the space in the removed line with the space in the middle of the added line, rather than the space in the common prefix of both strings.
Is this a limitation of the algorithm or something that can be fixed on your end? It would be nicer to prioritize maximally long common prefixes so that we see "AAA" as removed and "BBB CCC" as added.
Another way to frame the problem would be to minimize the number of snakes. The above diff can be described with 1 snake but diffr_lib currently produces 2.
I haven't looked at the algorithm but it seems like a longest common subsequence tie being broken the wrong way, where maybe a simple heuristic for ties could produce much cleaner diffs.