Diff doesn't accurately minimize the number of diff hunks

mgerner commented 8 years ago

For some changes, diff hunks (groups of changed lines) can be computed in different ways. VS Code sometimes separates changes into different hunks when it arguably shouldn't, making viewing of diffs (slightly) more difficult.

VSCode Version: 1.4.0
OS Version: Debian stretch

Steps to Reproduce:

Create a file with the contents (note two blank lines) and commit it using e.g. git.

1

2

In the middle of the file, add the content

21

22

The diff will now be calculated as shown below (looking good):

Add a blank line to the beginning of the file. The first blank line will now show up as its own hunk (correctly), but the section that was added in the middle will be split into two hunks. Note that we have added two sections to our file, but the diff unnecessarily shows it as three sections having been added:

You can see an actual screenshot from code of mine below (there are other changes above it as well). In this code, it would be very good if we at a glance could see that there was one big change, but instead it (at a glance) looks like two changes. This is unnecessary.

alexdima commented 8 years ago

Technically both diffs are minimal w.r.t. the number of changed lines (from a LCS point of view), but not w.r.t. number of distinct diff regions. We could post-process the diff results to cover cases like this.

mgerner commented 8 years ago

Yes, that was my idea as well - introduce a very small cost associated with each distinct diff region, as you call them. That will ensure that the optimization (which I'm guessing is something based on dynamic programming?) will select the "better" diff.

mgerner commented 8 years ago

(By the way, I really wouldn't call this a feature request. It's clear that the current behaviour is incorrect, isn't it? Changing the top of a file shouldn't change how diffs for completely unrelated sections are calculated. It'd be better characterized as a bug -- and an easy-to-fix bug, at that.)

mgerner commented 8 years ago

Yet another example, that is even clearer: I added a new method to a Python file. This is as straight-forward as can be. What does the diff result look like?

The method I added have been broken up into three different diff sections, and the diff believes that I added my new method in the middle of an old method that was already present above the new one. This just looks bad, it's annoying, and it makes reading these diffs take longer than it should.

(I have to add though, that other than this, I absolutely love VS Code -- and the feeling in my office among those who use it is that it gives a good image of Microsoft.)

alexdima commented 8 years ago

I agree this is perceived as a bug, but technically it is a feature request :). The diffing algorithm is based at its core on solving multiple LCS (longest common substring) problems. The algorithm correctly finds a longest common substring sequence, it is just that in some cases there are multiple longest common substring sequences and some would feel "better" than others.

What I mean by that is that in your last example, return res is matched with a return res (so the algorithm identifies a valid longest common substring sequence), it is just not the most "human appeasing" match.

To get to the "best" longest common substring sequence, we need to write some new code that takes the LCS result and tries to "massage" it with some heuristics to transform it into a "better" LCS.

microsoft / vscode

Diff doesn't accurately minimize the number of diff hunks #11683