kpdecker / jsdiff

A javascript text differencing implementation.
BSD 3-Clause "New" or "Revised" License
7.75k stars 491 forks source link

Flip core algorithm so everything is no longer the mirror image of Myers's paper #440

Closed ExplodingCabbage closed 7 months ago

ExplodingCabbage commented 7 months ago

Until now, jsdiff's entire implementation has been kind of symmetrically flipped from the algorithm described in the Myers paper. In Myers' paper, each column in the edit graph corresponds to a character in the OLD text (and hence horizontal movements correspond to deletions), while each ROW corresponds to a character in the NEW text (and hence vertical movements correspond to insertions). See page 4:

... each horizontal edge to point (x,y) corresponds to the delete command ‘‘x D’’; and a sequence of vertical edges from (x,y) to (x,z) corresponds to the insert command ...

The numbers of diagonals, meanwhile, are such that a larger diagonal number corresponds to having a higher x - i.e. having done more deletions:

Number the diagonals in the grid of edit graph vertices so that diagonal k consists of the points (x,y) for which x − y = k

Note that this is the opposite of how diagonals are numbered in jsdiff:

addPath = bestPath[diagonalPath - 1]
removePath = bestPath[diagonalPath + 1],

The algorithm on page 6 shows us recording only the greatest x coordinate reached on each diagonal in array V, and shows us choosing whether to reach a diagonal via a vertical or horizontal move as follows:

If k = −D or k ≠ D and V[k − 1] < V[k + 1] Then
    x ← V[k + 1]
Else
    x ← V[k − 1]+1

x remaining the same (the top case) corresponds to a vertical move, i.e. an insertion. So the logic here says to do an insertion if the path on the more-deletion-heavy side of our target diagonal has made it further through the old string. This means we break ties by preferring to add an insertion on the end of a path that has previously done more deletions rather than adding a deletion on the end of a path which has previously done more insertions.

jsdiff does the mirror image of all of this. Its diagonalPath numbers go up as you do insertions, not deletions. In bestPath it stores objects that have a newPos property, corresponding to the y coordinate on a Myers edit graph, not the x. And it breaks ties in favour of doing a deletion, not an insertion.

This makes jsdiff's diffs differ from the diffs of more popular/canonical tools like the Unix diff command, which break ties in favour of doing insertions (i.e. in favour of putting deletions earlier). It also makes it confusing to compare this library to other Myers diff implementations or to apply optimizations suggested in Myers's paper (or elsewhere).

This PR rewrites the algo to better match the paper (though it introduces a hack, noted with a TODO, to preserve the incorrect tiebreak behaviour for now; fixing that is a breaking change and so will go out in a later release).