kpdecker / jsdiff

A javascript text differencing implementation.
BSD 3-Clause "New" or "Revised" License
7.69k stars 491 forks source link

Non-Minimal Array Diff with Emtpy Lines at End #523

Closed tsmaeder closed 4 weeks ago

tsmaeder commented 1 month ago

Consider the following program:


const jsdiff= require('diff');

const s1= [
    'first line',
    '',
    '',
    'last line'
]
const s2= [
    'first line',
    'NEW TEXT',
    '',
    'last line',
    '',
    ''
];

const result= jsdiff.diffArrays(s1, s2);
console.log(JSON.stringify(result, undefined, 2));

If I run it in node, I get the following output:

[
  {
    "count": 1,
    "value": [
      "first line"
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      "NEW TEXT"
    ]
  },
  {
    "count": 1,
    "value": [
      ""
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      "last line"
    ]
  },
  {
    "count": 1,
    "value": [
      ""
    ]
  },
  {
    "count": 1,
    "removed": true,
    "value": [
      "last line"
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      ""
    ]
  }
]

Note that the output has 7 elements, including 4 which are actual changes. Now if I replace the last two lines in the array s2 with 'a' and 'b', I get an edit script like so:

[
  {
    "count": 1,
    "value": [
      "first line"
    ]
  },
  {
    "count": 1,
    "added": true,
    "value": [
      "NEW TEXT"
    ]
  },
  {
    "count": 1,
    "value": [
      ""
    ]
  },
  {
    "count": 1,
    "removed": true,
    "value": [
      ""
    ]
  },
  {
    "count": 1,
    "value": [
      "last line"
    ]
  },
  {
    "count": 2,
    "added": true,
    "value": [
      "a",
      "b"
    ]
  }
]

Not that this change is shorter both in total length and also number of additions/removals. It seems obvious to me that the script in the second case could also be written for the first case, so the edit script in the first case is not minimal. It seems odd that the algorithm should behave differently for empty and non-empty lines.

ExplodingCabbage commented 4 weeks ago

Not that this change is shorter both in total length and also number of additions/removals. It seems obvious to me that the script in the second case could also be written for the first case, so the edit script in the first case is not minimal.

No, it's the same length measured in number of lines added or removed, which is what jsdiff is optimising for. There are fewer change objects in the second version, but one of those change objects represents the insertion of 2 lines, and therefore we count it as 2 insertions.

It seems odd that the algorithm should behave differently for empty and non-empty lines.

There's nothing special about empty lines happening here; it's just that with the first diff you did, the algorithm can (and does) preserve both blank lines from s1, since there are (more than) 2 blank lines in s2, and in the second diff, it can't do that any more.

Having the library support a "gap penalty" rather than measuring edit cost purely by the count of individual tokens inserted or deleted is a valid feature request, and there's an issue for it at https://github.com/kpdecker/jsdiff/issues/167 (which remains open for now), but it's probably a major change; I don't know if the Myers diff algorithm can be tweaked to support such a thing so adding it might require adding support for entirely new diff algorithms with different performance characteristics.