kpdecker / jsdiff

A javascript text differencing implementation.
BSD 3-Clause "New" or "Revised" License
7.92k stars 496 forks source link

Question: Why ONE letter matters in my specific case #400

Closed fisker closed 8 months ago

fisker commented 1 year ago

Prettier uses diffArrays() to track the removed character. Recently we found a special case that diffArrays() generate very different result. I'm not familiar how this module works, but I'm able to narrow it down to a small reproduce.

import { diffArrays } from "diff";

const code = `
// All m are represented in JSON.
// So, the prettierd.py controls a subprocess which spawns "node {this_file}".
import {} from "fs";
`.trim();

function diff(string) {
  return diffArrays([...'import {  } from "fs";'], [...string]);
}

console.log([diff(code), diff(code.replace("m", "a"))]); // Change the `m` letter to `a` in first line

The result

[
  [
    { count: 1, added: undefined, removed: true, value: [Array] },
    { count: 7, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 7, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 22, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 7, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 1, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 8, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 1, added: undefined, removed: true, value: [Array] },
    { count: 8, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 1, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 40, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 9, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 3, added: true, removed: undefined, value: [Array] },
    { count: 10, value: [Array] }
  ],
  [
    { count: 25, added: true, removed: undefined, value: [Array] },
    { count: 1, value: [Array] },
    { count: 88, added: true, removed: undefined, value: [Array] },
    { count: 7, value: [Array] },
    { count: 2, added: undefined, removed: true, value: [Array] },
    { count: 12, value: [Array] }
  ]
]

More details https://github.com/prettier/prettier/issues/14811

ExplodingCabbage commented 8 months ago

There's nothing incorrect happening here. The Myers diff algorithm tries to find the diff that requires the minimum total number of insertions and deletions to transform the first text into the second one. But sometimes there are two totally different possible diffs between two texts that have almost the same total number of insertions and deletions, and therefore a one-character change to one of the texts can tip the balance and totally change the optimal result. The example you've found is one of those times.

Below I've made visualisations - in the "sequence alignment" style that bioinformaticians use - of two different optimal diffs; I think these might help. Both of these are possible optimal diffs for the task represented by diff(code) above, i.e. without the m getting replaced with an a. I replaced the newlines with | to let me write the whole text on one line. + symbols in the bottom row indicate where a character gets added, - characters in the top row represent where one gets deleted.)

Text 1:

import {  } from "fs";

Text 2:

// All m are represented in JSON.|// So, the prettierd.py controls a subprocess which spawns "node {this_file}".|import {} from "fs";

First, here's the obvious diff to a human: insert the first two lines before the start of text 1, then delete the two spaces from inside { }.

// All m are represented in JSON.|// So, the prettierd.py controls a subprocess which spawns "node {this_file}".|import {--} from "fs";
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++import {  } from "fs";

In this case we only needed two deletions; we preserved all the other characters from our starting text.

Second, here's the less obvious but equally optimal (from a indel-minimisation perspective) diff that jsdiff finds, where it drops just the i and { from the original text and greedily uses the other characters as soon as they appear in the new text:

-// All m are represented in JSON.|// So, the prettierd.py -controls a subprocess which spawns "node {this_file}".|import {} from "fs";
i+++++++m+++++++p++++++++++++++++++++++o+++++++r+t++++++++ {++++++++ + ++++++++++++++++++++++++++++++++++++++++}+++++++++ +++from "fs";

Note that this approach still only required two deletions! So from the perspective of Myers diff, it's no worse than the human-friendly diff I showed. And the Myers algorithm in effect breaks ties by trying to get deletions in as early as possible - not by trying to group deletions or insertions into large continguous blocks.

From the final visualisation above, you can probably see why replacing the m with an a makes such a big difference - the longer text no longer contains an m except in the word import near the end, so in order to use something approximating the diff in my second visualisation, you'd need one more deletion - deleting the m from import in the shorter text - and that would increase the edit distance by 1, making such a diff score worse than the human-intuitive diff. So that one-character change makes it better to do something roughly along the lines of the human-friendly diff.

I can understand why for Prettier's purposes, it often might be preferable to have a diff that is in some way biased in favour of not breaking up words with insertions or deletions (like the +s in the middle of import in my final visualisation above), even if this results in a slightly greater edit distance. jsdiff isn't doing anything wrong by not delivering the behaviour that would be ideal for Prettier, though, since jsdiff claims to implement the Myers diff algorithm and that algorithm is just meant to give results with the minimum number of insertions and deletions - which it's done successfully in this case.

fisker commented 8 months ago

Thank you for the explanation, learned. @ExplodingCabbage