kpdecker / jsdiff

A javascript text differencing implementation.
BSD 3-Clause "New" or "Revised" License
7.69k stars 491 forks source link

[HELP] Different result when changing order of parameters? #506

Closed jvalrog closed 3 months ago

jvalrog commented 3 months ago

We are calling diffLines like this:

diffLines(oldBackup.data, newBackup.data)

result is:

added: 465 deleted: 467

if we change the order of the parameters we get:

added: 466 deleted: 466

Is this normal?

ExplodingCabbage commented 3 months ago

Hmmmm...

I would need to see the texts being diffed and carefully work through what happened, but this sounds wrong to me. To be clear, it is normal for the diff of y vs x to not simply be the same as the diff of x vs y but with additions turned into deletions and deletions into additions; very roughly speaking, when faced with a tie between diffs of equal cost, the Myers diff algorithm prefers the option that puts deletions as early as possible, and this creates an asymmetry where the diff depends on your order of parameters. For instance, if you diff abcd against acbd with diffChars, the b will deleted and a new b inserted after the c, but if you reverse the order of parameters, it'll instead be the c that gets deleted and reinserted.

What you're showing is more than just the expected kind of asymmetry, though - it's a change in the overall count of deleted and added tokens. I don't think that should happen. The stats from the first diff imply that the newBackup.data has two fewer lines in it than oldBackup.data, while the stats from the second diff imply that the two files contain the same number of lines!

I wonder if this odd output somehow relates to the bug alluded to in the release notes for the not-yet-released jsdiff 6.0.0:

diffLines with ignoreWhitespace: true will no longer ignore the insertion or deletion of entire extra lines of whitespace at the end of the text. Previously, these would not show up as insertions or deletions, as a side effect of a hack in the base diffing algorithm meant to help ignore whitespace in diffWords.

I'd appreciate it if you could share the actual content of the files that you're diffing (or, if they've got confidential stuff in them, some kind of slimmed down or censored example that exhibits the same behaviour) so I can try to figure out what's going on.

jvalrog commented 3 months ago

sure! I'll remove any sensible information and share it later

jvalrog commented 3 months ago

Was able to reproduce it, I've attached a zip with 2 backups and a simple test.js script.

The execution shows this:

612
608
609
612

[removed link]

jvalrog commented 3 months ago

could you reproduce the problem?

ExplodingCabbage commented 3 months ago

Aha! It took me a long time to spot it, but the bug is in your test script rather than jsdiff, here. Each change object has a .count property indicating how many added/removed/unchanged tokens it represents, so to count the number of changes you need to sum the .count properties, not just count the change objects as you're doing. Here's a fixed version of the script:

fs = require('fs');
diff = require('./lib');

backup1File = fs.readFileSync('/home/mark/Downloads/test/backup1.txt');
backup2File = fs.readFileSync('/home/mark/Downloads/test/backup2.txt');

backup1 = backup1File.toString();
backup2 = backup2File.toString();

changes = diff.diffLines(backup1, backup2);

console.log(
    changes.filter((c) => c.added).reduce((total, c) => total + c.count, 0)
);
console.log(
    changes.filter((c) => c.removed).reduce((total, c) => total + c.count, 0)
);

changes = diff.diffLines(backup2, backup1);

console.log(
    changes.filter((c) => c.added).reduce((total, c) => total + c.count, 0)
);
console.log(
    changes.filter((c) => c.removed).reduce((total, c) => total + c.count, 0)
);

Output now has the expected symmetry:

mark@fractal:~/jsdiff$ node test.js 
2263
2165
2165
2263
jvalrog commented 3 months ago

thank you! I missed that point then, my bad.