kpdecker / jsdiff

A JavaScript text differencing implementation.
BSD 3-Clause "New" or "Revised" License

Suboptimal (O(D^2)) memory usage #396

Open quark-zju opened 1 year ago

quark-zju commented 1 year ago

It seems https://github.com/kpdecker/jsdiff/blob/3b654c2ed7d5262ed9946de841ad8dae990286c7/src/diff/base.js does not implement "4b. A Linear Space Refinement" from Myers' paper, so the memory usage is O(D^2). From the code, the actual complexity looks like O(len(bestPath) * len(components in bestPath)).

This makes diff calculation impractically slow when there are many changes. For example, comparing the prettified Hearthstone cards.json across two languages does not complete within 20 minutes using jsdiff:

Reproduce:

```bash
# Prepare test data
wget https://api.hearthstonejson.com/v1/168129/enUS/cards.json -O 1
wget https://api.hearthstonejson.com/v1/168129/zhCN/cards.json -O 2
python3 -m json.tool < 1 > a
python3 -m json.tool < 2 > b
```

```js
// a.js - Calculate diff
const fs = require('fs');
const diff = require('diff');

const a = fs.readFileSync('a', {encoding: 'utf-8'});
const b = fs.readFileSync('b', {encoding: 'utf-8'});

console.log(`pid: ${process.pid}`);
console.log(diff.diffLines(a, b).length);
```

The profiler shows that Node.js spends ~70% of its time in GC:

[screenshot: Node.js profiler output showing ~70% of time spent in GC]

As a comparison, it seems diff-match-patch implements the linear space refinement and can complete the above test case in ~24s. (For reference, git diff --stat --minimal completes in ~0.3s.)
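For reference, a comparison harness along these lines should work (not the exact script I ran; it assumes the diff-match-patch npm package and the same `a`/`b` files prepared above):

```js
// Hypothetical harness for the diff-match-patch comparison (assumes the
// diff-match-patch npm package and the same `a`/`b` files prepared above).
const fs = require('fs');
const DiffMatchPatch = require('diff-match-patch');

const a = fs.readFileSync('a', {encoding: 'utf-8'});
const b = fs.readFileSync('b', {encoding: 'utf-8'});

const dmp = new DiffMatchPatch();
dmp.Diff_Timeout = 0; // disable the time-based cutoff so the full diff runs

console.time('diff-match-patch');
const diffs = dmp.diff_main(a, b);
console.timeEnd('diff-match-patch');
console.log(diffs.length);
```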

tomByrer commented 1 year ago

I'm curious to see a better algorithm...

ExplodingCabbage commented 8 months ago

Hmmmmm...

I definitely think it makes sense to implement a version of the algorithm based on the linear space refinement in which we explore the edit graph in both directions (from the top-left, and from the bottom-right) until they join up in the middle. This will have the practical effect of making jsdiff faster and reducing its memory usage (though without, I think, reducing its worst-case time or space complexity).

I think the actual linear space refinement proposed by Myers is different to (and cleverer than) this, though, and I'm not convinced of the wisdom of fully implementing it; certainly I think it shouldn't be on by default. If I'm understanding it correctly - and it took me a while to get this - the key idea of the refinement is that you can navigate the edit graph from both ends without keeping track of the paths you've taken, only how far each has got in each diagonal, until the two paths hit each other somewhere in the middle of the edit graph. The snake they hit each other on is then guaranteed to be the middle snake of an optimal diff. The way you compensate for not having kept a record of the paths you took is by recursively running the algorithm again on the regions before and after the middle snake to find *their* middle snakes, and so on recursively, until all the middle snakes you've found together amount to a full path through the edit graph. Effectively you're repeating lots of work, roughly doubling the time taken in some cases, to compensate for not keeping a record of the work you've done.
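To make the structure of that recursion concrete, here's a rough sketch of the shape it would take (`findMiddleSnake` is a hypothetical helper standing in for the bidirectional, paths-not-recorded search described above; this is not working jsdiff code):

```js
// Rough shape of the linear space refinement: divide and conquer on the
// middle snake. `findMiddleSnake` is a hypothetical helper that runs the
// forward and reverse searches until they overlap, using O(N + M) space,
// and returns the endpoints of the snake where they meet.
function diffLinearSpace(a, b, out = []) {
  if (a.length === 0) {
    if (b.length > 0) out.push({added: true, value: b});
    return out;
  }
  if (b.length === 0) {
    out.push({removed: true, value: a});
    return out;
  }
  // {xStart, yStart, xEnd, yEnd} of the middle snake in the edit graph.
  // (The full algorithm also special-cases small D to guarantee termination.)
  const snake = findMiddleSnake(a, b);
  // Recurse on the region before the middle snake...
  diffLinearSpace(a.slice(0, snake.xStart), b.slice(0, snake.yStart), out);
  // ...emit the snake itself (a run of equal elements)...
  if (snake.xEnd > snake.xStart) {
    out.push({value: a.slice(snake.xStart, snake.xEnd)});
  }
  // ...then recurse on the region after it.
  diffLinearSpace(a.slice(snake.xEnd), b.slice(snake.yEnd), out);
  return out;
}
```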

From a computer scientist's perspective, judging performance only in terms of time and space complexity, this is a brilliant trick that gets the space complexity down without worsening the time complexity. But pragmatically speaking, I'm not convinced there's any point? I have encountered situations where jsdiff basically hung "forever" trying to compute a large diff, but never any situation where it ran out of memory or even used a nontrivial amount of my computer's memory. Any tradeoff that involves making jsdiff slower in order to reduce its memory requirements intuitively feels like a bad tradeoff to me (though I am open to being persuaded otherwise). I'd be hesitant to accept a contribution adding this even as an optional off-by-default feature without some persuasion that it's worth it.

(Do note that the perf issues around clonePath that you highlight are already basically fixed by a separate change, https://github.com/kpdecker/jsdiff/pull/411.)

ExplodingCabbage commented 8 months ago

I'm gonna label this as planned for next release with the intent that I'll implement the idea of navigating the edit graph from both ends, but not actually implement the linear space algorithm.

quark-zju commented 8 months ago

Part of the space complexity can manifest as time complexity via the GC. The hung-"forever" cases might actually be caused by the space complexity. I didn't observe an "out of memory" error either.

The Hearthstone cards.json example is a simplified version of a slow production diff case we observed in the past. Both involve large generated JSON files. For that specific production case, while I cannot share the file contents, I can share some numbers: even git diff without --minimal took 5 seconds.

ExplodingCabbage commented 7 months ago

> Part of the space complexity can manifest as time complexity via the GC

No, I don't think so - or at least, I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC. Every change object creation is already accounted for in the time complexity, and there are only as many object garbage collections as there are object creations.

Again, do also note that https://github.com/kpdecker/jsdiff/pull/411 (merged to master but not yet released to npm) has drastically reduced the amount of garbage collection needed and completely eliminated clonePath.

quark-zju commented 7 months ago

> I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC

For example, if you append N elements to a linked list, that's O(N) time, since each append is O(1). However, if for some reason the GC pauses O(N) times and has to scan the whole linked list on each pause, the GC alone contributes O(N^2) time.
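A contrived Node.js illustration of that shape of problem (nothing to do with jsdiff's actual data structures):

```js
// Contrived illustration: building the list is O(N) allocations, but each
// major GC pause has to traverse every live node while marking. If pauses
// happen O(N) times over the course of the loop, the GC alone does O(N^2)
// work even though the "algorithm" itself is linear.
function buildList(n) {
  let head = null;
  for (let i = 0; i < n; i++) {
    head = {value: i, next: head}; // O(1) prepend
  }
  return head;
}

console.time('build');
const list = buildList(10_000_000);
console.timeEnd('build');
console.log(list.value);
```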

ExplodingCabbage commented 7 months ago

Ah, interesting. Yeah, that makes sense; I was assuming the cost of garbage collection would be proportional to the number of objects in need of collecting but of course that's not the case because of the cost of finding the items that can be collected, which I was completely ignoring. Hmm...

ExplodingCabbage commented 7 months ago

I think I'm persuaded that it'd make sense from a performance perspective to make this change, but one thing still gives me pause, which is that the linear space refinement from the paper gives different results to the core Myers algorithm. Consider diffing old text A against new text AAA, for instance. With the core Myers diff algorithm, we should keep the first A and then insert two As after it. But now consider following this algorithm for finding the middle snake:

[screenshot: the "find middle snake" procedure from Myers' paper]

Here's our edit graph:

[diagram: edit graph for old text A vs new text AAA]

At D=0, the forward and reverse paths both just preserve one A (i.e. move diagonally):

[diagram: forward and reverse paths after D=0, each a single diagonal move]

Then when D=1, we do an insertion on the forward path and similarly a vertical move on the reverse path. (I'm omitting paths that go off the edge of the edit graph, although technically we also do those if we're following the algorithm faithfully.)

[diagram: forward and reverse paths after D=1]

Now the forward path is at (x, y) = (1, 2) and the reverse path is at (u, v) = (0, 1), satisfying the definition of the paths "overlapping" in diagonal k=-1. (Note we use D = 2 when we apply this definition, even though the variable D in the algorithm is 1 at this point; they're not the same D.)

[diagram: the forward path at (1, 2) and the reverse path at (0, 1), overlapping in diagonal k=-1]

But now, per the algorithm, "The last snake of the reverse path is the middle snake." - i.e. the empty snake from (0, 1) to (0, 1). We're thus now forced to build a 2-path that goes through (0, 1), even though the core Myers algorithm would not go through that point.

I was slightly doubting my own interpretation, but the interactive visualisation at https://blog.robertelder.org/diff-algorithm/ agrees:

[screenshot: the interactive visualisation arriving at the same middle snake at (0, 1)]

So if we made this change we'd be modifying the results jsdiff returns. Do you see a tweak to the published refinement that would let us avoid this?
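To be concrete about what would change, here's roughly what I'd expect (the exact change-object shape varies between jsdiff versions, so treat this as illustrative):

```js
const {diffChars} = require('diff');

// What jsdiff returns today (core Myers): keep the first A, then insert two.
console.log(diffChars('A', 'AAA'));
// => roughly [ { value: 'A', count: 1 }, { value: 'AA', count: 2, added: true } ]

// A middle-snake-based implementation forced through (0, 1), as in the
// walkthrough above, would instead produce something like:
// => [ { value: 'A', count: 1, added: true },
//      { value: 'A', count: 1 },
//      { value: 'A', count: 1, added: true } ]
```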

quark-zju commented 7 months ago

Git's xdiff does some post-processing (e.g. shifting diff chunks up or down, merging adjacent diff chunks) to "normalize" the results.
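Roughly in that spirit (a heavily simplified sketch, not Git's actual xdiff code), the "shift a chunk down" part looks something like this:

```js
// Heavily simplified sketch of "shift a chunk down" normalization. `hunk` is
// a hypothetical shape: `lines` inserted before index `start` of the common
// (context) sequence `context`. An insertion can be expressed one line lower
// whenever its first line equals the context line it sits in front of, so
// repeatedly rotating it down makes equivalent diffs converge on one form.
function slideHunkDown(context, hunk) {
  let {start, lines} = hunk;
  while (lines.length > 0 && start < context.length && lines[0] === context[start]) {
    lines = [...lines.slice(1), lines[0]]; // rotate the first line to the end
    start += 1;
  }
  return {start, lines};
}

// Example: inserting ['A', 'A'] before the 'A' in ['A'] normalizes to
// inserting it after -- i.e. the "keep first, insert after" form.
console.log(slideHunkDown(['A'], {start: 0, lines: ['A', 'A']}));
// => { start: 1, lines: [ 'A', 'A' ] }
```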

ExplodingCabbage commented 7 months ago

Hmm. I wonder if that consistently achieves results identical to the core Myers diff algorithm, though? (It's not immediately obvious to me whether there's any reasonable way to achieve that at all.) Would need to scrutinise the post-processing logic carefully...