google / diff-match-patch

Diff Match Patch is a high-performance library in multiple languages that manipulates plain text.
Apache License 2.0
7.25k stars 1.09k forks source link

Javascript line diff breaks beyond 65K lines #54

Open tkruse opened 5 years ago

tkruse commented 5 years ago

I try using The google diff-match-path library from nodejs for line diffs: https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs. I get wrong patches when in sum the lines of both inputs goes beyond 65,536 (2^16) lines.

Is that a bug (in my code or diff-match-patch), or am I hitting a known limitation of javascript/nodejs? Anything I can do to use d-m-p with larger files?

This script reproduces the problem

var diff_match_patch = require("diff-match-patch")

// function copied from google wiki 
// https://github.com/google/diff-match-patch/wiki/Line-or-Word-Diffs
function diff_lineMode(text1, text2) {
  var dmp = new diff_match_patch();
  var a = dmp.diff_linesToChars_(text1, text2);
  var lineText1 = a.chars1;
  var lineText2 = a.chars2;
  var lineArray = a.lineArray;
  var diffs = dmp.diff_main(lineText1, lineText2, false);
  dmp.diff_charsToLines_(diffs, lineArray);
  return diffs;
}

// reproduce problem by diffing string with many lines to "abcd"
for (let size = 65534; size < 65538; size += 1) {
  let text1 = "";
  for (let i = 0; i < size; i++) {
    text1 += i + "\n";
  }

  var patches = diff_lineMode(text1, "abcb")
  console.log("######## Size: " + size + ": patches " + patches.length)
  for (let i = 0; i < patches.length; i++) {
    // patch[0] is action, patch[1] is value
    var action = patches[i][0] < 0 ? "remove" : (patches[i][0] > 0 ? "add" : "keep")
    console.log("patch" + i + ": " + action + "\n" + patches[i][1].substring(0, 10))
  }
}

Giving these outputs (using substring in code above to shorten outputs):

######## Size: 65534: patches 2
patch0: remove
0
1
2
3
4

patch1: add
abcb
######## Size: 65535: patches 2
patch0: remove
0
1
2
3
4

patch1: add

######## Size: 65536: patches 2
patch0: keep
0

patch1: remove
1
2
3
4
5

######## Size: 65537: patches 3
patch0: remove
0

patch1: keep
1

patch2: remove
2
3
4
5
6

Using

$ node --version v6.3.1
cat package.json
{
  "name": "dmp_bug",
  "version": "1.0.0",
  "description": "reproduce issue with diff match patch",
  "main": "dmpbug.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "diff-match-patch": "^1.0.4"
  }
}
NeilFraser commented 5 years ago

The line/word patch mechanism presented here is to encode each unique line or word as a unique Unicode character. In ES5 only the first 16 bits are accessible (using String.fromCharCode).

However, from ES6, one can use String.fromCodePoint which should allow for 21 bits: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/fromCodePoint

Try replacing String.fromCharCode with String.fromCodePoint and charCodeAt with codePointAt.

tkruse commented 5 years ago

Thanks, for the quick reply. I wonder if the library could detect an overflow of unique lines and error out instead of producing invalid patches. Feel free to close if nothing is to be done about it.

NeilFraser commented 5 years ago

I'll leave this issue open, with the intention of adding a check for the existence of fromCodePoint and codePointAt. This way it will still run on ES5, but will have added capability if run on ES6+.

tkruse commented 5 years ago

Also note that e.g. the java and python implementation seem to suffer the same problem, and they both have similar handling:

python2 https://github.com/google/diff-match-patch/blob/895a9512bbcee0ac5a8ffcee36062c8a79f5dcda/python2/diff_match_patch.py#L429

 if len(lineArray) == maxLines:
        # Bail out at 65535 because unichr(65536) throws.
        line = text[lineStart:]
        lineEnd = len(text)
lineArray.append(line)

java https://github.com/google/diff-match-patch/blob/895a9512bbcee0ac5a8ffcee36062c8a79f5dcda/java/src/name/fraser/neil/plaintext/diff_match_patch.java#L547

if (lineArray.size() == maxLines) {
      // Bail out at 65535 because
      // String.valueOf((char) 65536).equals(String.valueOf(((char) 0)))
      line = text.substring(lineStart);
      lineEnd = text.length();
}

Same for dart.

kevin-lindsay-1 commented 3 years ago

@NeilFraser can the fromCodePoint and codePointAt solution be added as an option? I use this library via npm, and that sounds like a great option to have.

AvshT commented 4 months ago

@NeilFraser Thank you for your solution. It actually works with up to 2^21 file rows. By simply replacing String.fromCharCode with String.fromCodePoint and String.charCodeAt with String.codePointAt, and changing maxLines to 2M, the magic happens, and big file comparisons are now functional.

@NeilFraser or @google-admin, considering that all browsers support ES6, could you please verify and release a fix for the rest of the world?

Note: Upon reviewing, your diff implementation appears to be the fastest on the market. Thanks again.