Open williamstein opened 1 year ago
Another comment: there is a TypeScript rewrite of diff-match-patch here: https://github.com/google/diff-match-patch/pull/74
It would be cool to create a new GitHub repo based on that rewrite, which also includes my fix(es) to dmp.js and is targeted at JavaScript only. It could also maybe have a wasm version at some point, if performance can be made better than JavaScript along some dimensions.
I was worried the TypeScript version might vanish, so I made a clone here: https://github.com/williamstein/diff-match-patch-typescript
OK, I looked into this:
This code has some easily copyable characters for testing: https://github.com/google/diff-match-patch/pull/80/files#diff-35eb933acfa17a64ac26d2b0c14fa8bb548ebdb1922321dc716491df14090442
Definitely, when making diffs/patches, our current dmp version treats the two halves of a surrogate pair as separate characters, i.e., the string s="🌈" is viewed as a string of length 2.
If you change that green Chinese character to the red one, here's the patch we get:
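To make the "length 2" point concrete, here is a minimal sketch in plain TypeScript (no dmp involved). It only shows how JavaScript strings count UTF-16 code units; the variable names are made up for illustration.

```typescript
// JavaScript strings are sequences of UTF-16 code units, so a character
// outside the Basic Multilingual Plane takes two units (a surrogate pair).
const rainbow = "\ud83c\udf08"; // the rainbow emoji, U+1F308, written as escapes

// .length counts UTF-16 code units, not user-visible characters.
const codeUnits: number = rainbow.length;

// Array.from uses the string iterator, which walks whole code points.
const codePoints: number = Array.from(rainbow).length;

// Collect each code unit as hex to see the two halves of the pair.
const units: string[] = [];
for (let i = 0; i < rainbow.length; i++) {
  units.push(rainbow.charCodeAt(i).toString(16));
}

console.log(codeUnits, codePoints, units); // 2 1 [ 'd83c', 'df08' ]
```

Any code that indexes or slices by `.length`, which is what a code-unit-based diff does, can therefore cut between the two halves.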
[[[[0,"\ud83c"],[-1,"\ude2f"],[1,"\ude32"]],0,0,2,2]]
I.e., it treats both strings as two characters and transforms one to the other.
And in all cases, editing in CoCalc works fine, at least if the patch applies cleanly. After all, the patch is a function that transforms one string into another, and it doesn't matter how that string is interpreted.
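As a rough illustration of why a cleanly applied patch is harmless, the sketch below replays the diff list from the patch above by hand. It assumes the inner array is a list of `[op, text]` pairs with 0 = keep, -1 = delete, 1 = insert (the diff-match-patch convention). `applyDiffs` is a hypothetical helper written for this example, not part of dmp or CoCalc.

```typescript
// One diff operation: [op, text], where op is 0 (keep), -1 (delete), 1 (insert).
type DiffOp = [number, string];

// Hypothetical helper: replay a list of diff ops against `source`.
// It works purely on UTF-16 code units, just like the patch above.
function applyDiffs(source: string, diffs: DiffOp[]): string {
  let pos = 0; // current offset into source, in code units
  let out = "";
  for (const [op, text] of diffs) {
    if (op === 1) {
      out += text; // an insertion consumes nothing from the source
      continue;
    }
    // Keep and delete must both match the source exactly ("applies cleanly").
    if (source.slice(pos, pos + text.length) !== text) {
      throw new Error("diff does not match source at offset " + pos);
    }
    pos += text.length;
    if (op === 0) out += text; // keep copies the text; delete drops it
  }
  return out + source.slice(pos);
}

const green = "\ud83c\ude2f"; // the green character, U+1F22F
const red = "\ud83c\ude32"; // the red character, U+1F232
const diffs: DiffOp[] = [[0, "\ud83c"], [-1, "\ude2f"], [1, "\ude32"]];

const result = applyDiffs(green, diffs);
console.log(result === red); // true
```

The intermediate pieces are lone surrogates, but the final string is a valid pair again, which matches the observation that clean application is safe.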
If there were a merge conflict, i.e., the patch is not applied cleanly, it seems like we could definitely end up with something weird, e.g., an invalid unicode character.
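For the conflict case, one possible safeguard is to check the result for unpaired surrogates after applying a patch. The sketch below is only an illustration under that assumption: `hasLoneSurrogate` is a hypothetical helper, and the `mangled` string is a made-up example of what a bad merge could leave behind, not output observed from CoCalc.

```typescript
// Hypothetical helper: true if `s` contains a surrogate code unit that is
// not part of a valid high+low pair.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdbff) {
      // High surrogate: the next unit must be a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xdc00 || next > 0xdfff) return true;
      i++; // skip the low half of a valid pair
    } else if (c >= 0xdc00 && c <= 0xdfff) {
      // Low surrogate with no high surrogate before it.
      return true;
    }
  }
  return false;
}

const valid = "\ud83c\ude32"; // a complete pair
const mangled = "a\ude32"; // made-up result: the high half was lost in a merge

console.log(hasLoneSurrogate(valid), hasLoneSurrogate(mangled)); // false true
```

Newer JavaScript runtimes also provide `String.prototype.isWellFormed()` for the same check; the manual loop is used here only so the sketch does not depend on an ES2024 runtime.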
So this problem does in fact potentially impact us, but ONLY when there are two people editing the same text at the same time, when there would likely be something mangled anyways.
It seems that some users of dmp, who use the algorithm differently, always run into huge problems due to this bug. But not us.
Still, I definitely would like to fix this someday.
@rgbkrk I finally looked into this.
I don't know whether, or to what extent, this might impact CoCalc. I've so far never been aware of such an issue. Maybe (?) when CoCalc hits it, an error is thrown, and our diff algorithm generates a very large diff that is just "replace the entire document by this other one", so for us things are not efficient, but not broken either. I don't know. It also might be very unlikely to hit in the context of Jupyter notebooks, where most text is ASCII, and markdown, where we usually write emojis as :thing: instead of Unicode.
In any case, I'll look into this and report back here. Since the original author of DMP doesn't maintain it anymore, it could also make sense to try to modernize the library and make a new, independent, supported version that contains fixes for the above issue. As some motivation, the @cocalc/util package has a copy of dmp with at least one important bugfix (from my point of view). That's in https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/dmp.js
NOTE: the file dmp.js had a GPL header applied to it by some automated script that @haraldschilly wrote. However, I just fixed that and reverted the license header back to the original Apache V2 license.