is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? Yes, but not much.

williamstein commented 1 year ago

"Surrogate pairs are strings that contain a supplemental code point (especially emojis) that cause diff indices to be offset. It can either mess up the text or cause DMP to error (within toDelta/fromDelta)." See

I don't know whether, or to what extent, this might impact cocalc. I've so far never been aware of such an issue. Maybe (?) when CoCalc hits it, an error is thrown, and our diff algorithm generates a very large diff that is just "replace the entire document by this other one", so for us things are not efficient, but not broken either. I don't know. It also might be very unlikely to hit in the context of Jupyter notebooks, where most text is ascii, and markdown where we usually write emojis as :thing: instead of unicode.

In any case, I'll look into this and report back here. Since the original author of DMP doesn't maintain it anymore, it could also make sense to try to modernize the library and make a new independent supported version, which contains fixes for the above issue. As some motivation, the @cocalc/util package has a copy of dmp with at least one important bugfix (from my point of view). That's in https://github.com/sagemathinc/cocalc/blob/master/src/packages/util/dmp.js

NOTE: the file dmp.js had a GPL header applied to it by some automated script that @haraldschilly wrote. However, I just fixed that and reverted the license header back to the original Apache V2 license.

williamstein commented 1 year ago

Another comment -- there is a typescript rewrite of diff-match-patch here: https://github.com/google/diff-match-patch/pull/74

It would be cool to create a new github repo based on that, which has my fix(es) to dmp.js in it, plus that, and is targeted at Javascript only. It could also maybe have one wasm version at some point, if performance can be made better than javascript along some dimensions.

williamstein commented 1 month ago

I was worried the typescript version might vanish, so I made a clone here: https://github.com/williamstein/diff-match-patch-typescript

williamstein commented 1 month ago

OK, I looked into this:

This code has some easily copyable characters for testing: https://github.com/google/diff-match-patch/pull/80/files#diff-35eb933acfa17a64ac26d2b0c14fa8bb548ebdb1922321dc716491df14090442
- E.g., 🈯 and 🈲 and 🌈
- Notice how in Javascript these surrogate pairs are treated (i.e., as two separate characters):
Definitely when making diffs/patches, our current dmp version does treat the two surrogate pairs as just separate characters, i.e., the string s="🌈" is viewed as a string of length 2.
If you change that green chinese character to the red one, here's what patch we get:
```
[[[[0,"\ud83c"],[-1,"\ude2f"],[1,"\ude32"]],0,0,2,2]]
```

I.e., it treats both strings as two characters and transforms one to the other.

And in all cases editing in cocalc this works fine, at least if the patch applies cleanly . After all the patch is a function to transform one string to another, and it doesn't matter how that string is interpreted.

If there were a merge conflict, i.e., the patch is not applied cleanly, it seems like we could definitely end up with something weird, e.g., an invalid unicode character.

So this problem does in fact potentially impact us, but ONLY when there are two people editing the same text at the same time, when there would likely be something mangled anyways.

It seems like for some users of dmp who use the algorithm differently, they always run into huge problems due to this bug. But not us.

Still, I definitely would like to fix this someday.

williamstein commented 1 month ago

@rgbkrk I finally looked into this.

sagemathinc / cocalc

is our realtime sync implementation for strings impacted by an issue with surrogate pairs in diff match patch? Yes, but not much. #6327