jupyter-server / pycrdt

CRDTs based on Yrs.
https://jupyter-server.github.io/pycrdt
MIT License

Fix Index Conversion from Text to TextRef #129

Open jbdyn opened 2 months ago

jbdyn commented 2 months ago

Hey @davidbrochart :wave:

I played around with some emojis in Text and noticed that insertion works differently than expected:

🐍 test script

```python
from pycrdt import Doc, Text

## setup
ydoc = Doc()
ytext = Text()
ydoc["text"] = ytext

state = ""  # track state of ytext

def callback(event):
    """Print change record"""
    global state
    new_state = str(event.target)
    delta = str(event.delta)
    print(f"{delta}: '{state}' -> '{new_state}'")
    # update current state
    state = new_state

ytext.observe(callback)

## Manipulate Text
print("Insert and delete single emoji '🌴'")

# works as expected
ytext.insert(0, "🌴")
assert state == "🌴"

# given index is for Unicode code points
# but callback returns length of individual bytes in delta
del ytext[0:1]
assert state == ""

print("\nInsert '🌴abcde' sequentially")
for c, char in enumerate("🌴abcde"):
    ytext.insert(c, char)
assert state == "🌴abcde"
```
Insert and delete single emoji '🌴'
[{'insert': '🌴'}]: '' -> '🌴'
[{'delete': 4}, {'insert': ''}]: '🌴' -> ''

Insert '🌴abcde' sequentially
[{'insert': '🌴'}]: '' -> '🌴'
[{'retain': 4}, {'insert': 'a'}]: '🌴' -> '🌴a'
[{'retain': 4}, {'insert': 'b'}]: '🌴a' -> '🌴ba'
[{'retain': 4}, {'insert': 'c'}]: '🌴ba' -> '🌴cba'
[{'retain': 4}, {'insert': 'd'}]: '🌴cba' -> '🌴dcba'
[{'retain': 5}, {'insert': 'e'}]: '🌴dcba' -> '🌴decba'

In the Python code, one gives the index in Unicode code points; however,

TextRef structure internally uses UTF-8 encoding and its length is described in a number of bytes rather than individual characters

[source]
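The mismatch is easy to see with plain Python: a single emoji is one code point but several UTF-8 bytes.

```python
# One Unicode code point, but four UTF-8 bytes.
palm = "🌴"
print(len(palm))           # 1 (code points, what Python indices count)
print(len(palm.encode()))  # 4 (UTF-8 bytes, what TextRef counts)
```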

So with this PR, I adapt the given index to the UTF-8-encoded string:

Insert and delete single emoji '🌴'
[{'insert': '🌴'}]: '' -> '🌴'
[{'delete': 4}]: '🌴' -> ''

Insert '🌴abcde' sequentially
[{'insert': '🌴'}]: '' -> '🌴'
[{'retain': 4}, {'insert': 'a'}]: '🌴' -> '🌴a'
[{'retain': 5}, {'insert': 'b'}]: '🌴a' -> '🌴ab'
[{'retain': 6}, {'insert': 'c'}]: '🌴ab' -> '🌴abc'
[{'retain': 7}, {'insert': 'd'}]: '🌴abc' -> '🌴abcd'
[{'retain': 8}, {'insert': 'e'}]: '🌴abcd' -> '🌴abcde'
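The conversion behind the fixed output above can be sketched in plain Python (a minimal sketch, not the PR's actual code; the helper name `to_byte_index` is made up):

```python
def to_byte_index(s: str, i: int) -> int:
    """Map a code-point index into s to a UTF-8 byte index."""
    return len(s[:i].encode())

# Inserting after the 4-byte palm tree yields the byte offsets
# seen as 'retain' values in the deltas above.
s = "🌴abcde"
print(to_byte_index(s, 1))  # 4
print(to_byte_index(s, 2))  # 5
print(to_byte_index(s, 3))  # 6
```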

However, I am not sure how to deal with the numbers returned in event.delta on TextEvents, as they are also based on the UTF-8 encoded form and can therefore be off relative to the Python string representation. (My use case: keeping Text in sync with the contents of the Textual TextArea widget.)

Should users handle that with their own code? Should Text try to report the numbers for the Python string representation? Or should Text be capable of handling rich text as TextRef does:

TextRef offers a rich text editing capabilities (it’s not limited to simple text operations). Actions like embedding objects, binaries (eg. images) and formatting attributes are all possible using TextRef.

[source]

I also thought about limiting Text to inserted values for which len(val) == len(val.encode()), but this does not feel right to me.

davidbrochart commented 2 months ago

Hi @jbdyn,

Thanks for looking into this. Indeed, Python strings are indexed by Unicode code points, while on the Rust side Yrs uses UTF-8. I'm not sure we should do any automatic conversion, because as you said the indices in the events are UTF-8 based. Maybe we should let the user deal with index conversion manually when dealing with these kinds of characters? For instance in your example you could do:

```python
del ytext[0:len(str(ytext).encode())]
for c, char in enumerate("🌴abcde"):
    ytext.insert(len(str(ytext).encode()), char)
```

or just:

```python
del ytext[:]
for char in "🌴abcde":
    ytext += char
```

In Yrs an offset_kind can be passed to a Doc, which can be Bytes or Utf16, but I'm not sure it would solve this issue?

jbdyn commented 2 months ago

Maybe we should let the user deal with index conversion manually when dealing with these kinds of characters?

This should be fine when the user knows that the index/slice given to Text refers to the UTF-8 encoded content and not to Unicode code points. But if not, this can become a gotcha, as Text seemingly behaves like a Python Unicode string str.

For this, however, having the raw encoded content of TextRef accessible from Python would be quite nice. With that, one does not need to manage an encoded version of the string alongside Text, nor re-encode the returned Unicode string on every insert/delete. My rationale:

Let (i, n) be the index and length, respectively, for Unicode and (j, m) for UTF-8. The tuple (i, n) is known by the user or used by the user-facing text editor; (j, m) is given in a TextEvent from TextRef. With unicode = str(ytext) as the Unicode version of TextRef, the tuples can be converted as follows:

```python
# 1: Unicode -> UTF-8
j = len(unicode[:i].encode())
m = len(unicode[i:i+n].encode())

# 2: UTF-8 -> Unicode
i = len(utf8[:j].decode())
n = len(utf8[j:j+m].decode())
```

For the second part, if utf8 were not directly accessible from Text, I would need to do

```python
# 2: UTF-8 -> Unicode
utf8 = unicode.encode()
i = len(utf8[:j].decode())
n = len(utf8[j:j+m].decode())
```

on every change. As the bytes are already there in TextRef, the user should not be forced to recompute them, right?
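The conversion formulas above can be checked with plain Python (using "🌴abcde" and the slice "ab" as an example):

```python
unicode = "🌴abcde"      # Unicode view (what Python and the editor see)
utf8 = unicode.encode()  # UTF-8 view (what TextRef sees internally)

# The slice "ab": code-point index i=1, length n=2.
i, n = 1, 2

# 1: Unicode -> UTF-8
j = len(unicode[:i].encode())     # 4 (the palm tree is 4 bytes)
m = len(unicode[i:i+n].encode())  # 2

# 2: UTF-8 -> Unicode (round-trips back to the original tuple)
i2 = len(utf8[:j].decode())
n2 = len(utf8[j:j+m].decode())
assert (i2, n2) == (i, n)
print(j, m)  # 4 2
```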

In Yrs an offset_kind can be passed to a Doc, which can be Bytes or Utf16, but I'm not sure it would solve this issue?

It seems not to. There is an example for the usage of OffsetKind. When I replace ★ (3 bytes, single char width) with 🌴 (4 bytes, double char width), the indexing in TextRef::insert does not line up with the indexing in Python, as shown here with the position of the !:

```
// in Rust
"Hi ★! to you"
 ----^ index 4

"Hi 🌴! to you"
 -----^ index 5
```

In Python strings, however, the ! is in both cases on index 4:

```python
star = "Hi ★! to you"
palm = "Hi 🌴! to you"

assert star[4] == "!"
assert palm[4] == "!"
```
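The Rust indices above are consistent with counting UTF-16 code units, which can be reproduced in Python (a small sketch; `utf-16-le` encodes each code unit as 2 bytes, and the helper name `utf16_units` is made up):

```python
def utf16_units(s: str) -> int:
    """Number of UTF-16 code units in s (2 bytes each in utf-16-le)."""
    return len(s.encode("utf-16-le")) // 2

# ★ is one UTF-16 code unit; 🌴 is a surrogate pair (two units),
# so the index of '!' differs between the two strings.
print(utf16_units("Hi ★"))   # 4
print(utf16_units("Hi 🌴"))  # 5
```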
jbdyn commented 2 months ago

Something else crossed my mind:

Instead of converting the indices via len(...encode()/decode()) on every change, one could maintain and update an array of chunk ranges. That way, the index given to Text is a "chunk index", and less encoding/decoding is needed (only for encoding inserted values, as far as I can see). So "Hi 🌴! to you" could be represented as

```python
chunks = [
    (0, 1),
    (1, 2),
    (2, 3),
    (3, 7),  # <- 🌴
    (7, 8),
    (8, 9),
    (9, 10),
    (10, 11),
    (11, 12),
    (12, 13),
    (13, 14),
    (14, 15),
]
```

Having that, we could do the following index conversion for single chunks

```python
j, j_end = chunks[i]
m = j_end - j

# (j, m) corresponds to a specific chunk anyway, because the user or
# user-facing text editor also only operates on chunks
i = chunks.index((j, j + m))
n = 1
```

This implementation is highly memory-inefficient, but could be made sparse. It also does not account for slices spanning multiple or nested chunks, but that should be solvable too.
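For a plain string, such a chunk table could be derived like this (a sketch for illustration; the helper name `build_chunks` is made up, and a real implementation would update the table incrementally rather than rebuild it):

```python
def build_chunks(s: str) -> list[tuple[int, int]]:
    """UTF-8 byte range of each code point in s."""
    chunks = []
    pos = 0
    for ch in s:
        width = len(ch.encode())
        chunks.append((pos, pos + width))
        pos += width
    return chunks

chunks = build_chunks("Hi 🌴! to you")
print(chunks[3])             # (3, 7) -- the 4-byte palm tree
print(chunks.index((3, 7)))  # 3
```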

I can imagine two ways of getting and keeping chunks in sync with the content of Text/TextRef:

  1. Tree-Sitter might be suitable here as it can perform partial parsing and the queries can be defined by the user so that there is the possibility to have a custom "rich text codec" (How is formatting defined? Where does an image start and end? etc.).
  2. TextRef returns indices for Diffs, which are basically the chunks we are looking for (?).

This approach might seem a bit involved, but I still find it tempting to implement and at least wanted to share the idea.