jupyter / jupyter_client

Jupyter protocol client APIs
https://jupyter-client.readthedocs.io
BSD 3-Clause "New" or "Revised" License
388 stars 283 forks source link

incorrect cursor_pos for completion requests with non-BMP Unicode characters #259

Closed stevengj closed 7 years ago

stevengj commented 7 years ago

As discussed in JuliaLang/IJulia.jl#541, it appears that Jupyter is sending an incorrect cursor_pos for completion requests when the string contains non-BMP Unicode characters.

For example, if I try to tab-complete the a cell containing 𝐚𝐚𝐚𝐚𝐚, it sends cursor_pos=10 even though the string has just 5 characters.

For these these non-BMP characters, the UTF-16 encoding used by JavaScript uses two 16-bit code units per character, rather than one as for most characters. So, maybe JavaScript's Unicode encoding is leaking into the messages (contrary to the spec, which explicitly says that cursor_pos is in "unicode characters").

Carreau commented 7 years ago

Hum, thanks for catching this. You would have asked me, I would have said it was bytes, not characters.

(BTW heard good things about the talk you gave today, seem like you never stop to code , even during conferences. )

minrk commented 7 years ago

This is indeed a bug in CodeMirror itself, where getCursor() counts 𝐚 as having length 2. Most other multi-byte characters I have tried correctly have length 1 (e.g. é or あ). I think CodeMirror just needs to learn about some more combining characters. From this issue, CodeMirror has to handle all combining characters, etc. itself since the underlying javascript is no help. My guess is this is just a unicode combining-range that CM doesn't know about yet.

stevengj commented 7 years ago

@minrk, I think you are confusing combining characters with surrogate pairs.

JavaScript (hence CodeMirror) uses the UTF-16 encoding to store Unicode strings: each string is stored as an array of 16-bit "code units", so every character is "multi-byte". Originally, people thought 16 bits would be enough to encode all of Unicode, and hence each code unit would equal one character. Unfortunately, people eventually realized that 16 bits was not enough, and newer characters (non-BMP characters) are encoded as two 16-bit code units (a "surrogate pair").

Both (U+3042) and é (U+00e9) are in the BMP, and hence encode as one 2-byte code unit, whereas 𝐚 (U+01d41a) is a newer addition (non-BMP) and requires two 2-byte code units (a surrogate pair) in UTF-16.

String indices and string lengths in JavaScript are measured in 16-bit code units, not characters, and apparently this is what CodeMirror is reporting.

stevengj commented 7 years ago

(This is one of the reasons why UTF-16 sucks: non-BMP characters are rare enough that bugs like this go undetected for a long time, especially since most programmers don't understand the encoding.)

stevengj commented 7 years ago

Fortunately, it is simple to write a conversion routine that calculates character indices from UTF-16 code-unit indices and vice versa. I will do that in IJulia for now as a workaround, and presumably you will want to do something like that in Jupyter as well. (Or in CodeMirror? But since JavaScript "natively" wants to use UTF-16 indices, they may be reluctant to change.)

stevengj commented 7 years ago

Note that Python has a mess here, too: in Python 2, len(u'𝐚𝐚𝐚𝐚𝐚') is 10 (on systems where Python uses UTF-16), while in Python 3.3+ len(u'𝐚𝐚𝐚𝐚𝐚') is 5.

minrk commented 7 years ago

@stevengj thanks for clarifying! I think it makes the most sense for the protocol spec to be 'characters', so if CodeMirror indices are returning UTF-16 units, it ought to be the responsibility of the Jupyter javascript to deal with surrogate pairs and turn that into actual character offsets.

minrk commented 7 years ago

Should be fixed by https://github.com/jupyter/notebook/pull/2509 if I understood the spec correctly. I was able to reproduce these errors using the IPython kernel with 𝐚𝐚𝐚𝐚𝐚, which that PR fixes.

minrk commented 7 years ago

5.2 spec is published with this in jupyter-client 5.1.