stevengj closed this issue 7 years ago
Hmm, thanks for catching this. If you had asked me, I would have said it was bytes, not characters.
(BTW, I heard good things about the talk you gave today; it seems like you never stop coding, even during conferences.)
This is indeed a bug in CodeMirror itself, where getCursor() counts 𝐚 as having length 2. Most other multi-byte characters I have tried correctly have length 1 (e.g. é or あ). I think CodeMirror just needs to learn about some more combining characters. Judging from this issue, CodeMirror has to handle all combining characters, etc. itself, since the underlying JavaScript is no help. My guess is this is just a Unicode combining range that CM doesn't know about yet.
@minrk, I think you are confusing combining characters with surrogate pairs.
JavaScript (hence CodeMirror) uses the UTF-16 encoding to store Unicode strings: each string is stored as an array of 16-bit "code units", so every character is "multi-byte". Originally, people thought 16 bits would be enough to encode all of Unicode, and hence each code unit would equal one character. Unfortunately, people eventually realized that 16 bits was not enough, and newer characters (non-BMP characters) are encoded as two 16-bit code units (a "surrogate pair").
Both あ (U+3042) and é (U+00E9) are in the BMP, and hence encode as one 2-byte code unit, whereas 𝐚 (U+1D41A) is a newer addition (non-BMP) and requires two 2-byte code units (a surrogate pair) in UTF-16.
String indices and string lengths in JavaScript are measured in 16-bit code units, not characters, and apparently this is what CodeMirror is reporting.
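To make this concrete, here is roughly how these characters behave at a JavaScript console (an illustration, not code from CodeMirror or the notebook):

```javascript
// Lengths are in UTF-16 code units, not characters:
"é".length;   // 1  (U+00E9, BMP: one code unit)
"あ".length;  // 1  (U+3042, BMP: one code unit)
"𝐚".length;   // 2  (U+1D41A, non-BMP: a surrogate pair of two code units)

// Indexing is also by code unit: the two "halves" of 𝐚 are surrogates,
// not characters.
"𝐚".charCodeAt(0).toString(16);  // "d835" (high surrogate)
"𝐚".charCodeAt(1).toString(16);  // "dc1a" (low surrogate)
```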
(This is one of the reasons why UTF-16 sucks: non-BMP characters are rare enough that bugs like this go undetected for a long time, especially since most programmers don't understand the encoding.)
Fortunately, it is simple to write a conversion routine that calculates character indices from UTF-16 code-unit indices and vice versa. I will do that in IJulia for now as a workaround, and presumably you will want to do something like that in Jupyter as well. (Or in CodeMirror? But since JavaScript "natively" wants to use UTF-16 indices, they may be reluctant to change.)
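For illustration, a minimal sketch of such a conversion in JavaScript (not the actual IJulia or notebook code; the function names here are made up):

```javascript
// Number of Unicode characters (code points) in str[0 .. codeUnitIndex),
// given an offset measured in UTF-16 code units.
function charIndexFromCodeUnit(str, codeUnitIndex) {
  let chars = 0;
  for (let i = 0; i < codeUnitIndex; ) {
    // Non-BMP code points occupy two code units (a surrogate pair).
    i += str.codePointAt(i) > 0xffff ? 2 : 1;
    chars += 1;
  }
  return chars;
}

// Inverse: UTF-16 code-unit offset corresponding to a character offset.
function codeUnitIndexFromChar(str, charIndex) {
  let i = 0;
  for (let chars = 0; chars < charIndex; chars++) {
    i += str.codePointAt(i) > 0xffff ? 2 : 1;
  }
  return i;
}

// Example: charIndexFromCodeUnit("𝐚𝐚𝐚𝐚𝐚", 10) === 5
//          codeUnitIndexFromChar("𝐚𝐚𝐚𝐚𝐚", 5) === 10
```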
Note that Python has a mess here, too: in Python 2, len(u'𝐚𝐚𝐚𝐚𝐚') is 10 (on systems where Python uses UTF-16), while in Python 3.3+, len(u'𝐚𝐚𝐚𝐚𝐚') is 5.
@stevengj thanks for clarifying! I think it makes the most sense for the protocol spec to be 'characters', so if CodeMirror indices are returning UTF-16 units, it ought to be the responsibility of the Jupyter javascript to deal with surrogate pairs and turn that into actual character offsets.
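As a rough sketch of what that frontend responsibility amounts to (assuming `cm` is a CodeMirror 5 editor instance and reusing the hypothetical charIndexFromCodeUnit helper sketched above; the actual notebook code may differ):

```javascript
// Convert CodeMirror's code-unit cursor offset into a character offset
// before it is sent as cursor_pos in the completion request.
const code = cm.getValue();
const codeUnitOffset = cm.indexFromPos(cm.getCursor());        // UTF-16 code units
const cursor_pos = charIndexFromCodeUnit(code, codeUnitOffset); // characters
```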
Should be fixed by https://github.com/jupyter/notebook/pull/2509 if I understood the spec correctly. I was able to reproduce these errors using the IPython kernel with 𝐚𝐚𝐚𝐚𝐚, which that PR fixes.
The 5.2 spec, including this clarification, is published in jupyter-client 5.1.
As discussed in JuliaLang/IJulia.jl#541, it appears that Jupyter is sending an incorrect cursor_pos for completion requests when the string contains non-BMP Unicode characters. For example, if I try to tab-complete a cell containing 𝐚𝐚𝐚𝐚𝐚, it sends cursor_pos=10 even though the string has just 5 characters. For these non-BMP characters, the UTF-16 encoding used by JavaScript uses two 16-bit code units per character, rather than one as for most characters. So maybe JavaScript's Unicode encoding is leaking into the messages (contrary to the spec, which explicitly says that cursor_pos is in "unicode characters").