Single Unicode Mathematical Letter counted as two

VSCode Version: 1.41.1
OS Version: Darwin x64 19.2.0 (macOS Catalina 10.15.2)

Steps to Reproduce:

Create new document
Input a single unicode mathematical script small l: 𝓁
Invoke Go to line... to go to :1:2 (character 2 of line 1).
The cursor should be placed one character after 𝓁, but it is actually placed before it!

Does this issue occur when all extensions are disabled?: Yes.

The Range type and the setDecorations function in the VS Code API also seem to work incorrectly with such letters. Since such math letters are widely used in proof assistants such as Agda and Lean, this makes it really difficult to implement extensions for them. Indeed, I found this bug while implementing an Agda client for VS Code, which uses location information provided by the compiler to do correct syntax colouring (since Agda has a really flexible syntax extension mechanism, we need to communicate with the compiler to get precise syntax highlighting).

It seems that Unicode math characters with the following kinds of character names suffer from this issue:

Mathematical Bold Capital/Small letters (𝐀, 𝐁, 𝐂, ..., 𝐚, 𝐛, 𝐜, ...)
Mathematical Italic Capital/Small letters (𝐴𝐵𝐶... 𝑎𝑏𝑐...)
Mathematical Bold Italic Capital/Small letters (𝑨𝑩𝑪... 𝒂𝒃𝒄...)
Mathematical Script Capital/Small letters (𝒜𝒞𝒟... 𝒶𝒷𝒸...)

@konn

The vscode API uses UTF16 offsets for Position and Range. These have been chosen because they reflect the offsets in native JavaScript strings.

For example, 𝓁, 'MATHEMATICAL SCRIPT SMALL L' (U+1D4C1), is represented in the following way in JavaScript strings (UTF16):

console.log(`𝓁`.length); // 2
console.log(`𝓁`.charCodeAt(0)); // 55349 --- 0xD835 
console.log(`𝓁`.charCodeAt(1)); // 56513 --- 0xDCC1
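For reference, the two code units above can be derived from the code point by hand. The following is a small sketch of the standard UTF-16 surrogate-pair arithmetic; the helper name `toSurrogatePair` is made up for illustration and is not part of any API.

```javascript
// Split a supplementary-plane code point (U+10000..U+10FFFF) into its
// UTF-16 high and low surrogate code units.
// `toSurrogatePair` is a hypothetical helper name, not a built-in.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not in a supplementary plane");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1D4C1);
console.log(high.toString(16), low.toString(16));      // d835 dcc1
console.log(String.fromCharCode(high, low) === "𝓁");   // true
console.log("𝓁".codePointAt(0).toString(16));          // 1d4c1
```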

The vscode API works with UTF16 positions and ranges, so here is how the API can represent positions in a file with 𝓁 as the single content:

new vscode.Position(0,0); // the position before `𝓁`
new vscode.Position(0,1); // an invalid position which will be coerced to (0,0) or (0,2)
new vscode.Position(0,2); // the position after `𝓁`

new vscode.Range(0,0,0,2); // the range encompassing `𝓁`
new vscode.Range(0,0,0,1); // an invalid range that will be coerced to (0,0,0,2)
new vscode.Range(0,1,0,2); // an invalid range that will be coerced to (0,0,0,2)
new vscode.Range(0,1,0,1); // an invalid range that will be coerced to (0,0,0,0) or (0,2,0,2)
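
To see why offset 1 is not a valid position here, one can check whether a UTF-16 offset falls between the two halves of a surrogate pair. The sketch below is only an illustration: `isInsideSurrogatePair` and `snapToCodePointBoundary` are hypothetical helpers, and the snapping direction is an assumption, not a description of how VS Code coerces positions.

```javascript
// Returns true when `offset` lies between a high surrogate and the low
// surrogate that follows it, i.e. in the middle of one code point.
// Both helpers are hypothetical names used only for this sketch.
function isInsideSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) {
    return false;
  }
  const prev = text.charCodeAt(offset - 1);
  const next = text.charCodeAt(offset);
  return 0xD800 <= prev && prev <= 0xDBFF && 0xDC00 <= next && next <= 0xDFFF;
}

// Move an offset that splits a pair to the start of that code point.
// (Snapping backwards is an arbitrary choice made for this example.)
function snapToCodePointBoundary(text, offset) {
  return isInsideSurrogatePair(text, offset) ? offset - 1 : offset;
}

const text = "𝓁";
console.log(isInsideSurrogatePair(text, 0));   // false
console.log(isInsideSurrogatePair(text, 1));   // true
console.log(isInsideSurrogatePair(text, 2));   // false
console.log(snapToCodePointBoundary(text, 1)); // 0
```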

Does that answer your questions?

Thank you for reply! And sorry for my unclear issue message.

Yes, this behaviour is caused by the difference between code points (or visible characters; in this case U+1D4C1) and code units (0xD835 and 0xDCC1). The point is that some languages report location information in code points and others in code units. In particular, Agda (and Haskell's standard Text type) reports locations as a number of code points, whereas Node.js counts code units. This difference forces me to adopt a dirty hack to convert between those two different conceptions of "length". (I had to extract the whole string from the document, convert it to an array of code points, slice it, and then use String.prototype.length: [...document.getText()].slice(0, n).join("").length;, which is obviously inefficient.)

As an extension writer, I think, at least:

The API doc should clarify that the length of a string means its length in UTF-16 code units, and
utility APIs should be provided...
- to create Ranges from code point locations, in addition to the current code-unit-based ones, and
- to convert from code-unit length to code-point length and vice versa.
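
For illustration, the second kind of utility (code-unit offset to code-point offset) could look roughly like the sketch below. The name `utf16OffsetToCodePointOffset` is hypothetical, and the treatment of an offset that splits a surrogate pair (counted as lying after that code point) is an assumption made for this example.

```javascript
// Count how many code points start before the given UTF-16 offset.
// Hypothetical helper; an offset in the middle of a surrogate pair is
// treated as if it were located after that code point.
function utf16OffsetToCodePointOffset(str, utf16Offset) {
  const end = Math.min(utf16Offset, str.length);
  let count = 0;
  for (let i = 0; i < end; count++) {
    const unit = str.charCodeAt(i);
    const isHigh = 0xD800 <= unit && unit <= 0xDBFF;
    const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    const isPair = isHigh && 0xDC00 <= next && next <= 0xDFFF;
    i += isPair ? 2 : 1; // a valid pair occupies two code units
  }
  return count;
}

console.log(utf16OffsetToCodePointOffset("a𝓁b", 0)); // 0
console.log(utf16OffsetToCodePointOffset("a𝓁b", 1)); // 1
console.log(utf16OffsetToCodePointOffset("a𝓁b", 3)); // 2
console.log(utf16OffsetToCodePointOffset("a𝓁b", 4)); // 3
```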

(Strictly speaking, there is yet another notion of character: the grapheme, which may be composed of multiple code points. Graphemes could also be considered as yet another candidate unit for specifying ranges.)
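
To illustrate the three notions of length, the sketch below compares them for one string. It assumes a runtime that provides `Intl.Segmenter` (recent Node.js versions and browsers do); the sample string is made up for this example.

```javascript
// "é" written as "e" + COMBINING ACUTE ACCENT (U+0301), followed by 𝓁 (U+1D4C1).
const sample = "e\u0301\u{1D4C1}";

const codeUnits = sample.length;       // UTF-16 code units
const codePoints = [...sample].length; // Unicode code points
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(sample)].length; // user-perceived characters

console.log(codeUnits, codePoints, graphemes); // 4 3 2
```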

In addition, users looking at the editor window regard 𝓁 as a single letter, so the current behaviour of the word counter and "Go to Line..." is likely to be confusing.

I don't quite understand your conversion logic. I would have expected that converting between utf32 offsets and utf16 offsets does not require allocation. e.g.:

/**
 * Loop through a JS string and log each code point with its code point offset.
 */
function loopCodePoints(str) {
  for (let i = 0, codePointCount = 0, len = str.length; i < len; i++, codePointCount++) {
    const charCode = str.charCodeAt(i);
    let codePoint = charCode;
    if (0xD800 <= charCode && charCode <= 0xDBFF && i + 1 < len) {
      // this code unit is a high surrogate
      const nextCharCode = str.charCodeAt(i + 1);
      if (0xDC00 <= nextCharCode && nextCharCode <= 0xDFFF) {
        // the next code unit is a low surrogate: combine the pair
        i++;
        codePoint = ((charCode - 0xD800) << 10) + (nextCharCode - 0xDC00) + 0x10000;
      }
    }
    // an unpaired surrogate is logged as-is instead of being skipped
    console.log(`codePoint@${codePointCount}: ${codePoint}`);
  }
}

I agree that the API doc should be improved. Would you be willing to create a PR to improve it?
Generally, the VS Code API tries to avoid utility APIs because the entire npm ecosystem is available to extensions already.
I agree that things can get complicated around graphemes. Only recently we implemented Unicode's grapheme breaking rules.
At that time (a few months ago), we have chosen to render code points in the status bar (except when Tab is involved, which is counted as 1...tabSize depending on where it occurs)
I am not sure what kind of encoding "Go to..." should accept. Should it accept utf32 offsets, grapheme offsets, utf16 offsets? Mostly for historical reasons, it now chooses to treat the input as a utf16 offset.

Looking forward to your thoughts.

> I don't quite understand your conversion logic. I would have expected that converting between utf32 offsets and utf16 offsets does not require allocation. e.g.:

Thank you for providing a neat example! My dirty workaround was intended to do the following things:

Suppose one wants to get the UTF-16 offset of the n-th Unicode code point in the string str.
Use the [...str] notation to split the string into an array of characters (code points, not code units);
then slice it to truncate the array to n elements;
finally, join the array into a string and use the length property to get the corresponding UTF-16 offset.

This approach includes, as you pointed out, unnecessary extra allocations, and hence the logic you provided seems much preferable to me. Thanks again!
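
For comparison, here is a minimal sketch of the same conversion (code point offset to UTF-16 offset) done in one pass without building intermediate arrays or strings, next to the spread-based workaround. The function names are hypothetical and exist only for this example.

```javascript
// Allocation-free: walk the string once and stop after `n` code points.
// `codePointOffsetToUtf16Offset` is a hypothetical helper name.
function codePointOffsetToUtf16Offset(str, n) {
  let i = 0;
  for (let seen = 0; seen < n && i < str.length; seen++) {
    const unit = str.charCodeAt(i);
    const isHigh = 0xD800 <= unit && unit <= 0xDBFF;
    const next = i + 1 < str.length ? str.charCodeAt(i + 1) : 0;
    i += (isHigh && 0xDC00 <= next && next <= 0xDFFF) ? 2 : 1;
  }
  return i;
}

// The spread-based workaround described above, kept only for comparison.
function viaSpread(str, n) {
  return [...str].slice(0, n).join("").length;
}

const str = "x𝓁y";
for (let n = 0; n <= 3; n++) {
  console.log(n, codePointOffsetToUtf16Offset(str, n), viaSpread(str, n));
}
// 0 0 0
// 1 1 1
// 2 3 3
// 3 4 4
```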

> Generally, the VS Code API tries to avoid utility APIs because the entire npm ecosystem is available to extensions already.

By the way, I have another concern about efficiency, and I think it is worth providing a dedicated API in that respect. If I understand correctly, the document.getText(range?) method returns a copy of the entire content of the document (or of the subtext in the specified range). So, even with your code-unit offset detection logic, one has to allocate a copy of the document content. And, to be efficient, one has to maintain and update some kind of cache of the correspondence between code-point and code-unit offsets according to TextDocumentChangeEvent; such an update procedure needs another allocation of the text in the updated range and a re-calculation of that correspondence. I think it becomes rather tedious if one has to write such an extra hack every time one deals with a language whose conception of character offset differs from VS Code's. Since VS Code itself has all the needed information, I think it would be ideal to provide such boilerplate logic as a default API.

> I agree that the API doc should be improved. Would you be willing to create a PR to improve it?

I will try later. It is the microsoft/vscode-docs repo that I should send such PRs to, right?

> At that time (a few months ago), we have chosen to render code points in the status bar (except when Tab is involved, which is counted as 1...tabSize depending on where it occurs)

Oh, I didn't know that. It seems that both the latest Stable release (1.41.1) and the Insiders release (1.42.0-insider) show Ln 1, Col 2 in the status bar if one creates a file which contains only 𝓁. Perhaps that change is not yet released, right?

> I am not sure what kind of encoding "Go to..." should accept. Should it accept utf32 offsets, grapheme offsets, utf16 offsets? Mostly for historical reasons, it now chooses to treat the input as a utf16 offset.

Yes, that is another tough problem to settle. I think we should use grapheme offsets for "Go To..." because program source code can contain string literals (or perhaps identifiers) with graphemes that are not necessarily expressed as a single UTF-16 code unit. Even if that turns out to be much harder to do, I think we should at least use UTF-32 offsets (code point offsets) for the "Go To..." command. There are quite a few programming languages that allow (and even recommend) using Unicode characters in keywords and/or variable and function identifiers; among them are, for example, Haskell (1, 2), Agda, Lean, Coq and so on. Since there are such languages with Unicode identifiers, I think we should at least use code-point offsets for "Go To...".

The text of the document is already allocated and available in the extension host process. Most JS VMs, including v8, are quite smart and don't allocate strings all the time, but it all depends on how the strings are constructed, e.g. str1 + str2 very often returns a ConcatString which simply points to the first and second strings that were concatenated. That being said, I believe the best way in this case would be to use TextDocument.lineAt and work only with the text on a line. If your language server returns positions as [line, codePointOffset], you could get ahold of the line text and convert offsets as needed inside that line. Unless you work with super long lines, the performance should be fine, I believe.
Building a cache that gets updated as document content changes come in is a very interesting exercise, but I have to be honest and say that I don't believe we will be adding this utility / conversion API unless we have lots of folks really really needing it. Even then, we would probably guide them to share a node module between them and only if a common cache would really pay off, would we consider building this in our core. The reason is that we want to have a lean API that covers the most common use cases. We do not want to cover 100% of use cases if there are ways to avoid that. Also, there is nothing more we have inside VS Code, so IMHO the difficulty of converting from UTF16 to UTF32 offsets is the same even if we were to implement this in VS Code.
1.41.1 contains the change I refer to. When opening a file containing a single 𝓁 inside, the status bar will render (1,1) at the beginning of the file and (1,2) after the 𝓁 character. When selecting the 𝓁 character, the status bar will render 1 selected. The status bar does not begin counting at 0 like our API because most humans don't like to work with that (the first line in a file is rendered as 1 in all of the code editors I know).
I agree that there are compilers that use code point offsets when reporting errors, but we must work with what is currently most likely to be happening. We can definitely enrich the "Go to..." to accept code point offsets, but how should that be designed and what should be the default -- accepting utf16 offsets or utf32 offsets is IMHO unclear.
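
The per-line approach suggested above (using TextDocument.lineAt and converting only within one line) could be sketched roughly as follows. This is an illustration under assumptions: `toUtf16Position` is a hypothetical helper, the result is a plain object rather than a `vscode.Position`, and `fakeDocument` is a stand-in so the sketch runs outside VS Code.

```javascript
// Convert a [line, codePointOffset] pair, as a code-point-based compiler
// might report it, into a { line, character } pair in UTF-16 code units.
// `document` only needs a `lineAt(line)` method returning `{ text }`,
// which matches the shape of vscode.TextDocument.lineAt; the helper name
// and the plain-object result are assumptions made for this sketch.
function toUtf16Position(document, line, codePointOffset) {
  const text = document.lineAt(line).text;
  let character = 0;
  for (let seen = 0; seen < codePointOffset && character < text.length; seen++) {
    // codePointAt returns a value above 0xFFFF only for a valid surrogate pair
    character += text.codePointAt(character) > 0xFFFF ? 2 : 1;
  }
  return { line, character };
}

// Stand-in for a real TextDocument so the sketch runs outside VS Code.
const fakeDocument = {
  lines: ["𝓁 = 𝒶", "plain ascii"],
  lineAt(line) {
    return { text: this.lines[line] };
  },
};

console.log(toUtf16Position(fakeDocument, 0, 1)); // { line: 0, character: 2 }
console.log(toUtf16Position(fakeDocument, 0, 5)); // { line: 0, character: 7 }
console.log(toUtf16Position(fakeDocument, 1, 5)); // { line: 1, character: 5 }
```

Inside an extension, the result would presumably be passed on as `new vscode.Position(pos.line, pos.character)`. Only the text of a single line is examined per conversion, so the cost is bounded by the line length.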

This issue has been closed automatically because it needs more information and has not had recent activity. See also our issue reporting guidelines.

Happy Coding!

microsoft / vscode

Single Unicode Mathematical Letter counted as two #87868