microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License
161.77k stars 28.44k forks source link

Single Unicode Mathematical Letter counted as two #87868

Closed konn closed 4 years ago

konn commented 4 years ago

Steps to Reproduce:

  1. Create new document
  2. Input a single unicode mathematical script small l: 𝓁
  3. Invoke Go to line... to go to :1:2 (character 2 of line 1).
  4. A cursor must be placed 1-letter after 𝓁, but actually 1-letter before it!

Does this issue occur when all extensions are disabled?: Yes.

Range type and setDecoration function in VSCode API also seems works incorrectly with such letters too. Since such math letters are in wide use in the realm of proof assistants, such as Agda and/or Lean, it becomes really difficult to implement extensions to work with them. Indeed, this bug was found when I tried to implement Agda client for vscode, which uses location information provided by the compiler to do the correct syntax colouring (Since Agda has really flexible syntax extension mechanism, we need to communicate with the compiler to get precise syntax highlighting).

It seems that unicode math characters with the following kind of character name suffers from this issue:

alexdima commented 4 years ago

@konn

The vscode API uses UTF16 offsets for Position and Range. These have been chosen because they reflect the offsets in native JavaScript strings.

For example, 𝓁, MATHEMATICAL SCRIPT SMALL L' (U+1D4C1) is represented in the following way in JavaScript strings (UTF16):

console.log(`𝓁`.length); // 2
console.log(`𝓁`.charCodeAt(0)); // 55349 --- 0xD835 
console.log(`𝓁`.charCodeAt(1)); // 56513 --- 0xDCC1 

The vscode API works with UTF16 positions and ranges, so here is how the API can represent positions in a file with 𝓁 as the single content:

new vscode.Position(0,0); // the position before `𝓁`
new vscode.Position(0,1); // an invalid position which will be coerced to (0,0) or (0,2)
new vscode.Position(0,2); // the position after `𝓁`

new vscode.Range(0,0,0,2); // the range encompassing `𝓁`
new vscode.Range(0,0,0,1); // an invalid range that will be coerced to (0,0,0,2)
new vscode.Range(0,1,0,2); // an invalid range that will be coerced to (0,0,0,2)
new vscode.Range(0,1,0,1); // an invalid range that will be coerced to (0,0,0,0) or (0,2,0,2)

Does that answer your questions?

konn commented 4 years ago

Thank you for reply! And sorry for my unclear issue message.

Yes, this behaviour is caused by the difference of code point (or visible character; in this case U+1D4C1) and code unit (0xD835 and 0xDCC1). The point is, some language provides location information by code point length and others by code unit. In particular, Agda (and Haskell's standard Text type) provides, as a location indicator, the number of code points as length information; on the other hand, Node.js returns that of code units. This difference forces me to adopt the dirty hack to convert between those two different conceptions of "length". (I had to extract the whole strings from the document, converting it to an array of code points, slice, and then use String.prototype.length: [...document.getText()].slice(0, n).join("").length;, which is apparently inefficient).

As an extension writer, I think, at least:

(Strictly speaking, there is yet another factor of character: grapheme, composed of multiple code units. These could also be considered as yet another candidate of range specification unit)

In addition, users facing editor window regards 𝓁 as a single letter; and the current behaviour of word counter and Go to Line...` should be confusing.

alexdima commented 4 years ago

I don't quite understand your conversion logic. I would have expected that converting between utf32 offsets and utf16 offsets does not require allocation. e.g.:

/**
 * Loop through a JS string and log the code points and the code points offsets
 */
function loopCodePoints(str) {
  for (let i = 0, codePointCount = 0, len = str.length; i < len; i++, codePointCount++) {
    const charCode = str.charCodeAt(i);
    if (0xD800 <= charCode && charCode <= 0xDBFF && i + 1 < len) {
      // this character is a high surrogate
      const nextCharCode = str.charCodeAt(i + 1);
      if (0xDC00 <= nextCharCode && nextCharCode <= 0xDFFF) {
        // the next character is a low surrogate
        i++;
        const codePoint = ((charCode - 0xD800) << 10) + (nextCharCode - 0xDC00) + 0x10000;
        console.log(`codePoint@${codePointCount}: ${codePoint}`);
      }
    } else {
      const codePoint = charCode;
      console.log(`codePoint@${codePointCount}: ${codePoint}`);
    }
  }
}

Looking forward to your thoughts.

konn commented 4 years ago

I don't quite understand your conversion logic. I would have expected that converting between utf32 offsets and utf16 offsets does not require allocation. e.g.:

Thank you for providing a neat example! My dirty workaround was intended to do the following things:

  1. Suppose one wants to get the UTF-16 offset of the n-th Unicode code point in the string str.
  2. Use [... str] notation to split strings into an array of characters (not code units);
  3. Then slice it to truncate the array to n element;
  4. join the array into a string and use length property to get corresponding UTF-16 offset.

This approach includes, as you pointed out, unnecessary extra allocations, and hence the logic you provided seems much preferable to me. Thanks again!

  • Generally, the VS Code API tries to avoid utility APIs because the entire npm ecosystem is available to extensions already.

By the way, I have another concern about the efficiency, and I think it is worth providing dedicated API in that respect. If I understand correctly, document.getText(range?) method returns a copy of the entire (or subtext in the specified range) content of the document. So, even with your code-unit offset detection logic, one has to allocate the copy of the document content. And, to be efficient, one has to maintain and update some kind of cache of the information about such correspondence according to TextDocumentChangeEvent; and such update procedure needs another allocation of the text in updated range and re-calculation of the point-unit offset correspondence. I think it can become a little tedious if one has to do such extra hack every time one has to do with a language with a different conception of character offset different to vscode's. Since VSCode itself has all the needed information, I think it is ideal to provide such a boilerplate logic as a default API.

  • I agree that the API doc should be improved. Would you be willing to create a PR to improve it?

I will try later. Is it microsoft/vscode-docs repo that I have to make such PRs, right?

  • At that time (a few months ago), we have chosen to render code points in the status bar (except when Tab is involved, which is counted as 1...tabSize depending on where it occurs)

Oh, I didn't know that. It seems the latest Stable release (1.41.1) and Insiders release (1.42.0-insider) shows Ln 1, Col 2 in the status bar if one creates a file which contains 𝓁 only. Perhaps that change is not yet released, right?

  • I am not sure what kind of encoding "Go to..." should accept. Should it accept utf32 offsets, grapheme offsets, utf16 offsets? Mostly for historical reasons, it now chooses to treat the input as a utf16 offset.

Yes, that is another tough problem to settle. I think we should use grapheme offset for "Go To..." because the program source code can contain string literals (or perhaps identifiers) which contain (not necessarily expressed as a single UTF-16 code-unit) graphemes. Even if it is much harder to do so, I think we have to at least use UTF-32 offset (code point offset) for "Go To..." command. There is not-so-few number of programming languages which allows (and even recommends) to use Unicode characters for keywords and/or variable/function identifiers. Amongst them are, for example, Haskell (1, 2), Agda, Lean, Coq and so on. Since there are such languages with Unicide identifiers, I think we should at least use code-point offset for "Go To...".

alexdima commented 4 years ago
vscodebot[bot] commented 4 years ago

This issue has been closed automatically because it needs more information and has not had recent activity. See also our issue reporting guidelines.

Happy Coding!