Closed emilyyyylime closed 7 months ago
Thanks for raising this! Could you share the text from the screenshot, and I'll see if I can reproduce this.
uh ¿Qué hace él?
I guess
I believe you may have overcompensated with your fix, using combining characters such as U+0303 Combining Tilde (with 'e' displays as ẽ
) — the highlight appears too early.
It seems to me, skimming over the LSP specification^1, that the offset is not measured in grapheme clusters as implemented in your fix, but rather by code unit offsets, with the client able to negotiate the representation (UTF-8 / UTF-16 / UTF-32) with the server.
I don't have permission to reopen this issue; I'd appreciate if you could.
Ah good catch .. I can see ẽ
is rendered with a width of two characters in vscode but typos-vscode is counting it as one (i think).
@tekumara it should render as a single grapheme. However, what's happening here as I understand is that you used unicode-segmentation
in your fix to count grapheme clusters, however the LSP protocol specifies that character offsets should be negotiated between the server and the client and reported using the negotiated representation thereafter. So if UTF-16
were negotiated (which is required to be supported by all servers and would thus be fully compatible on its own), you would return the number of UTF-16 units the characters encode into (1 for each character below U+D800 and 2 for characters after that).[^1] If UTF-8
were negotiated you would return the number of bytes that would be needed to encode the text as UTF-8 (so 1 for the first 128 codepoints, 2 for the next 1920, and so on).[^2] For UTF-32
, you would simply use the number of code points.[^3]
[^1]: Rust: str.chars().map(char::len_utf16).sum()
[^2]: Rust: str.len()
[^3]: Rust: str.chars().count()
Thanks @emilyyyylime for the really helpful explanation. I've pushed a fix which counts utf-16 code units, since that is the default position encoding. The other position encodings aren't supported yet.
I'm using Helix, but nothing like this appears to be happening with other LSPs, so I imagine the issue is with Typos.
When a line has multi-byte codepoints, the highlighting appears in the wrong place; seemingly reporting byte offset when a character offset is expected: