tekumara / typos-lsp

Source code spell checker for Visual Studio Code, Neovim and other LSP clients
https://marketplace.visualstudio.com/items?itemName=tekumara.typos-vscode
MIT License
168 stars · 4 forks

Wrong unicode handling? #22

Closed: emilyyyylime closed this issue 7 months ago

emilyyyylime commented 7 months ago

I'm using Helix, but nothing like this appears to be happening with other LSPs, so I imagine the issue is with Typos.

When a line contains multi-byte code points, the highlighting appears in the wrong place, seemingly reporting a byte offset where a character offset is expected: image showing the Spanish word 'hace' being flagged as a misspelling of 'have', with the highlight two characters too far to the right.

tekumara commented 7 months ago

Thanks for raising this! Could you share the text from the screenshot, and I'll see if I can reproduce this.

emilyyyylime commented 7 months ago

uh ¿Qué hace él? I guess

emilyyyylime commented 7 months ago

I believe your fix may have overcompensated: with combining characters such as U+0303 COMBINING TILDE (which together with 'e' renders as 'ẽ'), the highlight now appears too early. Skimming the LSP specification, it seems to me that offsets are not measured in grapheme clusters, as your fix implements, but in code units, with the client able to negotiate the encoding (UTF-8 / UTF-16 / UTF-32) with the server.

I don't have permission to reopen this issue; I'd appreciate it if you could.

tekumara commented 7 months ago

Ah, good catch .. I can see 'ẽ' is rendered with a width of two characters in VS Code, but typos-vscode is counting it as one (I think).

emilyyyylime commented 7 months ago

@tekumara it should render as a single grapheme. What's happening here, as I understand it, is that your fix uses unicode-segmentation to count grapheme clusters, but the LSP protocol specifies that the position encoding is negotiated between server and client, and offsets are then reported in code units of the negotiated encoding. If UTF-16 were negotiated (support for it is mandatory for all servers, so it alone would be fully compatible), you would return the number of UTF-16 code units the text encodes to: 1 for each code point below U+10000 and 2 for each code point at or above it.[^1] If UTF-8 were negotiated, you would return the number of bytes needed to encode the text as UTF-8: 1 for the first 128 code points, 2 for the next 1920, and so on.[^2] For UTF-32, you would simply count code points.[^3]

[^1]: Rust: `str.chars().map(char::len_utf16).sum()`
[^2]: Rust: `str.len()`
[^3]: Rust: `str.chars().count()`
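A quick way to see the three encodings diverge is to measure the same string under each. This is a minimal sketch using only the Rust standard library, with the two-code-point sequence from this thread ('e' followed by U+0303):

```rust
fn main() {
    // "ẽ" as discussed above: one grapheme cluster, but two code points.
    let s = "e\u{0303}";

    // UTF-8 position encoding: count bytes ('e' = 1 byte, U+0303 = 2 bytes).
    let utf8_units = s.len();

    // UTF-16 position encoding: count UTF-16 code units (both code points
    // are in the Basic Multilingual Plane, so 1 unit each).
    let utf16_units: usize = s.chars().map(char::len_utf16).sum();

    // UTF-32 position encoding: count code points directly.
    let utf32_units = s.chars().count();

    println!("utf8={utf8_units} utf16={utf16_units} utf32={utf32_units}");
    // → utf8=3 utf16=2 utf32=2
}
```

All three disagree with the grapheme-cluster count (1), which is why counting grapheme clusters made the highlight appear too early.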

tekumara commented 7 months ago

Thanks @emilyyyylime for the really helpful explanation. I've pushed a fix that counts UTF-16 code units, since that's the default position encoding. The other position encodings aren't supported yet.
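For reference, the core of such a fix is converting a UTF-8 byte offset within a line (what a Rust server naturally works with) into a UTF-16 code-unit column. A minimal sketch, using the example sentence from this thread; `byte_to_utf16_col` is a hypothetical helper, not typos-lsp's actual code:

```rust
/// Hypothetical helper: convert a UTF-8 byte offset within `line` into a
/// UTF-16 code-unit column, as LSP's default "utf-16" encoding expects.
/// Assumes `byte_offset` lies on a char boundary.
fn byte_to_utf16_col(line: &str, byte_offset: usize) -> usize {
    line[..byte_offset].chars().map(char::len_utf16).sum()
}

fn main() {
    let line = "¿Qué hace él?";

    // Byte offset of "hace": '¿' (2 bytes) + "Qu" (2) + 'é' (2) + ' ' (1) = 7.
    let byte_off = line.find("hace").unwrap();
    assert_eq!(byte_off, 7);

    // UTF-16 column is 5: each of ¿, Q, u, é, and the space is one unit.
    assert_eq!(byte_to_utf16_col(line, byte_off), 5);
    println!("utf16 col = {}", byte_to_utf16_col(line, byte_off));
    // → utf16 col = 5
}
```

Reporting the raw byte offset (7) instead of the UTF-16 column (5) reproduces exactly the two-character drift shown in the original screenshot.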