haskell / haskell-language-server

Official haskell ide support via language server (LSP). Successor of ghcide & haskell-ide-engine.
Apache License 2.0
2.65k stars 354 forks source link

Unicode symbols >= 𐀀 are broken #2646

Open Bodigrim opened 2 years ago

Bodigrim commented 2 years ago

Your environment

Which OS do you use: MacOS Which LSP client (editor/plugin) do you use: Sublime Code

Steps to reproduce

aaa :: Word 
aaa = 10

𐀀𐀀𐀀 :: Word 
𐀀𐀀𐀀 = 10 

Expected behaviour

Both aaa and 𐀀𐀀𐀀 should be underlined as unused bindings.

Actual behaviour

aaa is underlined correctly, but only the first character of 𐀀𐀀𐀀 is underlined.

ΠΈΠ·ΠΎΠ±Ρ€Π°ΠΆΠ΅Π½ΠΈΠ΅

That's because HLS does not distinguish positions returned from GHC (which are in code points = characters) from positions mandated by LSP (which are UTF-16 code units). Basically, GHC says us that 3 first code points in line 5 are an unused binding. Each 𐀀 is a single character, but 2 UTF-16 code units, so HLS should ask LSP to underline first 6 code units. Instead of this HLS asks LSP to underline only 3, and since the 3rd one is in the middle of the 2nd character, only the 1st character gets underlined.

CC @michaelpj @alanz, this is related to https://github.com/haskell/lsp/pull/392#discussion_r790164996.

Bodigrim commented 2 years ago

Code actions are also broken. E. g., choosing "Delete '𐀀𐀀𐀀'" results in the following program

aaa :: Word 
aaa = 10 10

That's again because HLS meant to ask LSP to delete 8 characters, but asked to delete 8 code points, 6 of which were 𐀀𐀀𐀀.

michaelpj commented 2 years ago

Thanks for the reproduction. To be clear, did you mean that this is related to the issue with splitting in the middle of code units, or is it this one? https://github.com/haskell/lsp/pull/392#discussion_r785379400

That is, do you think this will be fixed by the changes you made to lsp, or is there still something broken?

jneira commented 2 years ago

In any case a regression test (or several of them) here exercising unicode symbols would be great

Bodigrim commented 2 years ago

@michaelpj This is unlikely to be fixed in recent lsp patches. Essentially there must be a place, responsible for (fast) conversion between positions in code points and positions in code units. I think I can provide such low-level routine from text-rope and then someone will have to migrate ghcide to it.

@jneira I’m sorry, I’m unlikely to have capacity to add a regression test. Could someone else possibly please pick it up?

michaelpj commented 2 years ago

I'll look into it (inc tests), I just thought you might already have a hunch as to where the problem lies.

michaelpj commented 2 years ago

Looking at this more, I think this is utterly broken throughout the codebase. We frequently change from GHC SrcSpans to Positions, blindly assuming that the numbers can just be translated between them, which is very wrong. And we use the numbers from Position with functions from Text, which is also wrong.

If I understand correctly, you can't even convert from a postion-in-code-points to a position-in-code-units without having the whole text in question to hand, which seems quite annoying. In that case it probably would be useful to have such a function in text-rope and we'll just have to painfully audit HLS for direct use of SrcSpans.

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

michaelpj commented 2 years ago

Not really sure what the best way to tackle this systematically is. We could have a CodeUnit newtype in lsp that we use for the column offset, that might force us to look at all the use sites. I guess I can't argue that the Text methods should take a CodePoint newtype instead of Int :)

michaelpj commented 2 years ago

Vague plan:

Bodigrim commented 2 years ago

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

Already there: it provides both character-based API (Data.Text.Lines / Data.Text.Rope) and UTF16-based API (Data.Text.Utf16.Lines / Data.Text.Utf16.Rope). I just need to provide conversions.

Bodigrim commented 2 years ago

I realised that text-rope already provides a way to convert between positions: one can Data.Text.Lines.splitAtPosition with character-based counter and then call Data.Text.Utf16.Lines.lengthAsPosition on the prefix to receive UTF-16-based counter. I might provide a dedicated helper in the future, but this would work for a proof of concept.

Looking on ghcide, I think it can really benefit from using Data.Text.Lines from text-rope instead of a plain Text, because this will improve ubiquitous line splitting.

So I'd suggest to wait for an lsp release with text-rope-based VFS, then take another look at this issue.

michaelpj commented 2 years ago

Yes, definitely not planning to do anything before then. Thanks for letting me know how to do the conversion!

Ptival commented 1 month ago

Maybe the example in this issue makes this look too benign, but this bug makes Emacs w/ Haskell LSP completely unusable on projects that use Unicode. For instance with this file:

{-# LANGUAGE UnicodeSyntax #-}
module Mini where
type 𝐿 a = [a]

Once you start editing the last line (e.g. try to replace 𝐿 with a normal L and then back), the LSP backend and what's in your buffer start becoming completely out of sync, and you get error messages such as parse errors or text that is no longer present in your buffer.

michaelpj commented 1 month ago

Yeah, I would expect "unusable" to be about right.

I've been surprised that there hasn't been more complaint on this issue, and I've guessed that means that Unicode just isn't that popular with Haskell developers.

That said, I don't think this would be too hard to fix, just tedious. So if someone is keen to see it done then that could be great!

Ptival commented 1 month ago

I could be bothered enough to attempt a fix over the weekend, though I'm not quite sure where to begin.

Do you think the solution would look more or less like chasing the points at which positions are being exchanged between haskell-language-server and lsp, and figure out the correct invocation of your codePointPositionToPosition and positionToCodePointPosition functions, and its possible consequences on the nearby code?

michaelpj commented 1 month ago

Yes, that's pretty much it. I would probably use hlint to help: add an error rule for the use of the Position constructor directly, make a whitelist, and then start whittling it down. That way we can also hopefully stop it getting worse :sweat_smile:

Ptival commented 1 month ago

Almost immediately running into an odd situation. Consider:

https://github.com/haskell/haskell-language-server/blob/9565d0b2d0b7d2ddf5a982269c103b6fd0a781a0/ghcide/src/Development/IDE/GHC/Error.hs#L89-L91

My gut feeling is this is the place that is ripe for column confusion: supposedly, the srcLocCol is 1-based and code point-based, while Position is 0-based and code unit-based. However, this is nowhere near a VirtualFile.

Chasing up the call chain, I often end up at a StringBuffer as the thing looking the most like the contents of a file, e.g.:

https://github.com/haskell/haskell-language-server/blob/9565d0b2d0b7d2ddf5a982269c103b6fd0a781a0/ghcide/src/Development/IDE/Core/Preprocessor.hs#L146-L147 where the pragmas in a file are being checked and catchSrcErrors eventually calls realSrcLocToPosition, or:

https://github.com/haskell/haskell-language-server/blob/9565d0b2d0b7d2ddf5a982269c103b6fd0a781a0/ghcide/src/Development/IDE/Core/Compile.hs#L174-L189 where maybe the module contents are available in the ms_hspp_buf field of the ParsedModule...

So:

  1. Should I try to grab and pass around these StringBuffers all the way down? Or is this a recipe for disaster?
  2. If so, should this call for versions of your helper converter functions that just take in the contents, without the extra LSP-related stuff in a VirtualFile?
  3. And finally, these are StringBuffers, not Ropes. What would be the expectation here?

Overall, I get the feeling that the parts of the GHCIDE code I'm looking at are fairly LSP agnostic, so reaching the gap to the VirtualFile abstraction feels odd.

Any thoughts?