Open Bodigrim opened 2 years ago
Code actions are also broken. E. g., choosing "Delete 'πππ'" results in the following program
aaa :: Word
aaa = 10 10
That's again because HLS meant to ask LSP to delete 8 characters, but asked to delete 8 code points, 6 of which were πππ
.
Thanks for the reproduction. To be clear, did you mean that this is related to the issue with splitting in the middle of code units, or is it this one? https://github.com/haskell/lsp/pull/392#discussion_r785379400
That is, do you think this will be fixed by the changes you made to lsp
, or is there still something broken?
In any case a regression test (or several of them) here exercising unicode symbols would be great
@michaelpj This is unlikely to be fixed in recent lsp
patches. Essentially there must be a place, responsible for (fast) conversion between positions in code points and positions in code units. I think I can provide such low-level routine from text-rope
and then someone will have to migrate ghcide
to it.
@jneira Iβm sorry, Iβm unlikely to have capacity to add a regression test. Could someone else possibly please pick it up?
I'll look into it (inc tests), I just thought you might already have a hunch as to where the problem lies.
Looking at this more, I think this is utterly broken throughout the codebase. We frequently change from GHC SrcSpan
s to Position
s, blindly assuming that the numbers can just be translated between them, which is very wrong. And we use the numbers from Position
with functions from Text
, which is also wrong.
If I understand correctly, you can't even convert from a postion-in-code-points to a position-in-code-units without having the whole text in question to hand, which seems quite annoying. In that case it probably would be useful to have such a function in text-rope
and we'll just have to painfully audit HLS for direct use of SrcSpan
s.
(Maybe text-rope
should allow using Position
s defined in terms of code-points also, might be more useful for clients other than LSP...)
Not really sure what the best way to tackle this systematically is. We could have a CodeUnit
newtype in lsp
that we use for the column offset, that might force us to look at all the use sites. I guess I can't argue that the Text
methods should take a CodePoint
newtype instead of Int
:)
Vague plan:
CodePointPosition
to the VFS, which is a mirror of Position
with code points for the column offset. Perhaps also CodePointRange
, CodePointLocation
VFS -> URI -> CodePointPosition -> Position
, etc.(Maybe
text-rope
should allow usingPosition
s defined in terms of code-points also, might be more useful for clients other than LSP...)
Already there: it provides both character-based API (Data.Text.Lines
/ Data.Text.Rope
) and UTF16-based API (Data.Text.Utf16.Lines
/ Data.Text.Utf16.Rope
). I just need to provide conversions.
I realised that text-rope
already provides a way to convert between positions: one can Data.Text.Lines.splitAtPosition
with character-based counter and then call Data.Text.Utf16.Lines.lengthAsPosition
on the prefix to receive UTF-16-based counter. I might provide a dedicated helper in the future, but this would work for a proof of concept.
Looking on ghcide
, I think it can really benefit from using Data.Text.Lines
from text-rope
instead of a plain Text
, because this will improve ubiquitous line splitting.
So I'd suggest to wait for an lsp
release with text-rope
-based VFS, then take another look at this issue.
Yes, definitely not planning to do anything before then. Thanks for letting me know how to do the conversion!
Maybe the example in this issue makes this look too benign, but this bug makes Emacs w/ Haskell LSP completely unusable on projects that use Unicode. For instance with this file:
{-# LANGUAGE UnicodeSyntax #-}
module Mini where
type πΏ a = [a]
Once you start editing the last line (e.g. try to replace πΏ
with a normal L
and then back), the LSP backend and what's in your buffer start becoming completely out of sync, and you get error messages such as parse errors or text that is no longer present in your buffer.
Yeah, I would expect "unusable" to be about right.
I've been surprised that there hasn't been more complaint on this issue, and I've guessed that means that Unicode just isn't that popular with Haskell developers.
That said, I don't think this would be too hard to fix, just tedious. So if someone is keen to see it done then that could be great!
I could be bothered enough to attempt a fix over the weekend, though I'm not quite sure where to begin.
Do you think the solution would look more or less like chasing the points at which positions are being exchanged between haskell-language-server and lsp, and figure out the correct invocation of your codePointPositionToPosition
and positionToCodePointPosition
functions, and its possible consequences on the nearby code?
Yes, that's pretty much it. I would probably use hlint
to help: add an error
rule for the use of the Position
constructor directly, make a whitelist, and then start whittling it down. That way we can also hopefully stop it getting worse :sweat_smile:
Almost immediately running into an odd situation. Consider:
My gut feeling is this is the place that is ripe for column confusion: supposedly, the srcLocCol
is 1-based and code point-based, while Position
is 0-based and code unit-based. However, this is nowhere near a VirtualFile
.
Chasing up the call chain, I often end up at a StringBuffer
as the thing looking the most like the contents of a file, e.g.:
https://github.com/haskell/haskell-language-server/blob/9565d0b2d0b7d2ddf5a982269c103b6fd0a781a0/ghcide/src/Development/IDE/Core/Preprocessor.hs#L146-L147
where the pragmas in a file are being checked and catchSrcErrors
eventually calls realSrcLocToPosition
, or:
https://github.com/haskell/haskell-language-server/blob/9565d0b2d0b7d2ddf5a982269c103b6fd0a781a0/ghcide/src/Development/IDE/Core/Compile.hs#L174-L189
where maybe the module contents are available in the ms_hspp_buf
field of the ParsedModule
...
So:
StringBuffer
s all the way down? Or is this a recipe for disaster?VirtualFile
?StringBuffer
s, not Rope
s. What would be the expectation here?Overall, I get the feeling that the parts of the GHCIDE code I'm looking at are fairly LSP agnostic, so reaching the gap to the VirtualFile
abstraction feels odd.
Any thoughts?
More trail of destruction: https://gitlab.haskell.org/ghc/ghc/-/issues/25396
Your environment
Which OS do you use: MacOS Which LSP client (editor/plugin) do you use: Sublime Code
Steps to reproduce
Expected behaviour
Both
aaa
andπππ
should be underlined as unused bindings.Actual behaviour
aaa
is underlined correctly, but only the first character ofπππ
is underlined.That's because HLS does not distinguish positions returned from GHC (which are in code points = characters) from positions mandated by LSP (which are UTF-16 code units). Basically, GHC says us that 3 first code points in line 5 are an unused binding. Each
π
is a single character, but 2 UTF-16 code units, so HLS should ask LSP to underline first 6 code units. Instead of this HLS asks LSP to underline only 3, and since the 3rd one is in the middle of the 2nd character, only the 1st character gets underlined.CC @michaelpj @alanz, this is related to https://github.com/haskell/lsp/pull/392#discussion_r790164996.