Closed michaelpj closed 2 years ago
@Bodigrim I attempted to follow the recipe you gave me here, but I'd appreciate a look to make sure I didn't do it wrong!
The process does seem to be quite expensive as it stands: both cases have to do a Rope1.fromText . Rope2.toText
conversion, and going from code points to code units requires doing it twice! Awful.
Ah! The UTF-16 TextLines
is just a newtype wrapper around the other one! So these conversions should be free, but the constructor isn't exported. Either exporting it or providing some conversion functions would be very helpful!
Would it be useful to track if the file consists entirely of latin-1
or whatever subset of codepoints there are such that the two notions of position are equivalent, and if so, avoid doing all the conversion work in those cases?
Ah! The UTF-16
TextLines
is just a newtype wrapper around the other one! So these conversions should be free, but the constructor isn't exported. Either exporting it or providing some conversion functions would be very helpful!
Please try using an unreleased version of text-rope
. It makes TextLines
interchangeable.
source-repository-package
type: git
location: https://github.com/Bodigrim/text-rope
The next step on my side would be to make a Rope
type, allowing native indexing in both ways without conversions. The question is: do we foresee a use case, where a client is genuinely interested in UTF-16 positions?
@wz1000 let's make it correct first, and care about optimizations later. Unless you deal with extremely long lines (not files), this should not be a bottleneck.
Would it be useful to track if the file consists entirely of latin-1
Maybe! That would also have downsides - it would be some amount of ongoing work every time the file changes.
I agree with Bodigrim, basically: let's do it this way and optimise if it turns out to be a bottleneck.
The next step on my side would be to make a Rope type, allowing native indexing in both ways without conversions.
Well, since the rope types are interchangeable now, the current approach doesn't seem too bad. But maybe you have an idea for how to make it nicer or faster!
The question is: do we foresee a use case, where a client is genuinely interested in UTF-16 positions?
I'm not exactly sure what you mean, but in our case this is really inescapable. The LSP spec insists that positions use UTF-16 code units, and GHC gives us positions using code points. We have to convert when the client sends us positions, and back when we send them to the client.
Maybe they'll change the spec (basically everyone hates the status quo), but I wouldn't hold my breath.
Please try using an unreleased version of text-rope. It makes TextLines interchangeable.
That looks perfect. I'm also happy to have conversion functions - perhaps you want to retain the option to change the representation so that the two modules differ.
No, I had entirely confused myself, sorry @Bodigrim . The current version is using the Rope
types, not the TextLines
types. I need to check whether going to TextLines
and doing the work is faster than staying with ropes but converting between them.
Okay, I pushed a version using the unreleased interchangeable TextLines
. I think this is better: we still pay a cost linear in the document to convert to TextLines
, but we only do it once...
Perhaps the thing to do would be to extract just the line in question, and then only operate on that. That would probably allow us to do something relatively stupid and not worry too much, since we're relatively unlikely to have extremely long lines.
Okay, I think I'm done with this. Perhaps @wz1000 you could take a look and approve.
Thanks Pepe!
LSP
Position
s use UTF-16 code units for offsets within lines; most other sane tools (like GHC) use Unicode code points. We need to use the right one in the right place, otherwise we get issues like https://github.com/haskell/haskell-language-server/issues/2646.This is pretty unpleasant, since code points are variable-size, so you can't do the conversion without having the file text itself.
This PR provides a type for positions using code points (for clients to use to help them be less confused) and functions for using the VFS to convert between those and LSP positions.