Add functions to convert between code-point and code-unit positions

michaelpj commented 2 years ago

LSP Positions use UTF-16 code units for offsets within lines; most other sane tools (like GHC) use Unicode code points. We need to use the right one in the right place, otherwise we get issues like https://github.com/haskell/haskell-language-server/issues/2646.

This is pretty unpleasant, since code points are variable-size, so you can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to use to help them be less confused) and functions for using the VFS to convert between those and LSP positions.

michaelpj commented 2 years ago

@Bodigrim I attempted to follow the recipe you gave me here, but I'd appreciate a look to make sure I didn't do it wrong!

The process does seem to be quite expensive as it stands: both cases have to do a Rope1.fromText . Rope2.toText conversion, and going from code points to code units requires doing it twice! Awful.

michaelpj commented 2 years ago

Ah! The UTF-16 TextLines is just a newtype wrapper around the other one! So these conversions should be free, but the constructor isn't exported. Either exporting it or providing some conversion functions would be very helpful!

wz1000 commented 2 years ago

Would it be useful to track if the file consists entirely of latin-1 or whatever subset of codepoints there are such that the two notions of position are equivalent, and if so, avoid doing all the conversion work in those cases?

Bodigrim commented 2 years ago

Ah! The UTF-16 TextLines is just a newtype wrapper around the other one! So these conversions should be free, but the constructor isn't exported. Either exporting it or providing some conversion functions would be very helpful!

Please try using an unreleased version of text-rope. It makes TextLines interchangeable.

source-repository-package
  type: git
  location: https://github.com/Bodigrim/text-rope

The next step on my side would be to make a Rope type, allowing native indexing in both ways without conversions. The question is: do we foresee a use case, where a client is genuinely interested in UTF-16 positions?

@wz1000 let's make it correct first, and care about optimizations later. Unless you deal with extremely long lines (not files), this should not be a bottleneck.

michaelpj commented 2 years ago

Would it be useful to track if the file consists entirely of latin-1

Maybe! That would also have downsides - it would be some amount of ongoing work every time the file changes.

I agree with Bodigrim, basically: let's do it this way and optimise if it turns out to be a bottleneck.

The next step on my side would be to make a Rope type, allowing native indexing in both ways without conversions.

Well, since the rope types are interchangeable now, the current approach doesn't seem too bad. But maybe you have an idea for how to make it nicer or faster!

The question is: do we foresee a use case, where a client is genuinely interested in UTF-16 positions?

I'm not exactly sure what you mean, but in our case this is really inescapable. The LSP spec insists that positions use UTF-16 code units, and GHC gives us positions using code points. We have to convert when the client sends us positions, and back when we send them to the client.

Maybe they'll change the spec (basically everyone hates the status quo), but I wouldn't hold my breath.

michaelpj commented 2 years ago

Please try using an unreleased version of text-rope. It makes TextLines interchangeable.

That looks perfect. I'm also happy to have conversion functions - perhaps you want to retain the option to change the representation so that the two modules differ.

michaelpj commented 2 years ago

No, I had entirely confused myself, sorry @Bodigrim . The current version is using the Rope types, not the TextLines types. I need to check whether going to TextLines and doing the work is faster than staying with ropes but converting between them.

michaelpj commented 2 years ago

Okay, I pushed a version using the unreleased interchangeable TextLines. I think this is better: we still pay a cost linear in the document to convert to TextLines, but we only do it once...

michaelpj commented 2 years ago

Perhaps the thing to do would be to extract just the line in question, and then only operate on that. That would probably allow us to do something relatively stupid and not worry too much, since we're relatively unlikely to have extremely long lines.

michaelpj commented 2 years ago

Okay, I think I'm done with this. Perhaps @wz1000 you could take a look and approve.

michaelpj commented 2 years ago

Thanks Pepe!

haskell / lsp

Add functions to convert between code-point and code-unit positions #407