I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities. We could issue a warning to the client saying there might be some offset errors when utf-16 is the only supported format. It could look something like this:
```diff
 async fn initialize(&self, params: InitializeParams) -> LspResult<InitializeResult> {
     *self.workspace_folders.write().await = params.workspace_folders;
+    // `Client::show_message` is async, so the warning can't be awaited inside
+    // an `and_then` closure; a `match` keeps us in the surrounding async fn.
+    let position_encoding = match params
+        .capabilities
+        .general
+        .and_then(|general_cap| general_cap.position_encodings)
+    {
+        Some(encodings) if encodings.contains(&PositionEncodingKind::UTF8) => {
+            Some(PositionEncodingKind::UTF8)
+        }
+        Some(_) => {
+            self.client
+                .show_message(
+                    MessageType::WARNING,
+                    "utf8 is not supported, defaulting to utf16 which may result in offset errors",
+                )
+                .await;
+            None
+        }
+        None => None,
+    };
     Ok(InitializeResult {
         server_info: Some(ServerInfo {
             name: "llm-ls".to_owned(),
             version: Some(VERSION.to_owned()),
         }),
         capabilities: ServerCapabilities {
             text_document_sync: Some(TextDocumentSyncCapability::Kind(
                 TextDocumentSyncKind::INCREMENTAL,
             )),
+            position_encoding,
             ..Default::default()
         },
     })
 }
```
Hi! Thanks for the quick response.
I think this is a separate issue from position encoding negotiation. Enforcing UTF-8 position encodings will not prevent Ropey's line count from diverging from Tree-sitter's or VSCode's, rendering `lsp::Range`s out-of-sync with the `Document`.
Regarding the `PositionEncodingKind`, we had to issue a fix for this as well in our fork.
After looking at `llm-ls`'s implementation, we noticed the mirror of the user's workspace goes out of sync and/or the server crashes when the user's document contains certain graphemes, as the current implementation doesn't translate correctly between the different position representations involved (UTF-8 byte offsets on the server side versus UTF-16 code units on the protocol side).
Here's a video showcasing 1) a crash and 2) llm-ls' mirror going out of sync due to some Unicode characters.
https://github.com/huggingface/llm-ls/assets/63368264/0107739e-746e-475d-bc06-8e2a06f95dca
Here's a test case that illustrates this:
> I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities.
This is by far the most convenient option, and we tried to go that route as well. Unfortunately, position encoding kind negotiation is a relatively new feature of the Language Server Protocol. It was introduced in 3.17.0, in 2022. As of writing, UTF-16 is the default, mandatory encoding of the protocol and must always be supported by servers.
I thought it was still worth a try, but it turns out that VSCode on my machine (the latest version) does not even support anything other than UTF-16.
```
[Info - 09:40:14] Available position encodings: Some([PositionEncodingKind("utf-16")])
```
In our case, we deemed it necessary to stick to UTF-16 for the following reasons:
If you managed to successfully negotiate UTF-8 position encodings, we'd love to hear more!
Once again, I might be wrong about some of the details of this. It's based on my current understanding of LSP and library implementations, which is quite modest.
Anyway, all this Unicode handling caused quite a few headaches on our side, so we'd be happy to upstream our tests and the implementation that handles the translation between the different encodings, if you want. We're also looking at other approaches to simplify this complexity, and if you're interested, I'm happy to collaborate on this together!
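For context, here's a std-only sketch of the core translation such an implementation has to get right (the helper name is hypothetical, not code from our fork): mapping an LSP position, a 0-based line plus a UTF-16 code-unit column, onto a byte offset in a UTF-8 document, honoring only the line endings LSP recognizes (`\n`, `\r\n`, `\r`):

```rust
// Hypothetical helper: convert an LSP (line, UTF-16 code-unit) position
// into a byte offset within `text`. Returns None if the line doesn't exist.
fn lsp_position_to_byte(text: &str, line: u32, character: u32) -> Option<usize> {
    // Locate the byte offset where the requested line starts.
    let mut line_start = 0usize;
    for _ in 0..line {
        let rest = &text[line_start..];
        let nl = rest.find(|c: char| c == '\n' || c == '\r')?;
        let mut next = line_start + nl + 1;
        // Treat `\r\n` as a single line ending.
        if rest.as_bytes()[nl] == b'\r' && rest.as_bytes().get(nl + 1) == Some(&b'\n') {
            next += 1;
        }
        line_start = next;
    }
    // Walk the line, accumulating UTF-16 code units until we reach `character`,
    // clamping at the end of the line as the LSP spec requires.
    let mut utf16 = 0usize;
    for (i, c) in text[line_start..].char_indices() {
        if utf16 >= character as usize || c == '\n' || c == '\r' {
            return Some(line_start + i);
        }
        utf16 += c.len_utf16();
    }
    Some(text.len())
}

fn main() {
    // The emoji is 2 UTF-16 code units but 4 UTF-8 bytes, so naive character
    // counting would land mid-codepoint.
    let doc = "let s = \"😀\";\nnext";
    let byte = lsp_position_to_byte(doc, 0, 11).unwrap();
    println!("byte offset: {byte}"); // 13
}
```

Multi-code-unit characters like emoji are exactly where a naive one-column-per-character mapping goes wrong, producing the out-of-sync mirrors described above.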
Happy Sunday 🤗
Thanks for the detailed response. I'd be happy to take a look at your code and more than happy to have you contribute to llm-ls.
My main concern is that the rope crate we use only supports UTF-8, so I'll have to check whether there are other rope crates that do support UTF-16.
I'm also not sure how other editors fare regarding Unicode encoding, I'll take a look.
For now let's merge the current PR.
Awesome! I'll kickstart a PR today or tomorrow.
I see the CI is not passing, it seems it's failing on `main` too, is that an issue?

> I see the CI is not passing, it seems it's failing on `main` too, is that an issue?
I'm not sure where it's coming from, just updated the CI's secret, let's see if that was the issue.
Hi @McPatate, hope you're doing good!
I was working on upstreaming our document syncing implementation but two things happened:
I'm pretty sure our current implementation is correct, but I also wouldn't bet my hand on it. Currently we have a few unit tests that ensure the documents are kept in sync by simulating different kinds of edits, but it's hard to know whether it will stand the test of thousands of user files/actions once in production.
I'd feel bad about sharing under-tested code hence I'm currently looking around the web to try to find suitable edit traces that could help us simulate real-world, long-lived, complex editing sessions in our tests and ensure that everything matches in the end.
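In the meantime, the simulation idea can be sketched like this (hypothetical names, std-only, not the actual test suite): drive a byte-indexed document model and a char-indexed one with the same reproducible stream of random edits and assert the two never diverge.

```rust
fn apply_edit_bytes(doc: &mut String, start: usize, end: usize, text: &str) {
    // `start`/`end` are byte offsets and must fall on char boundaries.
    doc.replace_range(start..end, text);
}

fn apply_edit_chars(doc: &mut Vec<char>, start: usize, end: usize, text: &str) {
    // `start`/`end` are char offsets.
    let tail = doc.split_off(end);
    doc.truncate(start);
    doc.extend(text.chars());
    doc.extend(tail);
}

// Tiny deterministic PRNG so a failing "session" can be replayed.
fn next_rand(state: &mut u64) -> usize {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    (*state >> 33) as usize
}

fn main() {
    let mut bytes_doc = String::from("héllo wörld");
    let mut chars_doc: Vec<char> = bytes_doc.chars().collect();
    let mut rng = 42u64;

    for _ in 0..1_000 {
        // Pick a random char range and a replacement, deliberately including
        // multi-byte and multi-code-unit characters.
        let len = chars_doc.len();
        let a = next_rand(&mut rng) % (len + 1);
        let b = a + next_rand(&mut rng) % (len - a + 1);
        let insert = ["", "x", "é", "a😀b", "\n"][next_rand(&mut rng) % 5];

        // Translate the char range into byte offsets for the String model.
        let to_byte =
            |doc: &String, idx: usize| doc.char_indices().nth(idx).map_or(doc.len(), |(i, _)| i);
        let (sb, eb) = (to_byte(&bytes_doc, a), to_byte(&bytes_doc, b));

        apply_edit_bytes(&mut bytes_doc, sb, eb, insert);
        apply_edit_chars(&mut chars_doc, a, b, insert);

        assert_eq!(bytes_doc, chars_doc.iter().collect::<String>());
    }
    println!("1000 random edits applied; models stayed in sync");
}
```

Real edit traces would replace the PRNG, but the invariant checked at each step stays the same.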
There are other inherent design decisions of `tower-lsp` that could lead, in our case and yours, to out-of-sync documents or completions based on outdated context[^1] in rare (and less rare) cases, so that ate up quite a bit of our time as we're looking to mitigate that too ☹️
Given all this, I've created a Gist and just dumped our `document.rs` implementation for reference.
LMK. Cheers!
[^1]: See tower-lsp#284 and deno#10437
Hey @rojas-diego, thanks for the detailed message.
I'm going to merge the PR, just ran testbed locally and lgtm. Don't hesitate to run testbed yourself in future PRs, this is the way I test `llm-ls`.
Regarding 1., I think the tests you provided should cover most use cases. Maybe try to find Unicode characters encoded with different sizes if that's not already the case. People will report bugs when they find them, and I would assume that most code out there is ASCII only, with occasional exceptions.
Regarding 2., I wouldn't worry too much about it for now. I do recall seeing strange behaviour that may be linked to such a thing, but cannot confirm as we have an encoding issue currently :)
Feel free to create a new PR and we can discuss the issues in more detail over there.
Thanks again for the effort!
Hey @rojas-diego, do you still want to contribute your gist to the project?
Hi! First of all, thanks a lot for open-sourcing this.
I was working on an internal fork of the project and noticed a potential issue. I have little experience with the LSP and Ropey so this might not be relevant. Anyway, here's the issue:
With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when trying to maintain a mirror of the user's documents, as `llm-ls`' representation might have more lines.
See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks
See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments
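To make the mismatch concrete, here's a std-only sketch (hypothetical helper names) comparing a line count that uses only the three LSP line endings against one using the wider Unicode set Ropey recognizes with its default `unicode_lines` feature:

```rust
// LSP 3.17 recognizes exactly `\n`, `\r\n`, and `\r` as line endings, while
// Ropey (by default) additionally treats U+000B, U+000C, U+0085, U+2028 and
// U+2029 as line breaks.

fn lsp_line_count(text: &str) -> usize {
    let mut lines = 1;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\n' => lines += 1,
            '\r' => {
                lines += 1;
                if chars.peek() == Some(&'\n') {
                    chars.next(); // `\r\n` counts once
                }
            }
            _ => {}
        }
    }
    lines
}

fn unicode_line_count(text: &str) -> usize {
    // Mirrors Ropey's default line-break set.
    let mut lines = 1;
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\n' | '\u{000B}' | '\u{000C}' | '\u{0085}' | '\u{2028}' | '\u{2029}' => lines += 1,
            '\r' => {
                lines += 1;
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
            }
            _ => {}
        }
    }
    lines
}

fn main() {
    // U+2028 (LINE SEPARATOR) is a line break for Ropey but not for LSP.
    let text = "first\u{2028}second\nthird";
    println!("lsp lines: {}", lsp_line_count(text)); // 2
    println!("unicode lines: {}", unicode_line_count(text)); // 3
}
```

Once the two counts disagree, every `lsp::Position` past the extra break resolves to the wrong line in the mirror, which is exactly the desync described above.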