Disable ropey `unicode_lines` feature

rojas-diego commented 10 months ago

Hi! First of all, thanks a lot for open-sourcing this.

I was working on an internal fork of the project and noticed a potential issue. I have little experience with the LSP and Ropey so this might not be relevant. Anyway, here's the issue:

With the current configuration, Ropey recognises more EOL sequences than the Language Server Protocol. This mismatch can lead to errors when trying to maintain a mirror of the user's documents as the llm-ls' representation might have more lines.

See: https://docs.rs/ropey/1.6.0/ropey/index.html#a-note-about-line-breaks See: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocuments

McPatate commented 10 months ago

I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities. We could issue a warning to the client saying there might be some offsetting errors when utf-16 is the only supported format. It could look something like this:

    async fn initialize(&self, params: InitializeParams) -> LspResult<InitializeResult> {
        *self.workspace_folders.write().await = params.workspace_folders;
+      let position_encoding = params.capabilities.general.and_then(|general_cap| {
+           general_cap.position_encodings.and_then(|encodings| {
+               if encodings.contains(&PositionEncodingKind::UTF8) {
+                   Some(PositionEncodingKind::UTF8)
+               } else {
+                    self.client.show_message(MessageType::WARNING, "utf8 is not supported, defaulting to utf16 which may result in offset errors").await;
+                   None
+               }
+           })
+       });
        Ok(InitializeResult {
            server_info: Some(ServerInfo {
                name: "llm-ls".to_owned(),
                version: Some(VERSION.to_owned()),
            }),
            capabilities: ServerCapabilities {
                text_document_sync: Some(TextDocumentSyncCapability::Kind(
                    TextDocumentSyncKind::INCREMENTAL,
                )),
+               position_encoding,
                ..Default::default()
            },
        })
    }

rojas-diego commented 10 months ago

Hi! Thanks for the quick response.

Ropey Line Breaks

I think this is a separate issue from position encoding negotiation. Enforcing UTF8 position encodings will not prevent Ropey's line count from diverging from Tree Sitter's or VSCode's, rendering lsp::Ranges out-of-sync with the Document.

Issues with Encoding

Regarding the PositionEncodingKind, we had to issue a fix for this as well in our fork.

After looking at llm-ls's implementation, we noticed the mirror of the user's workspace goes out of sync and/or the server crashes when the user's document contains certain graphemes as the current implementation doesn't translate well between:

LSP UTF-16 character offsets
TreeSitter's byte offsets
Ropey's characters offsets (read Unicode code points)

Here's a video showcasing 1) a crash and 2) llm-ls' mirror going out of sync due to some unicode characters.

https://github.com/huggingface/llm-ls/assets/63368264/0107739e-746e-475d-bc06-8e2a06f95dca

Here's a test case that illustrates this:

Long Test Case for `Document::change`

**Case:** ```rust mod test { use tower_lsp::lsp_types::Position; use tree_sitter::Node; #[allow(unused_imports)] use super::*; #[tokio::test] async fn test_document_change_tree_consistency_medium() { let a = "let a = '🥸 你好';\rfunction helloWorld() { return '🤲🏿'; }\nlet b = 'Hi, 😊';"; let mut document = Document::open("javascript", a).await.unwrap(); document .change(Range::new(Position::new(0, 14), Position::new(2, 13)), "，") .await .unwrap(); let b = "let a = '🥸 你好，😊';"; assert_eq!(document.text.to_string(), b); let mut parser = Parser::new(); parser .set_language(tree_sitter_javascript::language()) .unwrap(); let b_tree = parser.parse(b, None).unwrap(); assert!(nodes_are_equal_recursive( &document.tree.unwrap().root_node(), &b_tree.root_node() )); } #[allow(dead_code)] fn nodes_are_equal_recursive(node1: &Node, node2: &Node) -> bool { if node1.kind() != node2.kind() { return false; } if node1.start_byte() != node2.start_byte() { return false; } if node1.end_byte() != node2.end_byte() { return false; } if node1.start_position() != node2.start_position() { return false; } if node1.end_position() != node2.end_position() { return false; } if node1.child_count() != node2.child_count() { return false; } for i in 0..node1.child_count() { let child1 = node1.child(i).unwrap(); let child2 = node2.child(i).unwrap(); if !nodes_are_equal_recursive(&child1, &child2) { return false; } } true } } ``` **Output:** ``` running 1 test test document::test::test_document_change_tree_consistency_medium ... FAILED failures: ---- document::test::test_document_change_tree_consistency_medium stdout ---- thread 'document::test::test_document_change_tree_consistency_medium' panicked at 'assertion failed: `(left == right)` left: `"let a = '🥸 你好'，😊';"`, right: `"let a = '🥸 你好，😊';"`', crates/llm-ls/src/document.rs:293:9 note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ```

Enforcing UTF-8 Position Encoding

I think the fix here should rather be to attempt to set the encoding to utf8 in the initialization call where server and client exchange capabilities.

This is by far the most convenient option, and we tried to go that route as well. Unfortunately, position encoding kind negotiation is a relatively new feature of the Language Server Protocol. It was introduced in 3.17.0, in 2022. As of writing, UTF-16 is the default, mandatory encoding of the protocol and must always be supported by servers.

I thought it was still worth a try but it turns out that the VSCode on my machine, the latest version, does not even support anything other than UTF-16.

[Info  - 09:40:14] Available position encodings: Some([PositionEncodingKind("utf-16")])

In our case, we deemed it was necessary to stick to UTF-16 for the following reasons:

We put all this effort into maintaining a perfect replica of the user's workspace to get accurate syntax trees to provide relevant completions at key points of the user's file. If we allow the user's file and our representation to become out of sync, we loose this ability and we would be better off with using regexes in the end.
Looking at VSCode's language client code, it doesn't seem to support UTF-16, concurring the experience I had on my machine. We didn't want to fight against the protocol and assumed many other clients would lack support for UTF-8 too.

If you managed to successfully negotiate UTF-8 position encodings, we'd love to hear more!

Once again, I might be wrong about some of the details of this. It's based off my current understanding of LSP and library implementations which is quite modest.

Anyway, all this Unicode handling caused quite the headaches on our side so we'd be happy to upstream our tests and implementation which handles the translation between the different encodings if you want. We're also looking at other approaches to simplify this complexity and if you're interested, I'm happy to collaborate on this together!

Bon Dimanche 🤗

McPatate commented 10 months ago

Thanks for the detailed response. I'd be happy to take a look at your code and more than happy to have you contribute to llm-ls.

My main concern is that the rope crate only supports utf8, so I'll have to check if there are other rope crates that do support utf16.

I'm also not sure how other editors fair regarding Unicode encoding, I'll take a look.

For now let's merge the current PR.

rojas-diego commented 10 months ago

Awesome! I'll kickstart a PR today or tomorrow.

I see the CI is not passing, it seems it's failing on main too, is that an issue?

McPatate commented 10 months ago

I see the CI is not passing, it seems it's failing on main too, is that an issue?

I'm not sure where it's coming from, just updated the CI's secret, let's see if that was the issue.

rojas-diego commented 10 months ago

Hi @McPatate, hope you're doing good!

I was working on upstreaming our document syncing implementation but two things happened:

1. Didn't get around to thoroughly test it

I'm pretty sure our current implementation is correct but I also wouldn't bet my hand on it. Currently we have a few unit tests that ensure the documents are kept in sync by simulating different kind of edits but it's hard to know whether it will stand the test of thousand of user files/actions once in production.

I'd feel bad about sharing under-tested code hence I'm currently looking around the web to try to find suitable edit traces that could help us simulate real-world, long-lived, complex editing sessions in our tests and ensure that everything matches in the end.

2. Uncovered additional issues this time relating to `tower-lsp`

There are other inherent design decisions of tower-lsp that could lead in our case and yours to out-of-sync-documents or completions based on outdated context[^1] in rare and less rare cases so that ate up quite a bit of our time as we are looking to mitigate that too ☹️

Given all this, I've created a Gist and just dumped our document.rs implementation for reference.

If it looks good to you, I'll make a PR to integrate it
If you feel like it would benefit from more testing, I'll circle back once we found a scalable way to test it

LMK. Cheers!

[^1]: See tower-lsp#284 deno#10437

McPatate commented 9 months ago

Hey @rojas-diego, thanks for the detailed message.

I'm going to merge the PR, just ran testbed locally and lgtm. Don't hesitate to run testbed yourself in future PRs, this is the way I test llm-ls.

Regarding 1., I think the tests you provided should cover most use cases. Maybe try to find unicode characters encoded with different sizes if it's not already the case. People will report bugs when they found them and I would assume that most code out there is ASCII only, with some occasional exception.

Regarding 2., I wouldn't worry too much about it for now. I do recall seeing strange behaviour that may be linked to such a thing, but cannot confirm as we have an encoding issue currently :)

Feel free to create a new PR and we can discuss the issues in more detail over there.

Thanks again for the effort!

McPatate commented 7 months ago

Hey @rojas-diego, do you still want to contribute your gist to the project?

huggingface / llm-ls