Open Risto-Stevcev opened 3 months ago
It can replace treesitter in a better way, since it's using the real parser of the language, rather than what treesitter does which is to reimplement the parser in an often ad-hoc, incomplete, and bug ridden way. For example, this capability could be used for
Note that one of the problems tree-sitter aims to address is to allow efficient updates of the syntax tree as changes are made to a document. What you call "real" parsers are often written with batch processing in mind, where a single character change in the source requires to parse the full document again.
neovim for example integrates both, tree-sitter and LSP and allows to use both for highlighting and with some languages it is very visible that the LSP highlighting lags behind if you type somewhat quickly And that's with semantic tokens already having a somewhat compressed encoding, so you don't need to transmit all the syntax nodes in a verbose format.
I'd say that for editors that already integrate something like tree-sitter (e.g. neovim, helix, and emacs via plugin) this is not very interesting, unless it would also include type information which might open the door for more advanced structured-search-and-replace capabilities.
(E.g. concrete example: With only the syntax tree information you couldn't reliable convert a assertThat(someVar).isEqualTo(true)
to assertThat(someVar).isTrue()
where assertThat(boolean).isTrue()
exists, but assertThat(Object).isTrue()
would fail)
What you call "real" parsers are often written with batch processing in mind, where a single character change in the source requires to parse the full document again.
That's not what I said in the proposal, this proposal isn't about parsing the full document. Languages that have an LSP implementation would implement this capability by leveraging their already existing and correct parsers to parse selections of code, rather than the entire document, making it efficient. Most languages are written in C, so the parsing step is already very fast. It would also be relatively easy for most languages to add support for parsing a selection of code, and many homoiconic languages even expose it as a function.
neovim for example integrates both, tree-sitter and LSP and allows to use both for highlighting and with some languages it is very visible that the LSP highlighting lags behind if you type somewhat quickly
That's because your example would inherently involve parsing the entire document as you said, since there isn't an LSP capability like this for languages to implement.
I'd say that for editors that already integrate something like tree-sitter (e.g. neovim, helix, and emacs via plugin) this is not very interesting
This aims to resolve many of treesitters flaws. I've used and written parsers for treesitter enough times to notice that a lot of treesitter parsers either:
I've also tried to use parsers in treesitter to leverage them to build interesting programming tools, and was very exciting to use it, only to find that they're often imcomplete, incorrect, or buggy and that it wouldn't work. This proposal comes from my own experiences.
I'll add a fourth point that parsers are hard to write, and hard to write correctly. By including this kind of capability, we will know that any error in the parsing is an error in the language itself, which makes it very likely that the parsing works properly.
It really would be a shame for this proposal to just get dismissed and rejected, because people might just skim this thread without understanding the what and why of this being proposed and because tree-sitter "already exists". If that does happen, I would strongly suspect that this issue would re-emerge in a new ticket years or decades later again.
Most languages are written in C
Citation needed. (Hint: LanguageServer.jl.)
I'll add a fourth point that parsers are hard to write, and hard to write correctly.
And are usually written (or generated) with a specific purpose in mind. Tree-sitter has one such purpose: incremental parsing and efficient evaluation of queries against the parsed concrete (not abstract!) syntax tree, mostly (but not exclusively) for the purpose of syntax highlighting. Latency is king here. This is a very different requirement than generating a static AST for the purpose of cross-project "code intelligence".
It really would be a shame for this proposal to just get dismissed and rejected
I would expect that this would happen because you make blanket statements without admitting to the complexity of the problem space. TL;DR: just because a tool is bad for your purpose doesn't mean it's bad for its own stated purpose. And behind this proposal lurks the false dichotomy that you can (or should) only use one parser technology for all purposes rather than the right tool for the right job.
we will know that any error in the parsing is an error in the language itself, which makes it very likely that the parsing works properly.
Oh, and the hidden assumption that language servers always parse everything perfectly. That's some powerful wishful thinking there, that not at all aligns with my experience. (Also, a number of language servers actually use tree-sitter under the hood for parsing. How does that fit into your argument?)
(All this is to say that your proposal for a new request would probably get a better reception if you didn't wrap it in an invective against other projects -- which isn't that even relevant for the proposal itself.)
That's not what I said in the proposal, this proposal isn't about parsing the full document. Languages that have an LSP implementation would implement this capability by leveraging their already existing and correct parsers to parse selections of code, rather than the entire document, making it efficient.
Highlighting is a case where incremental parsing can be used and where you can see today that the LSP version is often slower. This of course depends on the language server. Some do support incremental parsing too and are fast, others are based on parsers that were designed with batch processing in mind, and those do have to re-parse the full document to handle text edits. That's slower.
And parsing selections of the code isn't always easy - but that depends a lot on the language's grammar. (https://matklad.github.io/2023/05/21/resilient-ll-parsing-tutorial.html covers some of the problems)
Reading that article should also make it somewhat apparent that any proposal in regards to exposing a AST would have to specify how to deal with documents that contain syntax errors and can't be parsed completely. Code while being edited is often incomplete.
Most languages are written in C, so the parsing step is already very fast. It would also be relatively easy for most languages to add support for parsing a selection of code, and many homoiconic languages even expose it as a function.
Many compilers aim to be self-hosted. And language compiler != language-server Just take a look at the implementation languages in https://microsoft.github.io/language-server-protocol/implementors/servers/
It really would be a shame for this proposal to just get dismissed and rejected, because people might just skim this thread without understanding the what and why of this being proposed and because tree-sitter "already exists". If that does happen, I would strongly suspect that this issue would re-emerge in a new ticket years or decades later again.
This is partly why I mentioned that for editors like neovim the proposal would be a lot more interesting if it included type information. Tree-sitter can't provide that on its own so instead of being an alternative with different trade-offs, it would enable new use-cases.
All this is to say that your proposal for a new request would probably get a better reception if you didn't wrap it in an invective against other projects
This proposal is designed to address the limitations of treesitter, that's why I mentioned it's limitations.
I don't feel like it would be productive to continue arguing with either of you, I don't feel like either of you are acting in good faith for some odd reason. I'll let other people chime in and decide what they want to do with this ticket.
I don't feel like either of you are acting in good faith for some odd reason.
Funny, that's exactly the impression I got from your comments. I don't see what the limitations of tree-sitter have to do with the language server protocol as these address different use cases.
This is admittedly a pretty bizarre exchange for me, usually this doesn't happen when I write or comment on tickets. I don't really get it, but whatever.
I won't comment on the rest of this thread, I'll let others chime in and decide what they want to do with the ticket. I just don't want the actual idea of the ticket to get derailed by this since it's irrelevant to the actual idea itself.
I'll just add one other thing that language servers would have to implement besides parsing regions, which is incremental parsing. Though I think incremental parsing could be optional because the capability would still be really useful with regions alone for various tasks.
I don't have a strong opinion on standardizing an AST request, but for reference, clangd supports one as an extension.
It is not incremental or anything fancy like that; it requires that the document be already opened with didOpen
(clangd requires this for all document-scoped requests), in which case clangd has already built an AST for the full document (or if the AST build is not finished, requests that need it typically block on its completion), and it just walks the existing AST to provide the subtree that encloses the selection.
It was initially written to help debug the server by visualizing AST fragments in the client in a simple tree view, but it has since found use by some users wishing to extend the capabilities of the language server who find it easier to patch a client (e.g. because the client has an extensible plugin system) than to patch the server.
I'm having trouble seeing how a client could make use of this hypothetical generic AST. You give "syntax highlighting", "code folding", and "auto indentation" as examples of algorithms that could be implemented using it, but each of these is already an operation in the core LSP---and each involves a lot of language-specific logic/policy/opinions on top of the raw AST. You also claim it can be used for "any sort of static analysis of the syntax of the code", but I'm not convinced: even the most trivial of static analyses (formatting, organize imports) is highly language specific.
Abstracting away from the AST is why language servers exist. This proposal is in diametric opposition.
Hi folks,
I don't have any particular opinion about whether this proposal would be useful, but I'd like to share our experience in the coq-lsp server, where we have implemented a similar call.
In particular, most users of the coq/getDocument
feature, which returns the document splitted into sentences (and their associated) AST just want to determine "sentence" boundaries, so they go ahead and run actions, for example, pass the content of each sentence to a linter, etc...
More advanced users don't need the AST, but often an abstracted version of it. For example, CoqPilot wants the list of theorems and holes on them, etc...
So indeed, I am not sure any of our use cases are generalizable enough to warrant an LSP extension, but maybe a "split document into logical ranges" call could be considered, but IMHO only weakly.
Hope this helps, cheers.
maybe a "split document into logical ranges" call could be considered, but IMHO only weakly.
"Split document into logical ranges" is essentially what the LSP folding range operation already does.
"Split document into logical ranges" is essentially what the LSP folding range operation already does.
Actually in our case folding ranges is close but doesn't correspond to the splitting users need. They have been using selectionRange
, but indeed, it was a hack, and we implemented our own call.
Happy to provide details of document models where folding ranges don't correspond to sentence ranges if you folks are curious, but in the Coq/Rocq case it is easy to see:
Theorem easy_arith : 2 + 2 = 4.
Proof. simpl. reflexivity. Qed.
In this case we have 5 sentences and 1 folding range.
This is a proposal to add a
textDocument/ast
to the LSP protocol. The idea is to have a capability that returns an AST from a selection.The following exist but don't actually result an AST:
textDocument/documentSymbol
textDocument/semanticTokens
textDocument/definition
andtextDocument/references