Proposal to add textDocument/ast

microsoft / language-server-protocol

Defines a common protocol for language servers.

https://microsoft.github.io/language-server-protocol/

Creative Commons Attribution 4.0 International

11.31k stars 804 forks source link

Proposal to add textDocument/ast #1998

Open Risto-Stevcev opened 3 months ago

Risto-Stevcev commented 3 months ago

This is a proposal to add a textDocument/ast to the LSP protocol. The idea is to have a capability that returns an AST from a selection.

The specification can specify naming and formatting conventions for various tokens and structures that are common across various language ASTs, to maximize code reuse, while allowing for any language AST to include any additional structures that are unique to the language. It would return the AST in JSON format.
It can replace treesitter in a better way, since it's using the real parser of the language, rather than what treesitter does which is to reimplement the parser in an often ad-hoc, incomplete, and bug ridden way. For example, this capability could be used for:
- Robust, effectively error prone, syntax highlighting.
- Code folding.
- Auto-indentation that never messes up.
- Any other thing tree sitter does.
In general, it can also be used for any sort of static analysis of the syntax of the code, which could be combined with other LSP calls to do really interesting static analysis on the editor side, to add really useful features in the future.

The following exist but don't actually result an AST:

textDocument/documentSymbol
- Provides a hierarchical representation of the symbols in a document, such as classes, methods, and variables. While this doesn't give you a full AST, it offers a tree-like structure representing the code’s high-level organization.
textDocument/semanticTokens
- Returns token-level semantic information about the document. This is useful for syntax highlighting and understanding the roles of various elements in the code but is not a direct representation of the AST.
textDocument/definition and textDocument/references
- These requests return locations in the code related to a symbol, providing context-aware navigation that relies on an understanding of the code's structure.

mfussenegger commented 3 months ago

It can replace treesitter in a better way, since it's using the real parser of the language, rather than what treesitter does which is to reimplement the parser in an often ad-hoc, incomplete, and bug ridden way. For example, this capability could be used for

Note that one of the problems tree-sitter aims to address is to allow efficient updates of the syntax tree as changes are made to a document. What you call "real" parsers are often written with batch processing in mind, where a single character change in the source requires to parse the full document again.

neovim for example integrates both, tree-sitter and LSP and allows to use both for highlighting and with some languages it is very visible that the LSP highlighting lags behind if you type somewhat quickly And that's with semantic tokens already having a somewhat compressed encoding, so you don't need to transmit all the syntax nodes in a verbose format.

I'd say that for editors that already integrate something like tree-sitter (e.g. neovim, helix, and emacs via plugin) this is not very interesting, unless it would also include type information which might open the door for more advanced structured-search-and-replace capabilities. (E.g. concrete example: With only the syntax tree information you couldn't reliable convert a assertThat(someVar).isEqualTo(true) to assertThat(someVar).isTrue() where assertThat(boolean).isTrue() exists, but assertThat(Object).isTrue() would fail)

Risto-Stevcev commented 3 months ago

What you call "real" parsers are often written with batch processing in mind, where a single character change in the source requires to parse the full document again.

That's not what I said in the proposal, this proposal isn't about parsing the full document. Languages that have an LSP implementation would implement this capability by leveraging their already existing and correct parsers to parse selections of code, rather than the entire document, making it efficient. Most languages are written in C, so the parsing step is already very fast. It would also be relatively easy for most languages to add support for parsing a selection of code, and many homoiconic languages even expose it as a function.

neovim for example integrates both, tree-sitter and LSP and allows to use both for highlighting and with some languages it is very visible that the LSP highlighting lags behind if you type somewhat quickly

That's because your example would inherently involve parsing the entire document as you said, since there isn't an LSP capability like this for languages to implement.

I'd say that for editors that already integrate something like tree-sitter (e.g. neovim, helix, and emacs via plugin) this is not very interesting

This aims to resolve many of treesitters flaws. I've used and written parsers for treesitter enough times to notice that a lot of treesitter parsers either:

Don't parse parts of the AST that are hard to write in treesitter
Have bugs that cause treesitter to produce garbage and fail midway through or incorrectly parse something
There isn't even a treesitter parser for the language

I've also tried to use parsers in treesitter to leverage them to build interesting programming tools, and was very exciting to use it, only to find that they're often imcomplete, incorrect, or buggy and that it wouldn't work. This proposal comes from my own experiences.

I'll add a fourth point that parsers are hard to write, and hard to write correctly. By including this kind of capability, we will know that any error in the parsing is an error in the language itself, which makes it very likely that the parsing works properly.

It really would be a shame for this proposal to just get dismissed and rejected, because people might just skim this thread without understanding the what and why of this being proposed and because tree-sitter "already exists". If that does happen, I would strongly suspect that this issue would re-emerge in a new ticket years or decades later again.

clason commented 3 months ago

Most languages are written in C

Citation needed. (Hint: LanguageServer.jl.)

I'll add a fourth point that parsers are hard to write, and hard to write correctly.

And are usually written (or generated) with a specific purpose in mind. Tree-sitter has one such purpose: incremental parsing and efficient evaluation of queries against the parsed concrete (not abstract!) syntax tree, mostly (but not exclusively) for the purpose of syntax highlighting. Latency is king here. This is a very different requirement than generating a static AST for the purpose of cross-project "code intelligence".

It really would be a shame for this proposal to just get dismissed and rejected

I would expect that this would happen because you make blanket statements without admitting to the complexity of the problem space. TL;DR: just because a tool is bad for your purpose doesn't mean it's bad for its own stated purpose. And behind this proposal lurks the false dichotomy that you can (or should) only use one parser technology for all purposes rather than the right tool for the right job.

we will know that any error in the parsing is an error in the language itself, which makes it very likely that the parsing works properly.

Oh, and the hidden assumption that language servers always parse everything perfectly. That's some powerful wishful thinking there, that not at all aligns with my experience. (Also, a number of language servers actually use tree-sitter under the hood for parsing. How does that fit into your argument?)

clason commented 3 months ago

(All this is to say that your proposal for a new request would probably get a better reception if you didn't wrap it in an invective against other projects -- which isn't that even relevant for the proposal itself.)

mfussenegger commented 3 months ago

That's not what I said in the proposal, this proposal isn't about parsing the full document. Languages that have an LSP implementation would implement this capability by leveraging their already existing and correct parsers to parse selections of code, rather than the entire document, making it efficient.

Highlighting is a case where incremental parsing can be used and where you can see today that the LSP version is often slower. This of course depends on the language server. Some do support incremental parsing too and are fast, others are based on parsers that were designed with batch processing in mind, and those do have to re-parse the full document to handle text edits. That's slower.

And parsing selections of the code isn't always easy - but that depends a lot on the language's grammar. (https://matklad.github.io/2023/05/21/resilient-ll-parsing-tutorial.html covers some of the problems)

Reading that article should also make it somewhat apparent that any proposal in regards to exposing a AST would have to specify how to deal with documents that contain syntax errors and can't be parsed completely. Code while being edited is often incomplete.

Most languages are written in C, so the parsing step is already very fast. It would also be relatively easy for most languages to add support for parsing a selection of code, and many homoiconic languages even expose it as a function.

Many compilers aim to be self-hosted. And language compiler != language-server Just take a look at the implementation languages in https://microsoft.github.io/language-server-protocol/implementors/servers/

It really would be a shame for this proposal to just get dismissed and rejected, because people might just skim this thread without understanding the what and why of this being proposed and because tree-sitter "already exists". If that does happen, I would strongly suspect that this issue would re-emerge in a new ticket years or decades later again.

This is partly why I mentioned that for editors like neovim the proposal would be a lot more interesting if it included type information. Tree-sitter can't provide that on its own so instead of being an alternative with different trade-offs, it would enable new use-cases.

Risto-Stevcev commented 3 months ago

All this is to say that your proposal for a new request would probably get a better reception if you didn't wrap it in an invective against other projects

This proposal is designed to address the limitations of treesitter, that's why I mentioned it's limitations.

I don't feel like it would be productive to continue arguing with either of you, I don't feel like either of you are acting in good faith for some odd reason. I'll let other people chime in and decide what they want to do with this ticket.

clason commented 3 months ago

I don't feel like either of you are acting in good faith for some odd reason.

Funny, that's exactly the impression I got from your comments. I don't see what the limitations of tree-sitter have to do with the language server protocol as these address different use cases.

Risto-Stevcev commented 3 months ago

This is admittedly a pretty bizarre exchange for me, usually this doesn't happen when I write or comment on tickets. I don't really get it, but whatever.

I won't comment on the rest of this thread, I'll let others chime in and decide what they want to do with the ticket. I just don't want the actual idea of the ticket to get derailed by this since it's irrelevant to the actual idea itself.

I'll just add one other thing that language servers would have to implement besides parsing regions, which is incremental parsing. Though I think incremental parsing could be optional because the capability would still be really useful with regions alone for various tasks.

HighCommander4 commented 3 months ago

I don't have a strong opinion on standardizing an AST request, but for reference, clangd supports one as an extension.

It is not incremental or anything fancy like that; it requires that the document be already opened with didOpen (clangd requires this for all document-scoped requests), in which case clangd has already built an AST for the full document (or if the AST build is not finished, requests that need it typically block on its completion), and it just walks the existing AST to provide the subtree that encloses the selection.

It was initially written to help debug the server by visualizing AST fragments in the client in a simple tree view, but it has since found use by some users wishing to extend the capabilities of the language server who find it easier to patch a client (e.g. because the client has an extensible plugin system) than to patch the server.

adonovan commented 1 month ago

I'm having trouble seeing how a client could make use of this hypothetical generic AST. You give "syntax highlighting", "code folding", and "auto indentation" as examples of algorithms that could be implemented using it, but each of these is already an operation in the core LSP---and each involves a lot of language-specific logic/policy/opinions on top of the raw AST. You also claim it can be used for "any sort of static analysis of the syntax of the code", but I'm not convinced: even the most trivial of static analyses (formatting, organize imports) is highly language specific.

Abstracting away from the AST is why language servers exist. This proposal is in diametric opposition.

ejgallego commented 1 month ago

Hi folks,

I don't have any particular opinion about whether this proposal would be useful, but I'd like to share our experience in the coq-lsp server, where we have implemented a similar call.

In particular, most users of the coq/getDocument feature, which returns the document splitted into sentences (and their associated) AST just want to determine "sentence" boundaries, so they go ahead and run actions, for example, pass the content of each sentence to a linter, etc...

More advanced users don't need the AST, but often an abstracted version of it. For example, CoqPilot wants the list of theorems and holes on them, etc...

So indeed, I am not sure any of our use cases are generalizable enough to warrant an LSP extension, but maybe a "split document into logical ranges" call could be considered, but IMHO only weakly.

Hope this helps, cheers.

adonovan commented 1 month ago

maybe a "split document into logical ranges" call could be considered, but IMHO only weakly.

"Split document into logical ranges" is essentially what the LSP folding range operation already does.

ejgallego commented 1 month ago

"Split document into logical ranges" is essentially what the LSP folding range operation already does.

Actually in our case folding ranges is close but doesn't correspond to the splitting users need. They have been using selectionRange, but indeed, it was a hack, and we implemented our own call.

Happy to provide details of document models where folding ranges don't correspond to sentence ranges if you folks are curious, but in the Coq/Rocq case it is easy to see:

Theorem easy_arith : 2 + 2 = 4.
Proof. simpl. reflexivity. Qed.

In this case we have 5 sentences and 1 folding range.