Feature Request: Support stack-graphs to provide code nav in the absence of LSP, building on tree-sitter

EpocSquadron commented 2 years ago

What are stack-graphs

Stack-graphs were developed by Github to provide code navigation features to any of the ~500 programming languages hosted in Github repositories (although they have only defined it for python currently). Examples of such features are "go to definition" and "go to references". It builds on the information obtained from tree-sitter grammars to create incrementally-updated look-up graphs that allow sub-100ms navigation of named symbols.

Why would we want them in Helix?

Helix at this time relies exclusively on tree-sitter to provide basic highlighting and other more advanced features (such as selection by scope, which is still being worked into the default keymap). The conceit is that sourcing or writing a tree-sitter grammar for a language should be the preferred way of implementing support for a language for Helix. If we are to keep with this preference, we should endeavour to reap the maximum benefits of having a tree-sitter parsed syntax tree available for editor functions. This seems to be part of what sets helix apart from other editors.

Helix supports LSP for those features, why not rely on that?

There are a number of reasons:

Resource constraints

Many LSP servers are notoriously memory and cpu intensive. Depending on the kind of development machine you are running on, that may or may not be an issue even for one running LSP server. However, in heterogenous code environments (consider a website that uses typescript on the frontend, PHP on the backend and ruby for it's build system or to provide certain endpoints), or in environments where you are working on a number of projects throughout the day, whether or not all in the same language, it can quickly become limiting to rely on LSP in every case. It will likely be a concious decision to limit to a single LSP for the most used language simply to free up resources.

Not all languages have functional LSPs

Fairly recently, there was a lot of effort to utilize Julia's LSP server in the default configuration of Helix, but it was found that not only was the server excruciatingly slow to start up, but it wasn't entirely conformant or complete. There are certainly other examples of LSPs that only provide partial support for the standard. In those cases, it would be simpler to create a ruleset for stack-graphs to at least provide code navigation, rather than write or improve an entire LSP implementation.

Not all languages should have an LSP

Especially in the web sphere, there are many small domain-specific languages that are less a complete language and more of a configuration format, but which nonetheless support definitions and references of discrete portions of the configuration, and/or drop-in files to organize configuration across a number of files. Apache, nginx and varnish configuration files in particular come to mind. These will likely never get LSP servers written, as the benefit is too small, but it is certainly possible to write a tree-sitter grammer and stack-graph ruleset to make editing those files in Helix more ergonomic.

What else could be derived?

Besides go to definition (gd) and go to references (gr), it seems conceivable that a complete stack-graph can be used to produce a full list of symbols in file, and even workspace symbols.
Go to type definition (gy) and go to implementation (gi) are notably not talked about in the stack-graphs intro talk, but seem like they could be added down the road if not already available.
Having all of the information on where symbols are defined and used means we could theoretical implement an LSP- and language-agnostic version of rename-symbol.
"show docs under cursor" (<space>k) could theoretically utilize definition information and provide that as the popup information, similar to what github does with stack-graphs (try it on a python repo like flask to see what i mean).

What's involved in bringing it to Helix

First of all, the code behind the github feature is already open sourced as a Rust crate! Perusing the docs.rs for it, it seems only to provide the core functionality of incrementally creating the graph and querying it. This means we will need to take care of crawling the project (github uses the git trees to discover these) for all files of that type and generating the intial partial graphs. This should be cached somewhere (perhaps in a .helix in the project root directory, or in something like $XDG_RUNTIME_DIR/heix/path). As files are worked on in the editor, the partial graphs would be updated for that file in real time, then as functionality was invoked, we could query the full graph set to get the answer (this is the part that is always within 100ms in Github's testing).

Once functionality is present, we would need only find or implement support for each language with a ruleset. I'm not sure where Github stores the ruleset for python or if they intend to share what they develop for each language. I would assume they will, as they specifically called out taking advantage of the tree-sitter parses ecosystem. This would live in the runtime directory alongside the tree-sitter queries.

kirawi commented 2 years ago

This should be cached somewhere (perhaps in a .helix in the project root directory, or in something like $XDG_RUNTIME_DIR/heix/path).

I think some people want to avoid auto-generating folders. This might be better solved by #401

archseer commented 2 years ago

There's ~/.local/cache/helix, we can make it path specific.

archseer commented 2 years ago

We also want to use this together with the tree-sitter integration: https://github.com/tree-sitter/tree-sitter-graph

Some work needs to be done so that we also parse and construct these trees for not just the current project, but also the standard library and any imported libraries.

sudormrfbin commented 2 years ago

it seems conceivable that a complete stack-graph can be used to produce a full list of symbols in file, and even workspace symbols

This is already possible with tree-sitter-tags btw.

EpocSquadron commented 2 years ago

This is already possible with tree-sitter-tags btw.

Good to know! Does it do workspace symbols though?

sudormrfbin commented 2 years ago

Currently the crate takes a set of queries specifying which nodes are symbols and an input file and spits out the position of the symbols (so document symbols in essence) and if we do this for every file in the tree we would get workspace symbols.

the-mikedavis commented 2 years ago

I think this is a good idea, in some cases I might even prefer to turn off an LSP and just use tags-based navigation.

As @sudormrfbin says, stack-graphs is not necessary, it's only used by GitHub to power precise navigation for python. Basic (less accurate) navigation can be done with a tags.scm query and the tree-sitter-tags crate, although the way tree-sitter queries for tags is almost exactly the same as how it queries for highlights, so maybe some code could be reused between them.

I think basic tags are not accurate enough to be used reliably for go-to-definition, so IMO it's not worth parsing dependencies or the standard lib of a language. Being an alternate source of data for (workspace) symbols seems feasible though.

pokey commented 2 years ago

Maybe you've already found them, but here is a link to the draft Typescript graph definitions, in case helpful 😊

kirawi commented 1 year ago

I took a look at stack-graphs for Helix, and I don't feel like it's a good addition for Helix because of how complex it is to define queries while not providing a reasonable amount of accuracy. I particularly dislike the fact that .tsg introduces a DSL. However, I feel that tags.scm, locals.scm, and stack-graphs.tsg could be coalesced into a single query that could be used for all three use-cases. I liked how the indent queries were reworked in #1562 and #3382, so I feel like taking a similar path would be the best approach.

We could refer to GitHub's implementation of the stack-graphs algorithm, though I feel it may need to be modified a bit to better suit Helix. For one, I don't like how it utilizes insert-only arenas for the stack. This could be mitigated by creating partial graphs per file, writing them to disk, and then using memory-mapping to merge them into a single graph on the fly. That itself could be further optimized by utilizing the algorithm described in https://blog.burntsushi.net/transducers/#fsa-construction, though the improvement may be negligible because of the difference in scales.

kirawi commented 1 year ago

I think we should come up with a list of benefits from supporting stack-graphs. I feel that people should use the LSP for complex languages like TypeScript or Python.

Semantic syntax highlighting.
Navigation for simple languages.

jphamina commented 1 year ago

I have one additional argument for offering an LSP alternative to code navigation that hasn't yet been mentioned in this issue which is codebases that are difficult for LSP to process.

As my day job I'm working as a developer on an embedded C codebase that has:

Over 100 build targets, all with a different set compiler and preprocessor flags
Code for multiple processor cores being built in one compilation
Make based build system with some headers being generated build time
More than 3 compilers being used. Some clang based and others gcc based.
Very different target hardware environments, unit testing, simulator, emulator, FPGA and SoC

I've been fiddling with Helix lately since I'm interested in switching to a terminal based editor and I very much like the selection -> action over the action -> selection that vim uses. However, I have been unable to generate a proper compile_commands.json (with tools such as bear and compiledb) for even one of our supported build targets so far. Also, I assume that even if I'd manage to generate a full compile_commands.json file for the entire build it would be lacking for my purposes since 'go to definition' for example would just go to the definition that LSP sees as the correct definition for the build target whose compile_commands.json is active at that point. Instead, I'd prefer to be able to choose myself which definition out of the different target/simulation/UT functions I would like to go to.

Another (C++, not embedded this time but hardware testing related) codebase that I'm occasionally working on uses a proprietary compiler along with a proprietary build system. I don't think tools exist for generating compile_commands.json for this one. Fortunately the 'go to definition' of sublime text usually works here and when it fails there's always good old grep.

As a final disclaimer I don't have experience on how well stack-graphs in particular would work for my purposes but I know that Cscope, Ctags and the like work decently. For me personally the option of being able to use tag based code navigation over LSP would be the final push to allow me to use helix professionally.

daedroza commented 1 year ago

I have one additional argument for offering an LSP alternative to code navigation that hasn't yet been mentioned in this issue which is codebases that are difficult for LSP to process.

As my day job I'm working as a developer on an embedded C codebase that has:
* Over 100 build targets, all with a different set compiler and preprocessor flags

* Code for multiple processor cores being built in one compilation

* Make based build system with some headers being generated build time

* More than 3 compilers being used. Some clang based and others gcc based.

* Very different target hardware environments, unit testing, simulator, emulator, FPGA and SoC
I've been fiddling with Helix lately since I'm interested in switching to a terminal based editor and I very much like the selection -> action over the action -> selection that vim uses. However, I have been unable to generate a proper compile_commands.json (with tools such as bear and compiledb) for even one of our supported build targets so far. Also, I assume that even if I'd manage to generate a full compile_commands.json file for the entire build it would be lacking for my purposes since 'go to definition' for example would just go to the definition that LSP sees as the correct definition for the build target whose compile_commands.json is active at that point. Instead, I'd prefer to be able to choose myself which definition out of the different target/simulation/UT functions I would like to go to.

Another (C++, not embedded this time but hardware testing related) codebase that I'm occasionally working on uses a proprietary compiler along with a proprietary build system. I don't think tools exist for generating compile_commands.json for this one. Fortunately the 'go to definition' of sublime text usually works here and when it fails there's always good old grep.

As a final disclaimer I don't have experience on how well stack-graphs in particular would work for my purposes but I know that Cscope, Ctags and the like work decently. For me personally the option of being able to use tag based code navigation over LSP would be the final push to allow me to use helix professionally.

This exact thing I have achieved on neovim. I want to select which implementation/reference I would like to go to. And I wish if this degree of integration is present for ctags in helix, really willing to try!

lianghu commented 7 months ago

this is the killer feature for embedded code development targeting different hardwares with different compile configurations. pretty much like jphamina's situation https://github.com/helix-editor/helix/issues/1252#issuecomment-1509200237

helix-editor / helix