Closed stumash closed 3 years ago
/cc @dcreager
Hi @stumash! There are a couple of GitHub features that can use a language's tree-sitter grammar, with more in the pipeline. The intent is that the features build on one another, with a hopefully reasonable amount of work for each one.
The first is syntax highlighting. Most languages are still syntax-highlighted on GitHub using TextMate grammars, as defined by the linguist project. We've started moving languages over to parser-based syntax highlighting using tree-sitter. There's a pretty good set of documentation for writing syntax highlight rules in this repo using tree-sitter's query syntax. Once you have the `tree-sitter highlight` command generating syntax highlighting that you're happy with, we add the grammar repo to our internal syntax highlighting service and you're done.
The next step would be Code Navigation. What's currently deployed on GitHub is "fuzzy" or "ctags-like" Code Nav, where we do a very simple textual match of symbol names. We use the same query language to define the rules for extracting definitions and references from a source file, which we call "tagging". (@maxbrunsfeld, I'm realizing we don't have a nice dedicated docs section for tagging like we do for highlighting.) The `tree-sitter tags` command looks for tagging queries to generate symbol lists, much like the `tree-sitter highlight` command does for syntax highlighting. Once you've got `tree-sitter tags` producing symbol lists you're happy with, we'd add the grammar to our Code Nav service, start indexing All The Repos, and Code Nav should light up for Scala code.
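To make the "fuzzy" matching concrete, here is a minimal sketch in Python. It is not how GitHub's Code Nav service is implemented: the regular expressions are a hypothetical stand-in for real tagging queries, and the names `tag_source`, `build_index`, and `find_definitions` are made up for illustration. It only shows the idea of collecting definition and reference tags and then resolving a symbol by name alone.

```python
import re
from collections import defaultdict

# Assumption: a regex stands in for tree-sitter tagging queries here.
# A real setup would run queries against the parsed syntax tree.
DEFINITION = re.compile(r"\b(?:def|class|object|trait|val)\s+([A-Za-z_]\w*)")
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
KEYWORDS = {"def", "class", "object", "trait", "val", "new"}


def tag_source(path, source):
    """Yield (kind, name, path, line_number) tags for one source file."""
    for line_number, line in enumerate(source.splitlines(), start=1):
        # Character offsets of names that directly follow a defining keyword.
        defined_at = {m.start(1) for m in DEFINITION.finditer(line)}
        for m in IDENTIFIER.finditer(line):
            if m.group() in KEYWORDS:
                continue
            kind = "definition" if m.start() in defined_at else "reference"
            yield kind, m.group(), path, line_number


def build_index(files):
    """Map each symbol name to every place it is defined, across all files."""
    index = defaultdict(list)
    for path, source in files.items():
        for kind, name, tag_path, line_number in tag_source(path, source):
            if kind == "definition":
                index[name].append((tag_path, line_number))
    return index


def find_definitions(index, name):
    """Fuzzy lookup: match on the name only, ignoring scopes and types."""
    return index.get(name, [])


if __name__ == "__main__":
    files = {
        "A.scala": "class Greeter {\n  def greet(name: String) = println(name)\n}\n",
        "B.scala": "object Main {\n  val g = new Greeter\n  g.greet(\"hi\")\n}\n",
        "C.scala": "object Other {\n  def greet() = ()\n}\n",
    }
    index = build_index(files)
    # Both greet definitions come back; a name-only match cannot tell
    # which one the call in B.scala actually refers to.
    print(find_definitions(index, "greet"))
```

The lookup for `greet` returns candidates from both `A.scala` and `C.scala`, which is the imprecision that the "fuzzy" label refers to.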
We also have a more precise version of Code Nav that will be coming Real Soon Now. That will follow a similar strategy, though using a more complex DSL for defining the "actual" name binding rules of your language. Part of the "Real Soon Now" is getting the tooling and documentation in place to make it clearer how to author this new DSL. In the meantime, to whet your appetite, I gave a talk at Strange Loop a couple of weeks ago that goes into detail about how the formalism works.
Hope this helps; please feel free to ask any follow-on questions if the documentation links above aren't as clear as they should be.
Wow awesome, and really cool talk. I'm looking forward to all of it!
So:
The `tree-sitter-graph` project extends the query syntax of `tree-sitter`. The new syntax lets you define rules to generate the nodes and edges of a "`stack-graph`" from a parsed `tree-sitter` AST. The `stack-graph` can then be used on-demand to find a symbol's references or definition -- accurately.
The current way to add support for symbol resolution for a language on GitHub is to add a `tags.scm` file to the language's specific `tree-sitter` repo. The `tags.scm` file contains `tree-sitter` queries with special node labels. It lets us build the symbol lists that currently make symbol resolution possible in simpler cases.
Soon, it may be a different file using the new extended `stack-graph`-generating syntax. This could potentially enable true arbitrarily complex symbol resolution for every inch of code on GitHub.
... did I get that right? 😅
Very very exciting.
An important problem though is that the tree-sitter grammar for Scala is very incomplete. If you run it on popular projects, you will get lots of ERROR nodes in the CST. Scala has complex lexical rules and lots of ambiguities in its grammar that are not resolved yet.
In fact, it would be great to have some infrastructure for tree-sitter to run the different grammars on real projects and get statistics. We have something like this at R2C https://dashboard.semgrep.dev/ but it would be great to generalize this to any tree-sitter project.
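As a rough illustration of what such statistics could look like, here is a minimal Python sketch. It assumes you already have the S-expression text of each parsed file (for example, captured from the tree-sitter CLI's parse output) and simply counts `ERROR` nodes. The names `GrammarStats` and `summarize` are hypothetical, `MISSING` nodes are not counted, and this is not how the R2C dashboard works.

```python
import re
from dataclasses import dataclass

# Matches the start of an ERROR node in S-expression output, e.g. "(ERROR [0, 0] ...".
ERROR_NODE = re.compile(r"\(ERROR\b")


@dataclass
class GrammarStats:
    files: int        # number of files parsed
    clean_files: int  # files whose tree contains no ERROR node
    error_nodes: int  # total ERROR nodes across all files

    @property
    def clean_ratio(self):
        """Fraction of files that parsed without any ERROR node."""
        return self.clean_files / self.files if self.files else 0.0


def summarize(parse_outputs):
    """Aggregate ERROR-node counts; parse_outputs maps path -> S-expression text."""
    files = clean_files = error_nodes = 0
    for sexp in parse_outputs.values():
        count = len(ERROR_NODE.findall(sexp))
        files += 1
        error_nodes += count
        if count == 0:
            clean_files += 1
    return GrammarStats(files, clean_files, error_nodes)


if __name__ == "__main__":
    # Hand-written sample trees; node names are illustrative only.
    outputs = {
        "ok.scala": "(compilation_unit (object_definition name: (identifier)))",
        "bad.scala": "(compilation_unit (ERROR (identifier)) (ERROR))",
    }
    stats = summarize(outputs)
    print(stats, stats.clean_ratio)
```

Tracking the clean-file ratio per project over time would give a simple signal of how much of a real codebase the grammar handles.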
Hello @dcreager
> Once you have the `tree-sitter highlight` command generating syntax highlighting that you're happy with, we add the grammar repo to our internal syntax highlighting service and you're done.
How do we know what is currently used by GitHub to perform syntax highlighting for a specific language? What is the process for updating it?
Also, is there a relationship between stack-graph and https://github.com/github/semantic? Is there a documentation page that tells language authors what the possible levels of support for their language on GitHub are (syntax highlighting, precise code navigation, actions, etc.) and what they need to do to get each one?
Open Question
What work is left for GitHub to be able to use this project for its code navigation in Scala codebases?
According to this GitHub docs page, tree-sitter is what GitHub uses to support its code navigation features for languages like JS and Go. What features are not yet implemented in the Scala tree-sitter grammar that are needed for GitHub to be able to use it for code navigation?