github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Supporting Tree-sitter grammars #4342

Closed shaunlebron closed 5 years ago

shaunlebron commented 5 years ago

I'm not sure if GitHub is using tree-sitter for syntax-highlighting, but I saw in #4013 that the grammars are not supported in some way.

I created a syntax-highlighter using tree-sitter for my own purposes, and thought it might be helpful to share here: https://github.com/shaunlebron/highlight-tree-sitter

pchaigno commented 5 years ago

As far as I know, GitHub doesn't support tree-sitter grammars. This is not something that depends on Linguist anyway, so you should probably mention it to GitHub support if you'd like them to support tree-sitter grammars.

Alhadis commented 5 years ago

Background: Atom uses tree-sitter since it is a fast way to use proper grammars in an editor, removing the need for hacky regexes.

Just an FYI: those "hacky regexes" are precisely the reason for the flexibility and power of TextMate-based grammars. 😉 One can use them to write structured grammars a la tree-sitter, or to highlight some ad-hoc format which lacks conventional or defined structure.

Having said that, supporting tree-sitter grammars won't be as simple as flicking on a light switch, so to speak.

pchaigno commented 5 years ago

@Alhadis Oh, Atom supports tree-sitter? If it does, it might be in GitHub's plans to support it as well...

Alhadis commented 5 years ago

The Atom developers started the tree-sitter project, so yes, it's only natural that Atom supports it. 😉

pchaigno commented 5 years ago

Ahah! @vmg might know if there's planned support for tree-sitter in GitHub's syntax highlighter then.

shaunlebron commented 5 years ago

@Alhadis thanks for the note on "hacky regexes", I reworded it to remove the snarkiness since regexes have their place 👍

I also realized that whatever GitHub uses to do its syntax-highlighting is probably private? Linguist only identifies which external grammars to use, and the grammar repos have nothing to perform the actual highlighting as far as I know:

Linguist detects the language of a file but the actual syntax-highlighting is powered by a set of language grammars which are included in this project as a set of submodules as listed here.

GitHub already diffs syntax trees created by tree-sitter for displaying Pull Request toc's, but doesn't seem to be using them for syntax-highlighting.

Alhadis commented 5 years ago

since regexes have their place

It's actually more than just regular expressions. 😉 TextMate's strongest feature is its unassuming simplicity, and the ease with which structured grammars can be built from composing groups of smaller expressions.

It's also cheap and fast to syntax highlight a flat file in a top-down pass, whereas Tree Sitter obviously has to parse and pull an entire AST into memory before it can highlight regions of source code. For an interactive text-editor, it makes sense… but for the millions of static files being viewed across GitHub, the added overhead is wasted.
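
To make the "flat, top-down pass" idea concrete, here is a minimal sketch in Python. It is not TextMate's actual engine, and the rule names and toy language are assumptions made up for illustration. A few small named expressions are composed into one pattern, and each line is scanned left to right into (scope, text) pairs without building a tree.

```python
import re

# Hypothetical rules for a toy language; the scope names are illustrative only.
RULES = [
    ("comment", r"#.*"),
    ("string", r'"(?:\\.|[^"\\])*"'),
    ("number", r"\b\d+(?:\.\d+)?\b"),
    ("keyword", r"\b(?:def|return|if|else)\b"),
]

# Compose the small expressions into one alternation of named groups.
PATTERN = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in RULES))


def highlight_line(line):
    """Scan one line left to right and return (scope, text) pairs."""
    tokens = []
    pos = 0
    for match in PATTERN.finditer(line):
        if match.start() > pos:
            # Text between matches carries no scope.
            tokens.append(("plain", line[pos:match.start()]))
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(line):
        tokens.append(("plain", line[pos:]))
    return tokens


def highlight(source):
    """Highlight a file line by line; no syntax tree is built."""
    return [highlight_line(line) for line in source.splitlines()]
```

Each line is handled in a single pass and only the token list is kept. Real TextMate grammars also carry begin/end state across lines, which this sketch leaves out.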

shaunlebron commented 5 years ago

@Alhadis thanks for extra context, I suppose server-side rendered files would make it a better fit

vmg commented 5 years ago

Thanks for the contribution @shaunlebron! We've been exploring using Tree Sitter for syntax highlighting on the website, but there are many technical challenges to overcome. We'll keep y'all posted.

pchaigno commented 5 years ago

Thanks @vmg for the info!

I think we should close this in the meantime. As long as the backend doesn't support Tree Sitter, there is nothing we can do on the Linguist side.