microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License
162.21k stars 28.55k forks

Support syntax highlighting with tree-sitter #50140

Open fcurts opened 6 years ago

fcurts commented 6 years ago

Please consider supporting tree-sitter grammars in addition to TextMate grammars. TextMate grammars are incredibly difficult to author and maintain and impossible to get right. The over 500 (!) issues reported against https://github.com/Microsoft/TypeScript-TmLanguage are living proof of this.

This presentation explains the motivation and goals for tree-sitter: https://www.youtube.com/watch?v=a1rC79DHpmY

tree-sitter already ships with Atom and is also used on github.com.

jasonwilliams commented 2 years ago

@jeff-hykin from your experience, is this something that can be prototyped onto VSCode itself? or are there way too many changes?

I've noticed so far the only attempts have been in the form of a plugin rather than a change to the core

jeff-hykin commented 2 years ago

@jasonwilliams there are way too many changes. Actually, there's a linked issue somewhere in this thread from Alex that explains it in detail.

However, Max Brunsfeld (creator of tree-sitter) and the other Atom founders recently announced they're working on their own editor from scratch!

michaelblyons commented 2 years ago

https://zed.dev/

jasonwilliams commented 2 years ago

@jeff-hykin I have read Alex’s thread. I’m not convinced the issue is so much changing to tree-sitter; it seems to be more about moving tokenisation off the UI thread. Can tree-sitter not be added but kept synchronous?

This is a hard problem but I don’t think it’s impossible. There may be some breaking changes along the way but it’s worth it.

Zed looks interesting, I will keep an eye on it.

jeff-hykin commented 2 years ago

@jasonwilliams true I guess that is the topic of that post.

Sadly though, it is definitely not just a tokenizer that can be swapped out. Semantic highlighting loosened this somewhat, but from my understanding the theme system, internal functions, code folding, and the syntax highlighting tags, along with many assumptions/optimizations, are still tightly coupled to TextMate.

Here's a post that covers some of the early integrations. You may already know about it https://code.visualstudio.com/blogs/2017/02/08/syntax-highlighting-optimizations

Atom was less inter-dependent internally because it was designed to be hackable, but it still had/has per-language manually-written conversion layers from tree sitter tokens to TextMate tokens because it's just so hard to fully do the tree sitter justice and not break everything.

jasonwilliams commented 2 years ago

I took a look at the current state of play to see if it's possible to break this up.

It looks like, for each language, it goes off to fetch a tokenization implementation (which, for now, happens to be the TMTokenization class).

I wonder if as a start this can also support a treeSitterTokenization class which can then map its tree types back into textmate types so that things further upstream continue to work. It would need to implement ITokenizationSupport.

My understanding is https://www.npmjs.com/package/web-tree-sitter would work in all platforms web and desktop.

Tokenize is currently called line by line and expects a result in return. The new tokenization class would need to return its result in the shape of EncodedTokenizationResult.

From what I understand, this is similar to the approach the Atom team took when beginning to migrate.
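As a rough illustration of the line-by-line constraint described above, the sketch below shows one way a whole-document parse could be served to a caller that asks for one line at a time. It is not VS Code's ITokenizationSupport interface and it does not call web-tree-sitter: `SyntaxLeaf`, `LineToken`, `bucketLeavesByLine` and `LineByLineAdapter` are hypothetical names, and the leaves are assumed to come from some parser, already in document order.

```typescript
// Hypothetical leaf node of a parsed syntax tree (zero-based rows and columns).
interface SyntaxLeaf {
  type: string;
  startRow: number;
  startColumn: number;
  endRow: number;
  endColumn: number;
}

// Hypothetical per-line token handed back to a line-oriented caller.
interface LineToken {
  startColumn: number;
  endColumn: number;
  type: string;
}

// Split every leaf into per-line segments so each line can be answered alone.
function bucketLeavesByLine(lines: readonly string[], leaves: readonly SyntaxLeaf[]): LineToken[][] {
  const perLine: LineToken[][] = lines.map(() => []);
  for (const leaf of leaves) {
    for (let row = leaf.startRow; row <= leaf.endRow; row++) {
      const bucket = perLine[row];
      if (!bucket) continue; // leaf points outside the known lines
      // A multi-line leaf (e.g. a block comment) covers the full middle lines.
      const startColumn = row === leaf.startRow ? leaf.startColumn : 0;
      const endColumn = row === leaf.endRow ? leaf.endColumn : (lines[row] ?? "").length;
      if (endColumn > startColumn) {
        bucket.push({ startColumn, endColumn, type: leaf.type });
      }
    }
  }
  return perLine;
}

// Parse once (elsewhere), then answer tokenize requests one line at a time.
class LineByLineAdapter {
  private readonly perLine: LineToken[][];

  constructor(lines: readonly string[], leaves: readonly SyntaxLeaf[]) {
    this.perLine = bucketLeavesByLine(lines, leaves);
  }

  tokenizeLine(lineIndex: number): LineToken[] {
    return this.perLine[lineIndex] ?? [];
  }
}
```

With this shape, a block comment spanning two lines shows up as one comment token on each line, which is roughly what a line-oriented consumer expects. Incremental re-parsing after edits is left out entirely.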

Grammar Registry

One of the first tasks would be to support tree-sitter grammars alongside TextMate. A “type” field can be added and the path can point to the wasm file; this would be backwards compatible, with TextMate being the default. See https://github.com/atom/atom/commit/9762685106d161edf4a8df711278da47c170405f

The current grammar registry lives in vscode-textmate and is specific to TMGrammars. You would most likely need a resolver and a registry equivalent which only fetch grammars that have the type tree-sitter. You would probably want the TMGrammar resolver to filter out grammars that are of type tree-sitter.

Another option is to have a new format entirely a la what anycode does.

I believe those changes could be done today without causing breaking changes.
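To make the “type” field idea concrete, here is a minimal sketch of how grammar contributions could be split between the two resolvers. The `GrammarContribution` shape, the `type` values and `partitionGrammars` are assumptions for illustration only; none of them exist in VS Code or vscode-textmate today.

```typescript
// Hypothetical grammar kinds; "textmate" is the implicit default.
type GrammarType = "textmate" | "tree-sitter";

// Hypothetical contribution entry. Only `type` is the proposed new field.
interface GrammarContribution {
  language: string;
  path: string; // grammar file for TextMate, .wasm file for tree-sitter
  type?: GrammarType; // omitted means "textmate", so old entries keep working
}

function grammarTypeOf(grammar: GrammarContribution): GrammarType {
  return grammar.type ?? "textmate";
}

// Each resolver would only ever see the grammars of its own kind.
function partitionGrammars(grammars: readonly GrammarContribution[]): {
  textmate: GrammarContribution[];
  treeSitter: GrammarContribution[];
} {
  const textmate: GrammarContribution[] = [];
  const treeSitter: GrammarContribution[] = [];
  for (const grammar of grammars) {
    if (grammarTypeOf(grammar) === "tree-sitter") {
      treeSitter.push(grammar);
    } else {
      textmate.push(grammar);
    }
  }
  return { textmate, treeSitter };
}
```

Because an entry without `type` falls into the TextMate bucket, existing extensions would not need to change anything.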

TreeSitterTokenization

There would need to be a treeSitterTokenization class which is equivalent to the TMTokenization class. This class would also call on the grammar registry to fetch the right grammars and load them in. I think for now the tokenizer class may need to return tokens in the same format so that things stay compatible upstream?

There would need to be some mapping somehow from tree-sitter types back to TM Types, although https://github.com/georgewfraser/vscode-tree-sitter does this so I'm guessing it's possible. If you map back there aren't a huge amount of changes, you could probably stop here and change things further upstream at a later point.
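A very small sketch of what such a mapping could look like, assuming a flat lookup from node type to scope. The node type names on the left are made up (real tree-sitter grammars each define their own), the scope names on the right follow common TextMate conventions, and `toTextMateScope` is a hypothetical helper, not something taken from vscode-tree-sitter.

```typescript
// Hypothetical node types mapped to conventional TextMate scope prefixes.
const NODE_TYPE_TO_SCOPE: ReadonlyMap<string, string> = new Map<string, string>([
  ["comment", "comment"],
  ["string", "string.quoted"],
  ["number", "constant.numeric"],
  ["identifier", "variable.other"],
  ["function_name", "entity.name.function"],
]);

// Unknown node types fall back to the language's root scope, so themes
// still see a valid scope instead of nothing.
function toTextMateScope(nodeType: string, languageId: string): string {
  const scope = NODE_TYPE_TO_SCOPE.get(nodeType);
  return scope === undefined ? `source.${languageId}` : `${scope}.${languageId}`;
}
```

A real mapping would probably need to look at parent nodes as well, since the same node type can deserve different scopes depending on context.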

zm-cttae commented 1 year ago

Is there real tangible data on the performance change from using Tree-sitter?

jasonwilliams commented 1 year ago

@meche-gh https://github.com/microsoft/vscode/pull/161479

haikyuu commented 1 year ago

Slightly tangential to this: I find tree-sitter to be interesting for more than syntax highlighting. It's used in neovim, helix and other editors to power some very useful features: incremental selection, folding, indentation out of the box.

And most importantly, external modules built around tree sitter are extremely useful.

I'd say we approach including tree-sitter in VSCode more holistically:

These decisions will likely impact the first inclusion of tree-sitter into VSCode, be it syntax highlighting or others.

I don't know if it's clear or not, but including tree-sitter in VSCode is a huge benefit because it makes the editor aware of the code rather than treating it as text. It may start with syntax highlighting (which is already somewhat solved by TextMate grammars) but doesn't end there.

If the benefit of having tree-sitter syntax highlighting isn't very big, I'd say it would be better to start with other simpler features that can live at the borders of VSCode as opposed to being in the core (syntax highlighting isn't simple to get right and not critical since it is working relatively well atm.)

Once the basic setup of tree-sitter is done, a PR to add syntax highlighting will be much easier to build, review and merge.

jasonwilliams commented 1 year ago

Just giving my update and 2 cents.

@haikyuu those are some interesting thoughts, and I agree it’s a huge benefit all round.

I don’t agree about syntax highlighting being a solved problem because even though it “works” the performance is hitting its ceiling. I wrote about it here https://jason-williams.co.uk/posts/speeding-up-vscode-extensions-in-2022/ (see Tree Sitter section). If VSCode wants to stay competitive it will eventually need to migrate towards this in my opinion. Last time I looked at the performance of large files a lot of time was attributed to parsing.

I do agree with starting simple, but this will need to be in the core. I don’t want to see us go down a path of “everyone needs a tree sitter extension”, not that that’s what you were suggesting, but it would be good to see some roadmap for actually having it be the primary syntax system. My comment above looks into some first steps of adding it as a service then utilising it bit by bit, but still having the textmate system used primarily. This should provide at least some migration path for extensions going forward.

I did look into branching off https://github.com/microsoft/vscode/pull/161479 but it’s a monumental effort as it touches so many parts of the code base. So it isn’t something I could take on alone, especially if the maintainers are already planning to work on this (we don’t know, they are quiet on this topic, although there are still positive signals they’re interested in investigating).

ABI Stability

There was concern over stability which may have been the reason progress in this area went quiet.

@alexdima did raise concerns around the ABI potentially changing causing extensions to break.

However, I reached out to the Tree Sitter maintainers, who declared the library to be stable and said there shouldn’t be any backwards-incompatible changes. Secondly, Neovim, which has been using Tree Sitter for over 2 years, has only had forwards compatibility issues, not backwards ones. The former are more easily solved by having extensions build trees against a specific version before publishing.
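To illustrate the forwards versus backwards distinction, here is a minimal sketch of a version gate an editor could apply before loading a grammar. It assumes the runtime knows the newest grammar ABI it understands and the oldest one it still accepts; `AbiRange`, `AbiCheck` and `checkGrammarAbi` are hypothetical names, and the numbers used with them would be whatever the embedded tree-sitter runtime reports.

```typescript
// Hypothetical description of what the embedded runtime can load.
interface AbiRange {
  minCompatible: number; // oldest grammar ABI still accepted
  current: number; // newest grammar ABI the runtime understands
}

type AbiCheck = "ok" | "grammar-too-old" | "grammar-too-new";

function checkGrammarAbi(grammarAbi: number, runtime: AbiRange): AbiCheck {
  // Forwards compatibility problem: the grammar was built with a newer
  // tree-sitter than the one the editor ships.
  if (grammarAbi > runtime.current) return "grammar-too-new";
  // Backwards compatibility problem: the runtime dropped support for it.
  if (grammarAbi < runtime.minCompatible) return "grammar-too-old";
  return "ok";
}
```

Pinning the version that extensions build against, as suggested above, is what keeps published grammars out of the "grammar-too-new" case.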

haikyuu commented 1 year ago

@jasonwilliams I agree this should land in core for an optimal experience. And the performance benefit is not to be neglected (I am personally using neovim at the moment and everything feels way faster).

zm-cttae-archive commented 1 year ago

There would need to be some mapping somehow from tree-sitter types back to TM Types, although vscode-tree-sitter does this so I'm guessing it's possible. If you map back there aren't a huge amount of changes, you could probably stop here and change things further upstream at a later point.

If the Tree-Sitter community wants to scale to existing themes, they need to plan their token names ahead of time and standardise it the way that Sublime and Textmate have done, and also the way Microsoft began to do with the LSP token format a couple months in.

zm-cttae-archive commented 1 year ago

Nvm, if the mapping is done by the grammar owner, that would be a small portion of the current effort needed to maintain TextMate grammars. It would suck more if there was no TM grammar, but even then the mapping would only be painful once.

MixusMinimax commented 1 year ago

I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

While I agree with the fact that tree-sitter grammars also have their difficulties, the difference is that many of us need to define a tree-sitter grammar anyway, since you can use it for other things, like an LSP server, or a compiler. A textmate grammar would have to be maintained in parallel with whatever other parser generator you're using for other components.

codethief commented 6 months ago

Am I reading https://github.com/microsoft/vscode/pull/161479#issuecomment-1761967068 correctly that tree-sitter support is not going to happen any time soon? :\

texastoland commented 6 months ago

The team hasn't been active in any linked issues, hasn't publicly expressed intent to change direction, and arguably has competing interests with unifying VS Code syntax definitions with either VS and/or Monaco/Monarch. I consider this issue closed in practice.

serverhorror commented 6 months ago

unifying VS Code syntax definitions with either VS and/or Monaco/Monarch

so ... where do we go to ask ... "either VS and/or Monaco/Monarch" to adopt treesitter :)

texastoland commented 6 months ago

so ... where do we go to ask ... "either VS and/or Monaco/Monarch" to adopt treesitter :)

Visual Studio

Monarch (Monaco)

heartacker commented 6 months ago

https://github.com/microsoft/vscode/issues/206739 🛩️

Code Editor

Explore using the new EditContext API https://github.com/microsoft/vscode/issues/204371 @hediet
💪 Explore hover enriching https://github.com/microsoft/vscode/issues/195394 @aiday-mar @hediet
🔴 Explore tree-sitter parser ecosystem @alexr00

michaelblyons commented 5 months ago

https://github.com/microsoft/vscode/issues/207416 for those who missed it.