Support syntax highlighting with tree-sitter

fcurts commented 6 years ago

Please consider supporting tree-sitter grammars in addition to TextMate grammars. TextMate grammars are incredibly difficult to author and maintain and impossible to get right. The over 500 (!) issues reported against https://github.com/Microsoft/TypeScript-TmLanguage are living proof of this.

This presentation explains the motivation and goals for tree-sitter: https://www.youtube.com/watch?v=a1rC79DHpmY

tree-sitter already ships with Atom and is also used on github.com.

fcurts commented 3 years ago

2 years later, is there any consensus on how and when to offer a better alternative to TextMate grammars?

dannymcgee commented 3 years ago

2 years later, is there any consensus on how and when to offer a better alternative to TextMate grammars?

@fcurts "Semantic highlighting" has been available for a short while and it's basically the best syntax highlighting possible, especially when combined with the TM grammars.

It's only available for a few languages so far (it might still be in preview? Don't quote me on that), but basically, it leverages a language service to determine the actual semantic type of any identifiers in your code and highlights them according to your theme's rules for that type. For tokens that don't have or don't need semantic information from a language service (like keywords and punctuation), it falls back to the TM grammars.

For example, in JavaScript:

const onClick = (event) => {
  // do something
}

document
  .querySelector('.my-button')
  .addEventListener('click', onClick) // <-- `onClick` gets colored as a function thanks to semantic highlighting

You need to be using a theme that has semantic highlighting enabled (the default themes do), or manually set it to enabled via a user setting override. As far as I'm aware, the supported languages currently are JavaScript, TypeScript, C++ (via the C++ extension), and C# (via the C# extension). There may be more. It's up to the author of any given language extension to provide the semantic highlighting tokens.

razzeee commented 3 years ago

F# also supports it.

But the language server spec for this is still preview and subject to change, as far as I followed it.

MewX commented 3 years ago

Would be so happy to see VSCode use a tree-sitter colorizer.

Currently there's no available tree-sitter option unfortunately...

The current extension is broken randomly: https://github.com/georgewfraser/vscode-tree-sitter/issues/28

Btw, Atom has supported tree-sitter for 2 years now: https://github.blog/2018-10-31-atoms-new-parsing-system/

dannymcgee commented 3 years ago

Btw, Atom has supported tree-sitter for 2 years now: https://github.blog/2018-10-31-atoms-new-parsing-system/

The Atom team developed Tree-Sitter for use with Atom, so it's not really surprising that it's the first (only?) editor to have adopted it. It literally says that in the second paragraph of your link.

VS Code's new Semantic Highlighting implementation is now more accurate than Tree-Sitter since it uses a language service that analyzes your entire project (Tree-Sitter AFAIK has no information about the syntax tree outside of the file it's currently highlighting.)

jeff-hykin commented 3 years ago

@dannymcgee semantic highlighting wouldn't/doesn't replace Textmate. The tree sitter replaces textmate, and semantic highlighting would stay as-is.

Khady commented 3 years ago

Having tree sitter and providing an api for it would open the door for many interesting extensions too. We can imagine a paredit that is semantically correct for example.

it's not really surprising that it's the first (only?) editor to have adopted it. It literally says that in the second paragraph of your link.

neovim has it too.

XeonG commented 3 years ago

Has this issue https://github.com/OmniSharp/omnisharp-vscode/issues/2461 been fixed yet? Or is VS Code still inferior to Visual Studio for C# development?

tristan957 commented 3 years ago

This issue is finally on the second page of the issues backlog which is nice to see. Visibility on this issue is probably pretty low just because most end users don't know how painful Tm grammars are until they run into syntax highlighting issues. @jeff-hykin maybe it would help get more visibility if you advertised it on the "Better C/C++ Syntax" readme?

XeonG commented 3 years ago

Hopefully fixed soon. Unity game dev is shit with VS Code for a number of reasons, one of the biggest being that so much of the code uses #if/#else/#endif wrappers for different plugins, dev builds, etc., and having all that unused code not visually greyed out is just rubbish to work with. Might as well use Notepad.

jeff-hykin commented 3 years ago

end users don't know how painful Tm grammars are until they run into syntax highlighting issues. @jeff-hykin maybe it would help get more visibility if you advertised it on the "Better C/C++ Syntax" readme?

@tristan957

With 330 issues (230 closed) on what is essentially a static JSON file, some of which are unsolvable, a general explanation in the readme wouldn't be a bad idea.

tristan957 commented 3 years ago

@jeff-hykin since you are familiar with at least enough of VSCode to put together syntax definitions, is it possible to have tree-sitter integration in an external extension? I think as Neovim 0.5 has been developed, the tree sitter integration actually lives in a separate extension, at least for now. https://github.com/nvim-treesitter

Like it seems that VSCode semantic highlighting completely takes over from Tm grammars when enabled, but maybe they work side by side together 🤷 . If they don't work together could tree sitter be another option like semantic highlighting is an option?

jeff-hykin commented 3 years ago

@tristan957 sort of. You can certainly get the tree sitter engine running, I packaged up the WASM version of it into georgewfraser's extension. I worked with him on it for a bit https://github.com/jeff-hykin/experimental-tree-sitter https://github.com/georgewfraser/vscode-tree-sitter

1. The problem is the extension uses decorations, which are slow. Really Really Really slow
2. There is no good way of accessing the user's theme settings. George's extension has its own colors independent of whatever your theme is. (You can manually customize the colors)
3. Assuming it did have access (there are hacky ways to gain access), there needs to be an implementation of a converter from the Textmate theme to the Tree sitter query language.

And there's smaller problems along with those. #1 is by far the most difficult issue, and the semantic highlighting might fix that, I'd have to learn more about the API.

I imagine Neovim gives extensions a lot more control than VS Code does with its extensions.

NullVoxPopuli commented 3 years ago

Any updates on this? Tree-sitter's nested language parsing has been amazing in neovim.

LSP enhanced highlighting is great and all, but it's still just per language... Multi language support would still be super annoying. With tree sitter, we could have not only proper syntax highlighting for any combination of languages in the same file, but we could use the same language switching logic to run multiple language servers on the same file.

Would be great if vscode could catch up to the terminal-based editor. ;)

sfc-gh-pkommini commented 3 years ago

Any chance this is a go?

jaanli commented 3 years ago

Any updates on this? Would help greatly for accessibility and voice input with VS Code :( especially in ways impossible with LSP and Semantic Highlighting

XeonG commented 3 years ago

Any year now Vscode....

ThePrimeagen commented 3 years ago

Any year now Vscode....

Yes. Tree sitter is used in tons of editors now. Vim and emacs support it.

Writing contextual plugins is incredibly simple too

parascent commented 3 years ago

This I feel is a must have for all modern text editors along with LSP.

o314 commented 2 years ago

Own your ambition, Microsoft. You want to be an open-source leader, you want to be the greatest IDE? Then use the correct technological stack at the parsing level. It will not be done with regex, and re-explaining that in the 2020s is a fool's game.

ghost commented 2 years ago

There should be a way to get a LSP running a tree-sitter configuration.

ghost commented 2 years ago

I have found the most probable issue with the tree-sitter extension

https://github.com/georgewfraser/vscode-tree-sitter/issues/28#issuecomment-890418181

It is deprecated atm. However, I hope an extended support fix is released so that we can use tree-sitter in Visual Studio Code.

dannymcgee commented 2 years ago

There should be a way to get a LSP running a tree-sitter configuration.

Just spitballing, but it should actually be feasible to make sort of a "general purpose" LSP extension that takes a map of VS Code language IDs to tree-sitter parsers as configuration, and then uses those parsers to provide semantic tokens for the active file if it matches one of the supported languages. I'm not sure how this would work in case of a conflict though (i.e., if you have two extensions trying to provide tokens for the same language).

But for extension authors wanting a way to add support for some language without having to mess with TextMate grammars, that is absolutely an option. You don't even need to use the LSP protocol for that, you could just make a "standard" VS Code language extension and use a tree-sitter parser as the engine for everything — document symbols, semantic tokens, hovers, whatever.

ghost commented 2 years ago

The extension by @georgewfraser doesn't use LSP yet, but I'd like it to. Heck, I'd learn the API for LSP if it meant I could port tree-sitter without those slow decorators.

lnicola commented 2 years ago

I'd learn the API for LSP if it meant I could port tree-sitter without those slow decorators.

You mean this? https://microsoft.github.io/language-server-protocol/specification#textDocument_semanticTokens
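
For anyone skimming that part of the spec: semantic tokens are sent as a flat integer array, five integers per token (delta line, delta start character, length, token type index, modifier bitset), with positions relative to the previous token. A minimal sketch of that encoding follows. The `AbsToken` type and `encodeSemanticTokens` name are made up for this example and are not part of any library.

```typescript
// Hypothetical input shape: one token with absolute, zero-based positions.
interface AbsToken {
  line: number;
  startChar: number;
  length: number;
  tokenType: number;      // index into the legend's token types
  tokenModifiers: number; // bitset over the legend's token modifiers
}

// Encode tokens into the relative five-integers-per-token layout.
function encodeSemanticTokens(tokens: AbsToken[]): number[] {
  // Relative encoding only works if tokens are ordered by position.
  const sorted = tokens
    .slice()
    .sort((a, b) => a.line - b.line || a.startChar - b.startChar);

  const data: number[] = [];
  let prevLine = 0;
  let prevChar = 0;
  for (const t of sorted) {
    const deltaLine = t.line - prevLine;
    // The start character is relative only when the token stays on the same line.
    const deltaStart = deltaLine === 0 ? t.startChar - prevChar : t.startChar;
    data.push(deltaLine, deltaStart, t.length, t.tokenType, t.tokenModifiers);
    prevLine = t.line;
    prevChar = t.startChar;
  }
  return data;
}
```

Inside a VS Code extension the `SemanticTokensBuilder` does this bookkeeping for you; the sketch only matters if you produce the array yourself in a standalone language server.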

razzeee commented 2 years ago

Own your ambition, Microsoft. You want to be an open-source leader, you want to be the greatest IDE? Then use the correct technological stack at the parsing level. It will not be done with regex, and re-explaining that in the 2020s is a fool's game.

Yeah, if you read up on this, you'll find that they want to use the parser that's at the core of Visual Studio. So they're aware that regex isn't the way to go.

The extension by @georgewfraser doesn't use LSP yet, but I'd like it to. Heck, I'd learn the API for LSP if it meant I could port tree-sitter without those slow decorators.

What are you trying to do? I actually think it's a bad idea to have extensions sliced like that (all languages, one feature) instead of the other way around. It might make more sense to go for something like an API helper; see the old testing extensions.

Milo123459 commented 2 years ago

My friend was talking about this issue and I had an "idea" - couldn't someone make a plugin to convert TreeSitter grammars to TextMate? It'd mean you'd be able to use TreeSitter without VSCode needing to change anything

ghost commented 2 years ago

What are you trying to do? I actually think it's a bad idea to have extensions sliced like that (all languages, one feature) instead of the other way around. It might make more sense to go for something like an API helper; see the old testing extensions.

I agree, but if staff aren't budging on this, vscode-tree-sitter is also available as an NPM module for use in a separate extension. Thankfully the deprecation hasn't been pushed through on NPM or elsewhere.

ghost commented 2 years ago

My friend was talking about this issue and I had an "idea" - couldn't someone make a plugin to convert TreeSitter grammars to TextMate? It'd mean you'd be able to use TreeSitter without VSCode needing to change anything

Unfortunately, Tree-sitter offers a versatile parser that can't be expressed fully as a regular expression. Tree-sitter to Textmate is usually not as bad as parsing HTML with regular expressions, but it's a rather heavy downgrade, especially as Textmate doesn't support variable length lookbehinds.
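
To make the regex limitation concrete, here is a minimal sketch. It is not Tree-sitter or TextMate code, and both names are invented for the example. A fixed regular expression can only follow nesting up to the depth written into the pattern, while even a trivial counter handles any depth. TextMate grammars layer begin/end rules on top of regexes, so this shows only the limitation of a single pattern.

```typescript
// Hypothetical illustration only: neither name belongs to Tree-sitter or TextMate.

// Accepts text whose parentheses nest at most two levels deep.
// Supporting a third level means rewriting the pattern by hand.
const twoLevels = /^(?:[^()]|\((?:[^()]|\([^()]*\))*\))*$/;

// A simple depth counter: accepts balanced parentheses at any depth.
function isBalanced(text: string): boolean {
  let depth = 0;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closing paren without an opener
    }
  }
  return depth === 0;
}
```

With these definitions, `twoLevels` rejects `f(a(b(c)))` even though it is perfectly balanced, whereas `isBalanced` accepts it.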

Milo123459 commented 2 years ago

Oh that makes sense

dannymcgee commented 2 years ago

Heck, I'd learn the API for LSP if it meant I could port tree-sitter without those slow decorators.

It's honestly a lot more straightforward than you might think, just a matter of trudging through Microsoft's docs and example repos. And like I said, you don't even need to use a language server since tree-sitter has JavaScript bindings.

Here's a really straightforward example from a project of mine that I never finished:

// extension entry point
import { ExtensionContext, languages } from "vscode";
import SemanticTokensProvider, { SEMANTIC_TOKENS_LEGEND } from "./semantic-tokens";

export function activate(ctx: ExtensionContext) {
   let provider = languages.registerDocumentSemanticTokensProvider(
      { language: "glsl" },
      new SemanticTokensProvider(),
      SEMANTIC_TOKENS_LEGEND,
   );
   ctx.subscriptions.push(provider);
}

// semantic-tokens.ts
import {
   DocumentSemanticTokensProvider,
   SemanticTokens,
   SemanticTokensBuilder,
   SemanticTokensLegend,
   TextDocument,
} from "vscode";

import lexer, { TokenType } from "./lexer";
import parser from "./parser";

export enum SemanticToken {
   Macro = "macro",
   Function = "function",
   Param = "parameter",
   Variable = "variable",
}

export const SEMANTIC_TOKENS_LEGEND = new SemanticTokensLegend([
   SemanticToken.Macro,
   SemanticToken.Function,
   SemanticToken.Param,
   SemanticToken.Variable,
]);

export default class SemanticTokensProvider implements DocumentSemanticTokensProvider {
   provideDocumentSemanticTokens(doc: TextDocument): SemanticTokens {
      let builder = new SemanticTokensBuilder(SEMANTIC_TOKENS_LEGEND);
      let tokens = lexer.tokenize(doc);

      tokens
         .filter(tok => tok.type === TokenType.Ident)
         .forEach(tok => {
            let decl = parser
               .getScopeAt(doc, tok.range)
               .findDeclOf(tok.data);

            if (!decl) return;

            builder.push(tok.range, decl.semanticType);
         });

      return builder.build();
   }
}

You would just need to replace the custom lexer/parser there with tree-sitter, but you get the idea. I only processed the identifiers because I prefer leaning on TextMate for the unambiguous stuff like keywords and literals, but you could highlight everything with semantic tokens if you wanted. You would probably also want to look into how to take advantage of tree-sitter's incremental parsing for better performance.
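
As a rough idea of what the tree-sitter side of that swap could look like, here is a sketch that walks a syntax tree and collects identifier positions, which is the information the provider above needs before it can push tokens. The `SyntaxNodeLike` interface is an assumed subset of what the tree-sitter JavaScript bindings expose, declared locally so the snippet stands alone, and the `"identifier"` node type depends on the grammar in use.

```typescript
// Assumed subset of a tree-sitter syntax node; real bindings expose more fields.
interface Point { row: number; column: number; }
interface SyntaxNodeLike {
  type: string;
  text: string;
  startPosition: Point;
  endPosition: Point;
  children: SyntaxNodeLike[];
}

// Hypothetical output shape for this example.
interface IdentToken { line: number; startChar: number; length: number; name: string; }

// Depth-first walk that collects identifier nodes sitting on a single line.
function collectIdentifiers(root: SyntaxNodeLike): IdentToken[] {
  const out: IdentToken[] = [];
  const stack: SyntaxNodeLike[] = [root];
  while (stack.length > 0) {
    const node = stack.pop()!;
    const singleLine = node.startPosition.row === node.endPosition.row;
    if (node.type === "identifier" && singleLine) {
      out.push({
        line: node.startPosition.row,
        startChar: node.startPosition.column,
        length: node.endPosition.column - node.startPosition.column,
        name: node.text,
      });
    }
    // Push children in reverse so they are visited in source order.
    for (let i = node.children.length - 1; i >= 0; i--) {
      stack.push(node.children[i]);
    }
  }
  return out;
}
```

Deciding whether each identifier is a function, parameter, or variable would still need a scope lookup like the `findDeclOf` step in the original example.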

ghost commented 2 years ago

I am actually working on a LSP for MATLAB for parameter highlighting on vscode-textmate. This looks pretty similar!

razzeee commented 2 years ago

And like I said, you don't even need to use a language server since tree-sitter has JavaScript bindings.

Unless you want to support other editors too.

jeff-hykin commented 2 years ago

Having a tree-sitter-to-LSP tool would absolutely be the best next step forward. It doesn't solve all of the problems, but it solves the biggest one.

@SNDST00M (and others) if you're interested in messing with the code to take advantage of the LSP let me know and I'll get you access to the experimental tree sitter.

Integrating the LSP likely wouldn't be too hard, however there is a separate challenge. The tree sitter grammars are actually too powerful for the theming system. The grammars tell you everything about every token (which is awesome) but themes are way more dumb about what they color. There isn't a good generic way to map tree-sitter-information to the existing themes. Either all existing themes would break (no backward compatibility), or there needs to be a custom converter for each language, in addition to a language parser. Once those values have been converted, you can then likely send them to the LSP.

For now though, that problem can be put off. The next step would be to replace the hand-picked decorators with hand-picked LSP values. Then later develop a more general solution.
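
A hand-picked mapping of that kind could be as small as the sketch below. The table contents are illustrative guesses rather than an agreed standard: the keys follow the dotted capture-name style commonly used in tree-sitter highlight queries, and the values are standard VS Code semantic token type names. `tokenTypeForCapture` is a hypothetical helper.

```typescript
// Illustrative, hand-picked table (an assumption, not an official mapping).
// A Map is used instead of a plain object so that a capture named
// "constructor" does not accidentally hit Object.prototype.
const CAPTURE_TO_TOKEN_TYPE = new Map<string, string>([
  ["function", "function"],
  ["function.method", "method"],
  ["variable", "variable"],
  ["variable.parameter", "parameter"],
  ["type", "type"],
  ["keyword", "keyword"],
  ["string", "string"],
  ["comment", "comment"],
]);

// Resolve a capture by trying the full name first, then dropping
// trailing dotted segments until something matches.
function tokenTypeForCapture(capture: string): string | undefined {
  let name = capture;
  while (name.length > 0) {
    const hit = CAPTURE_TO_TOKEN_TYPE.get(name);
    if (hit !== undefined) return hit;
    const dot = name.lastIndexOf(".");
    if (dot < 0) return undefined; // nothing left to strip
    name = name.slice(0, dot);
  }
  return undefined;
}
```

Captures with no entry return `undefined`, which leaves those tokens to whatever the TextMate grammar already colored.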

ThePrimeagen commented 2 years ago

Does Microsoft have a license? https://regexlicensing.org/ I would prioritize this before vscode gets cited for use of a regex without a license.

dannymcgee commented 2 years ago

And like I said, you don't even need to use a language server since tree-sitter has JavaScript bindings.

Unless you want to support other editors too.

We're talking about a solution that essentially monkey-patches VS Code to support tree-sitter syntax highlighting via an extension. Maybe the Venn diagram of editors that support LSP but don't support tree-sitter highlighting is larger than I'm thinking, but for the scope of this particular problem it seems to me that the added complexity of managing a client/server IPC is not worth the trouble, at least for an initial proof-of-concept.

dberlin commented 2 years ago

Having a tree-sitter-to-LSP tool would absolutely be the best next step forward. It doesn't solve all of the problems, but it solves the biggest one.

@SNDST00M (and others) if you're interested in messing with the code to take advantage of the LSP let me know and I'll get you access to the experimental tree sitter.

Integrating the LSP likely wouldn't be too hard, however there is a separate challenge. The tree sitter grammars are actually too powerful for the theming system. The grammars tell you everything about every token (which is awesome) but themes are way more dumb about what they color. There isn't a good generic way to map tree-sitter-information to the existing themes. Either all existing themes would break (no backward compatibility), or there needs to be a custom converter for each language, in addition to a language parser. Once those values have been converted, you can then likely send them to the LSP.

I mean, for starters, you only have to construct the mapping once.

As for automatically generating the mapping, having had to do similar types of migrations (regexp based systems to real parsing based systems) there are ways to do this that work pretty well. You are right it is very hard to get rid of human intervention, but you can do a lot of the work for people.

For example,

1. For the textmate colorizers that have real test sets, you can see what the textmate grammar colorizes, and what token sequences tree-sitter is mapping those strings to, and make statistical guesses as to what you should map to what. You can then let the user confirm it, and generate the mapping.
2. For textmate colorizers that don't, you can enumerate the language the regex parses, and parse it with tree-sitter, and do the same thing. This will work, but requires more effort and will spend more CPU time. Depending how far down the rabbit hole you want to go, you can make it very efficient at enumeration, because tree-sitter should be able to tell you what starting letters/etc might possibly change the next token (if any), and so you can skip enumerating large parts of the regex language that won't change anything.
3. You can also do the first idea dynamically: have the tree-sitter grammar run in the background on the files as people use vscode, see how the colorizer colored them vs tree-sitter parse tokens, and collect the stats and send them somewhere. Use those stats of actual usage to generate the mapping.
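
The statistical-guess step can be sketched roughly as follows, assuming something else has already aligned each token's TextMate scope with the tree-sitter node type covering the same text. `AlignedSample`, `guessMapping`, and the `minShare` threshold are all hypothetical names and choices for this example.

```typescript
// Hypothetical sample: one token, as seen by both systems.
interface AlignedSample { textmateScope: string; treeSitterType: string; }

// For each tree-sitter node type, pick the TextMate scope it coincides with
// most often, but only when that scope covers at least `minShare` of the samples.
function guessMapping(samples: AlignedSample[], minShare: number = 0.8): Map<string, string> {
  // counts: node type -> (scope -> occurrences)
  const counts = new Map<string, Map<string, number>>();
  for (const s of samples) {
    const byScope = counts.get(s.treeSitterType) ?? new Map<string, number>();
    byScope.set(s.textmateScope, (byScope.get(s.textmateScope) ?? 0) + 1);
    counts.set(s.treeSitterType, byScope);
  }

  const mapping = new Map<string, string>();
  counts.forEach((byScope, nodeType) => {
    let total = 0;
    let best = "";
    let bestCount = 0;
    byScope.forEach((n, scope) => {
      total += n;
      if (n > bestCount) {
        best = scope;
        bestCount = n;
      }
    });
    // Ambiguous node types are left out so a human can decide them.
    if (bestCount / total >= minShare) mapping.set(nodeType, best);
  });
  return mapping;
}
```

Node types that fall below the threshold are exactly the ones you would show to the user for confirmation.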

jeff-hykin commented 2 years ago

I mean, for starters, you only have to construct the mapping once.

Once per language, and it needs to be maintained as the parser and language itself change. But yes, not only is it not insurmountable, but Atom actually has already done this for every language that it supports with the tree sitter.

jeff-hykin commented 2 years ago

@dberlin

As for automatically generating the mapping, having had to do similar types of migrations (regexp based systems to real parsing based systems) there are ways to do this that work pretty well

Maybe you know of a system I don't, but matter123 and I looked into this kind of automated system a bit (we already have an automated scope testing system for the C++ textmate library) and it was a bit harder than you might expect. If you do create or know of a system for this, please do share. I do work for Atom, and not only would this be great for tree sitters but it would also be very helpful for converting/updating themes.

ghost commented 2 years ago

Tree-sitter and core grammars are available as an extension: https://github.com/EvgeniyPeshkov/syntax-highlighter

The extension can be installed from the Marketplace:

Syntax Highlighter - Visual Studio Marketplace

@aeschli @fcurts @jeff-hykin @dannymcgee does this resolve the issue and/or present need(s)?

jeff-hykin commented 2 years ago

@SNDST00M it does not resolve this issue.

1. That extension must be forked or modified to add a new language. This is very different from the existing system where an extension can add a grammar for a new language or override the grammar for an existing language.
2. One of the main benefits of the tree sitter is efficiency. Right now that extension is running on top of the existing textmate engine, making it significantly less efficient than using the textmate engine by itself.
3. The mapping from tree sitter scopes to semantic tokens is very very limited compared to what current themes can do with textmate grammars. C++ has over 400 different types of tokens, in contrast that extension supports about 8 types of tokens.
4. VS Code still uses textmate scopes to make internal decisions about things such as autosuggestions. Textmate has a severe limitation that it cannot look at multiple lines to make decisions. This means certain kinds of syntax just cannot be parsed. If VS Code uses textmate with a tree sitter overlay things like intelligent suggestions could still be broken even if the colors look correct.
5. Backwards compatibility is essentially non-existent. Existing themes are not aware of the semantic names added by the extension. And even if they are, many of the scopes they take advantage of are not available in the new system so they couldn't update themselves even if they wanted to.

The extension is great, but it's not the answer. It's a demonstration of what could be.

dannymcgee commented 2 years ago

4. VS Code still uses textmate scopes to make internal decisions about things such as autosuggestions.

I could be wrong, but I don't think that's accurate. Completion suggestions are provided through their own language feature hook, with the implementation left up to the individual language extension. AFAIK, the scopes defined by a TextMate grammar are used only for the initial syntax highlighting pass and aren't even exposed to any other extension APIs. There is some sort of default completion provider that works in the absence of a language-specific one (e.g. in plain text docs), but as far as I can tell it just suggests whatever words already exist in that document.

I think the TextMate highlighting is pretty hard-wired into VS Code currently, so I don't think you could really turn it off without forking VS Code itself. There's a long-standing open issue about slow highlighting in extreme edge cases where someone suggests moving the TM highlighting off the main thread, but I think that's about as much as you can hope for in terms of performance improvements over the current setup.

Honestly, I don't think efficiency is a good argument for tree-sitter in VS Code. VS Code's TextMate highlighter is already doing incremental updates after the initial pass just like tree-sitter does, and I'm guessing (though I could be way off) that the "dumb" regex-based TM tokenizer would be faster than a tree-sitter CST parse, all else being equal.
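
To show what incremental updates mean for a line-based tokenizer, here is a toy sketch. It assumes each line's tokenization depends only on the state left by the previous line, so after an edit you re-run lines until a line's end state matches what was stored before. The two-state tokenizer and all names are invented for the example; this is not VS Code's actual implementation, and it assumes the edit did not add or remove lines.

```typescript
// Toy tokenizer state: are we inside a /* ... */ comment at the end of a line?
type State = "code" | "blockComment";

// Compute the state at the end of `line`, given the state at its start.
function endState(line: string, start: State): State {
  let state: State = start;
  let i = 0;
  while (i < line.length) {
    const two = line.slice(i, i + 2);
    if (state === "code" && two === "/*") {
      state = "blockComment";
      i += 2;
    } else if (state === "blockComment" && two === "*/") {
      state = "code";
      i += 2;
    } else {
      i++;
    }
  }
  return state;
}

// Re-run lines starting at `changedLine`, updating `endStates` in place, and stop
// as soon as a line ends in the same state as before the edit.
// Returns how many lines had to be re-tokenized.
function retokenize(lines: string[], endStates: State[], changedLine: number): number {
  let state: State = changedLine === 0 ? "code" : endStates[changedLine - 1];
  let count = 0;
  for (let i = changedLine; i < lines.length; i++) {
    state = endState(lines[i], state);
    count++;
    const unchanged = endStates[i] === state;
    endStates[i] = state;
    if (unchanged) break; // later lines cannot be affected
  }
  return count;
}
```

An edit that opens a block comment forces every following line to be redone, while an ordinary edit usually stops after one line.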

The compelling arguments for tree-sitter vs TM grammars, IMO, are the semantic accuracy and (arguably) easier authorship. But to get the best semantic accuracy, you really need to parse not just the current file but also any external modules it's importing, which is always going to be slower than a simple line-by-line regex tokenizer.

I really think the best solution, for folks who really like tree-sitter and want to use it to implement language features in VS Code, is to just write a language extension that uses a tree-sitter parser as the engine. The VS Code language feature APIs are completely unopinionated as to how you implement them. You register a provider of some feature (semantic tokens, hovers, completion suggestions, symbol references, etc.), and VS Code invokes that provider with a file ID and line/column span at the appropriate time to request the data it wants. Everything else is up to the extension author.

KamasamaK commented 2 years ago

@dannymcgee TM scopes are used to determine token types (string, comment, other), which are used in many ways including in the editor.quickSuggestions setting. You are correct that they "aren't even exposed to any other extension APIs", which is probably why Jeff said "to make internal decisions".

And to add another thing TM scopes do, they are used to determine the (embedded) language that language snippets use.
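
A rough sketch of how a coarse token type could gate quick suggestions. It assumes the type is derived by looking for `comment`, `string`, or `regex` in the scope name. The function names and the matching rule are guesses for illustration, not VS Code's actual code; only the `other`, `comments`, and `strings` keys mirror the real `editor.quickSuggestions` setting.

```typescript
type StandardTokenType = "other" | "comment" | "string" | "regex";

// Guess the coarse token type from a TextMate scope name (assumed rule).
function standardTokenType(scope: string): StandardTokenType {
  const m = /\b(comment|string|regex)\b/.exec(scope);
  return m ? (m[1] as StandardTokenType) : "other";
}

// Same keys as the editor.quickSuggestions setting.
interface QuickSuggestions { other: boolean; comments: boolean; strings: boolean; }

// Decide whether suggestions should pop up while typing at a position
// whose innermost scope is `scope`. Regex tokens are treated as "other" here.
function quickSuggestionsEnabled(scope: string, cfg: QuickSuggestions): boolean {
  switch (standardTokenType(scope)) {
    case "comment":
      return cfg.comments;
    case "string":
      return cfg.strings;
    default:
      return cfg.other;
  }
}
```

If the grammar mis-scopes a region, for example by treating code after an unterminated string as string content, a check like this gives the wrong answer no matter what an overlay paints on top.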

ebkgne commented 2 years ago

2. One of the main benefits of the tree sitter is efficiency. Right now that extension is running on top of the existing textmate engine, making it significantly less efficient than using the textmate engine by itself.

@jeff-hykin I have installed that add-on to use instead of "c_cpp:Enhanced Colorization", and the latency in comment colorization is hugely reduced in my case.

Somebody said that our problem here (https://github.com/microsoft/vscode/issues/64681) is caused by TM, but then I think it might be something else?

fcurts commented 2 years ago

The compelling arguments for tree-sitter vs TM grammars, IMO, are the semantic accuracy and (arguably) easier authorship. But to get the best semantic accuracy, you really need to parse not just the current file but also any external modules it's importing, which is always going to be slower than a simple line-by-line regex tokenizer.

It's first and foremost about syntactic accuracy, which does not require looking at other files. It's impossible to do accurate syntactic syntax highlighting (and code folding, etc.) with TM grammars. Just take a look at the more than 700 (!) issues filed against the TypeScript TM grammar, the most complex regex spaghetti monster I've seen in my life. It's a never-ending "you fixed something here, but now it breaks when I add a newline there" nightmare. And that's after they added their own preprocessor to compose TM grammar fragments!

TM grammars need to die. It's a shame that after all these years, a leading editor such as VSCode offers nothing better. It's why we retired our VSCode plugin years ago and still haven't brought it back. In the meantime, we shipped plugins for Atom, Neovim, Emacs and Asciidoctor, all based on a shared tree-sitter grammar that took a fraction of the time to develop and maintain.

ghost commented 2 years ago

It's really annoying handling multi-line meta scopes like this:

classdef T15MultiLineClassdefHeader ... "it's a beautiful day to screw syntax up" - stinkmeaner
   < OtherClass % <- meant to be inherited class

It's almost like Textmate took pains to prevent people from supporting this kind of syntax
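
A small sketch of why that header is awkward for a matcher that sees one line at a time. The per-line function never sees the name and the `<` together, while a version that first joins MATLAB `...` continuation lines does. Both function names are invented, and the continuation handling is deliberately naive: it would also rewrite `...` inside string literals.

```typescript
// Header and base class must sit on the same line for this pattern to match.
const headerWithBase = /^classdef\s+(\w+)\s*<\s*(\w+)/;

// Line-at-a-time matching, the way a single regex rule sees the text.
function baseClassPerLine(source: string): string | undefined {
  const lines = source.split("\n");
  for (let i = 0; i < lines.length; i++) {
    const m = headerWithBase.exec(lines[i]);
    if (m) return m[2];
  }
  return undefined;
}

// Whole-text matching: drop each "..." continuation (and the comment text
// after it) together with the line break, then apply the same pattern.
function baseClassWholeText(source: string): string | undefined {
  const joined = source.replace(/\.\.\.[^\n]*\n/g, " ");
  return baseClassPerLine(joined);
}
```

On the snippet above, the per-line version finds no base class and the whole-text version finds `OtherClass`.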

dannymcgee commented 2 years ago

Just take a look at the more than 700 (!) issues filed against the TypeScript TM grammar

That number literally includes every bug ever filed against it. That repo is 5 years old. Filter by label:bug -label:duplicate and you get 87 actual bugs, 86 of which are closed. Less than 20 bugs per year on a grammar for a language that's under active development is really not the nightmare scenario you're making it out to be.

fcurts commented 2 years ago

Filter by label:bug -label:duplicate and you get 87 actual bugs, 86 of which are closed.

I'm not sure "label:bug" is a reliable indicator. On the first issues page, only a single issue is marked with this label, yet almost every issue reports a highlighting bug.

Regardless, anyone who cares about correct syntax highlighting and has maintained a complex TM grammar (I have) will tell you that it's a nightmare and impossible to get right. It's the sole reason we retired our VSCode plugin.

gjsjohnmurray commented 2 years ago

I was interested to discover recently that some of the VS Code core team are starting to leverage Tree-sitter

https://github.com/microsoft/vscode-anycode

jasonwilliams commented 2 years ago

wow over 350 upvotes for this.

That extension looks interesting. @jrieken is there any possibility of that work landing in VSCode core at some point in the future? I’m not sure it solves the syntax highlighting issue discussed here.

I know @alexdima was trying to do a refactor but said we were hitting our limits with the current system.

it seems like the wasm bindings for tree sitter will be useful for code

microsoft / vscode

Support syntax highlighting with tree-sitter #50140