microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License

Support syntax highlighting with tree-sitter #50140

Open fcurts opened 6 years ago

fcurts commented 6 years ago

Please consider supporting tree-sitter grammars in addition to TextMate grammars. TextMate grammars are incredibly difficult to author and maintain, and impossible to get right. The over 500 (!) issues reported against https://github.com/Microsoft/TypeScript-TmLanguage are living proof of this.

This presentation explains the motivation and goals for tree-sitter: https://www.youtube.com/watch?v=a1rC79DHpmY

tree-sitter already ships with Atom and is also used on github.com.

aeschli commented 6 years ago

tree-sitter is cool technology, and we have our eyes on it. I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

If you already have experience with specific grammars, e.g. the TypeScript grammar or the C grammar, and think they are superior to the TextMate grammars, let us know. That would be the criterion for us to invest.

Kroc commented 6 years ago

This may help in the future with the whole 'embedding one language in another', which is an enfant terrible when it comes to TextMate grammars.

omniomi commented 6 years ago

There's also a request in #5408 (open since Apr 2016) for .sublime-syntax support, which would also be a step up from .tmLanguage.

While tree-sitter has an awesome concept I can't say the idea of writing grammar in JavaScript is all that appealing.

fcurts commented 6 years ago

@omniomi tree-sitter also supports writing grammars in pure JSON if that's what you prefer. The main & dramatic advantage of tree-sitter is that it's a full parsing system and not an ad-hoc, underspecified, horrifyingly complex yet extremely limited regex contraption.

ahuertabhg commented 5 years ago

Integrating tree-sitter would help solve this issue https://github.com/OmniSharp/omnisharp-vscode/issues/2461

sean-mcmanus commented 5 years ago

@aeschli Atom has switched to tree-sitter for C++ and is no longer fixing issues with the TextMate grammar: https://github.com/atom/language-c/issues/232#issuecomment-426018195 . Please advise on how we should proceed to improve the C++ syntax highlighting experience.

maxbrunsfeld commented 5 years ago

:wave: Just to reiterate - the Atom team doesn't intend to disrupt other apps like VSCode that are using modules like language-c. We will definitely continue to accept good PRs that update the text-mate grammar.

The reason that we've been closing issues like that is just to be explicit about the fact that our team won't be prioritizing work on them in the future, since Atom is moving away from text-mate grammars.

bobbrow commented 5 years ago

@sean-mcmanus we already have our own syntax highlighting stuff (shared with Visual Studio), but haven't been able to use it because we are waiting on an API that lets us turn off tmLanguage and provide the coloring ourselves: #585. Moving to tree-sitter is only relevant to us so long as #585 is incomplete.

bklebe commented 5 years ago

Tree-sitter is extensible for other programming languages, and in particular already supports Rust and Ruby as well. Are the Visual Studio APIs ready to be extended with new language support in those ways?

Geobert commented 5 years ago

I'm wondering if tree-sitter can solve this https://github.com/Microsoft/vscode/issues/51157

Stanzilla commented 5 years ago

@bobbrow is that a final decision? It would have been nice to share the code with Atom here.

Astrantia commented 5 years ago

No plans for this in 2018?

github-yxb commented 5 years ago

It's going to be 2019!!

sean-mcmanus commented 5 years ago

Yeah, Atom 1.33 ships with tree sitter and most of the C/C++ colorization bugs have been fixed with it -- the Atom/language-c team is closing the non-tree sitter bugs.

fcurts commented 5 years ago

> I fully agree with you that TextMate grammars are challenging to implement and have limitations, but it's always a lot of work to create and maintain a grammar. That will not be different with Tree-Sitter.

@aeschli In the meantime, I re-implemented my TextMate grammar with tree-sitter because the former proved unmaintainable (templated regexes up to 400 characters long, etc.). Developing the tree-sitter grammar and highlighter from scratch took three days, compared to three weeks for the TextMate grammar. The new highlighter works better and is dramatically easier to maintain. I wish I could use it in VS Code as well.

Geobert commented 5 years ago

Still no plan for this? At least an exploration?

jeff-hykin commented 5 years ago

@fcurts Could I see your re-implementation? I currently maintain the TextMate C/C++ grammar. It's actually easy to maintain now that it's written in Ruby with actual variables and functions instead of raw regex, but that doesn't fix the inherent limitations of the TextMate engine.

I'd love to figure out how to convert it to a tree-sitter syntax, since the Atom tree-sitter C++ syntax still isn't complete. To be honest, even after spending a while looking at it, I have no idea how to use it or how it works. It's missing functionality like lambda support, compile attributes, templated function calls, and macro calls. But it's also missing basic stuff like:

// a comment \
        still part of the comment

I went and looked at ~the source code here, but it's pretty unintuitive 😕. It's >6000 lines long, which is double the size of the TextMate grammar.~ Sadly, it also already has an unfixable issue posted on it :/

aClass thing = aClass(); // initializers can't be highlighted the same
aClass thing = aClass{};

I still think tree-sitter is awesome, but it's definitely not going to be a quick win, and I could use help with the upgrade process.

maxbrunsfeld commented 5 years ago

> I went and looked at the source code here, but its pretty unintuitive and its >6000 lines long. Which is double the size of the TextMate grammar.

No, the source for the Tree-sitter C++ grammar is here. It's 669 lines of JavaScript (not counting the C grammar, which it inherits from).

> Its missing functionality like lambda support, compile attributes, templated function calls, and macro calls.

It does fully support lambdas and template function calls, as far as I know. You can see the test suite for these features here. I do think that compile attributes are unimplemented.

jeff-hykin commented 5 years ago

That's great to hear, and thanks for correcting me! I'll try looking into the source to get a better understanding.

Sorry about the bad info. What misled me about the lambdas is that Atom still has them incorrectly marked (although they're not visually messed up): the -> is marked as a member access. Templated function calls are also not colored like functions (they have no color / theme scopes). After seeing your tests, though, this is definitely a usage issue and not a tree-sitter grammar issue. 👍

That test suite is nice. The nested template example right after the lambda is something I've wanted to solve in TextMate for months (but can't).

dberlin commented 5 years ago

FWIW: To make this more annoying for folks to choose, I have patches to add incremental parsing/lexing support to ANTLR4 (both the main Java runtime and the optimized TypeScript runtime port).

Incremental parsing has been submitted to both repositories (the main antlr4 one and the optimized TS one) at this point; incremental lexing has not, but will be in the next week or two.

I already use the incremental parser/lexers in my vscode extensions.

The lexing uses the same set of algorithms tree-sitter uses; the incremental parsing is actually simpler because LL is top-down.

Speed-wise, it can also relex/reparse on every keystroke with no issue. I mention it since ANTLR has a large collection of language grammars as well.

chfritz commented 5 years ago

@aeschli another benefit of tree-sitter besides syntax highlighting is that it lends itself to a much better indentation logic (https://github.com/atom/atom/pull/18321). Just take a look at this example file and compare the indentation there with what VSCode will currently produce. For example:

With tree-sitter :

foo( 2,
  {
    sd,
    sdf
  });

foo( 2, {
  sd,
  sdf
});

vscode:

foo( 2,
  {
    sd,
    sdf
  });

  foo( 2, {
    sd,
    sdf
  });

The second call to foo is at the root level, so why is it indented? The answer is quite simple: an inductive indentation approach that considers only the previous line to determine the indentation of the current line cannot handle multiple scopes that open on the same line but close on different ones, which is what these examples show. There are ways to deal with that, but they are not flexible enough to make the indentation look the way people expect.
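The difference can be sketched in a few lines of JavaScript. Both functions below are hypothetical illustrations, not the actual VS Code or Atom algorithms: `indentFromPreviousLine` looks only at the preceding line, while `indentFromNesting` remembers where every open bracket was opened, which is a rough stand-in for what a syntax tree provides. The sketch assumes no brackets appear inside strings or comments.

```javascript
// Inductive approach: a line's indent depends only on the previous line.
function indentFromPreviousLine(lines) {
  let prevLevel = 0;
  let prevOpens = false;
  return lines.map((raw) => {
    const text = raw.trim();
    if (text === "") return ""; // blank lines do not change the state
    // One extra level if the previous line left any bracket open.
    let level = prevLevel + (prevOpens ? 1 : 0);
    // One level less if this line starts with a closing bracket.
    if (/^[)\]}]/.test(text)) level = Math.max(level - 1, 0);
    const opens = (text.match(/[(\[{]/g) || []).length;
    const closes = (text.match(/[)\]}]/g) || []).length;
    prevOpens = opens > closes;
    prevLevel = level;
    return "  ".repeat(level) + text;
  });
}

// Nesting-aware approach: track every unclosed bracket together with the
// indent level of the line that opened it.
function indentFromNesting(lines) {
  const open = []; // indent level of the opening line, one entry per bracket
  return lines.map((raw) => {
    const text = raw.trim();
    if (text === "") return "";
    const enclosing = open.length ? open[open.length - 1] : null;
    let level;
    if (enclosing === null) level = 0;
    else if (/^[)\]}]/.test(text)) level = enclosing; // align with the opener's line
    else level = enclosing + 1;
    for (const ch of text) {
      if ("([{".includes(ch)) open.push(level);
      else if (")]}".includes(ch)) open.pop();
    }
    return "  ".repeat(level) + text;
  });
}

const sample = [
  "foo( 2,", "{", "sd,", "sdf", "});",
  "",
  "foo( 2, {", "sd,", "sdf", "});",
];
console.log(indentFromPreviousLine(sample).join("\n"));
console.log(indentFromNesting(sample).join("\n"));
```

On this input, the first function leaves the second `foo` call indented by one level, like the "vscode" listing above, because the `});` line closes two scopes but can only undo one level. The second function returns to the root level, like the tree-sitter listing.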

If you add tree-sitter to vscode then I'm happy to try and port https://github.com/atom/atom/pull/18321 to vscode.

georgewfraser commented 5 years ago

I recently made an extension that adds support for tree-sitter by replacing the builtin grammar with a simplified grammar that just colors literals and keywords, and then using the setDecorations API to apply tree-sitter based coloring to the tricky parts:

https://marketplace.visualstudio.com/items?itemName=georgewfraser.vscode-tree-sitter

For example, this is what Go looks like before and after installing the extension:

[Screenshot: Go code before and after installing the extension]

It currently supports

and it's straightforward to add any of the available tree-sitter languages.
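As a rough sketch of the decoration-based approach described above (not the extension's actual code), the function below groups syntax-node ranges by color so that each color can be applied with one decoration type. The color table, the sample nodes, and the function name are assumptions made for illustration; the node shape mimics tree-sitter syntax nodes.

```javascript
// Hypothetical mapping from syntax-node type to color. A real extension
// would more likely derive colors from the active theme.
const colorByNodeType = {
  type_identifier: "#4ec9b0",
  field_identifier: "#9cdcfe",
};

// Group node ranges by color. Node types without an entry are skipped,
// which leaves them to the simplified built-in grammar.
function groupRangesByColor(nodes, colors) {
  const groups = new Map();
  for (const node of nodes) {
    const color = colors[node.type];
    if (color === undefined) continue;
    if (!groups.has(color)) groups.set(color, []);
    groups.get(color).push({ start: node.startPosition, end: node.endPosition });
  }
  return groups;
}

// Inside an extension, each group would then be applied roughly like this:
//   const decoration = vscode.window.createTextEditorDecorationType({ color });
//   editor.setDecorations(decoration, ranges.map((r) =>
//     new vscode.Range(r.start.row, r.start.column, r.end.row, r.end.column)));

// Hand-written stand-ins for nodes that a parser would produce.
const sampleNodes = [
  { type: "type_identifier", startPosition: { row: 0, column: 5 }, endPosition: { row: 0, column: 10 } },
  { type: "field_identifier", startPosition: { row: 1, column: 1 }, endPosition: { row: 1, column: 5 } },
  { type: "identifier", startPosition: { row: 2, column: 0 }, endPosition: { row: 2, column: 3 } },
];
const groups = groupRangesByColor(sampleNodes, colorByNodeType);
console.log(groups.size); // 2
```

Keeping the grouping separate from the editor calls means it can be tested without a running VS Code instance.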

razzeee commented 5 years ago

Thanks for doing this, I really like the idea. I looked at doing that for Elm and decided against it for the moment. I think it would be better to have it in my Elm plugin, and the best approach would be to actually have it in the language server, once this gets merged: https://github.com/Microsoft/language-server-protocol/issues/513

razzeee commented 5 years ago

@georgewfraser does this approach also work for embedded Markdown code? For example, when you get a completion and it shows some docs that have code embedded in them?

dannymcgee commented 5 years ago

> While tree-sitter has an awesome concept I can't say the idea of writing grammar in JavaScript is all that appealing.

I know this is an old comment, but I don't really understand this logic. I've written (or attempted to write) several personal-use grammars for VS Code, and everything about the tmLanguage.json system is hopelessly awkward and unintuitive.

  1. There's a steep learning curve to get even a basic grasp on how TextMate grammars are structured
  2. It uses Oniguruma for its regex library, making the learning curve even steeper
  3. Since the regexes need to be written as strings, extra escape characters are needed, making them more difficult to read and write
  4. JavaScript is the single most widely used language on GitHub, and writing a grammar for Tree Sitter is relatively intuitive, so not only is it a vastly more powerful solution, but the barrier to entry for potential contributors is much lower
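Point 3 can be illustrated with a small JavaScript sketch. The pattern is a made-up example rather than one taken from a real grammar; `match` is the key a TextMate rule uses for its regex.

```javascript
// A regex literal, as it could be written in a JavaScript-based grammar.
const asLiteral = /\b(?:if|else)\b\s*\(/;

// In a .tmLanguage.json file the same pattern has to be a JSON string,
// so every backslash must be doubled.
const asJsonField = JSON.stringify({ match: asLiteral.source });

console.log(asLiteral.source); // \b(?:if|else)\b\s*\(
console.log(asJsonField);      // {"match":"\\b(?:if|else)\\b\\s*\\("}
```

The doubled backslashes add nothing to the pattern itself; they exist only to satisfy JSON string syntax.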

I realize that semantic highlighting is on VS Code's roadmap, so this may be a moot point entirely (although it has been "on the roadmap" for years with no movement), but I am really not seeing any downsides to "writing grammar in JavaScript" vs. VS Code's current implementation, which is frankly a nightmare.

jeff-hykin commented 5 years ago

I think what @georgewfraser has made is great in the sense that we don't need to wait on the VS Code core team to start work on a VS Code tree parser. Sure, it being merely an extension isn't ideal, but extensions can do almost everything core does, which is more than enough to kickstart the work. The more support we build for it, the faster it will get merged into core, and I applaud @georgewfraser for effectively taking the first step 👏👏👏

EvgeniyPeshkov commented 5 years ago

Hello everyone. I've developed and published a syntax highlighting extension based on Tree-Sitter. It provides a universal syntax coloring engine for almost any programming language (currently, C and C++ are supported OOTB). It's very easy to add support for a new language. I'm planning to write a how-to in the next couple of days, but you can figure it out from the source code; it's very simple and straightforward. Contributions are welcome. I've been using the extension myself for a month, so I suppose it's ready for public use. At the very least, it can be useful until VSCode core provides a stronger syntax parser.

You can install it from the VSCode Marketplace, or download the .vsix package from the GitHub page and install it manually. Please note that the extension published in the VS Code Marketplace only works on Windows x64. For other operating systems, please download the pre-compiled .vsix package. This will be fixed in one of the next updates. Alternatively, you can build the extension from source.

I've noticed that @georgewfraser published his implementation a couple of days ago. I suppose we had the same thoughts. I'm very glad that high-quality alternatives to the limited TextMate grammars are beginning to appear. Thank you, @georgewfraser.

Stanzilla commented 5 years ago

You guys should join forces :)

jeff-hykin commented 5 years ago

As an update to anyone interested in these extensions: manual platform-specific .vsix installation is no longer required. @georgewfraser, @EvgeniyPeshkov, and I got both of the extensions running with WebAssembly, so now they work out of the box on basically any platform.

NQDM-paul-sinclair commented 5 years ago

Does that extension fix this issue https://github.com/OmniSharp/omnisharp-vscode/issues/2461 ?

razzeee commented 5 years ago

It probably would, but the tree-sitter implementation for C# is incomplete, and I think it doesn't know what a define is right now. https://github.com/tree-sitter/tree-sitter-c-sharp

Geobert commented 5 years ago

Even if these extensions provide better syntax highlighting, it's not ideal because we can't use them with other extensions that use the same Decoration API (see https://github.com/microsoft/vscode/issues/74692)

NQDM-paul-sinclair commented 5 years ago

Well that's disappointing, VScode wouldn't be bad for C# development if MS would properly support the language features. Will continue to use Visual Studio for C# projects. Shame really as VScode would work much faster for Unity projects and if they improved the debugging features on C# would make it the ideal replacement for VS.

I'm just surprised it hasn't been more of an issue.

jeff-hykin commented 5 years ago

To continue making progress on this, we can use the repo here to track and discuss the separate issues blocking the tree sitter implementation. I've added issues/labels for all of the known problems including the one @Geobert just pointed out. Feel free to ask questions, make feature requests, and ask for time estimates, or subscribe if you want incremental updates. I'll keep the readme updated with the general progress/timeline of the different extension implementations.

That way the extension contributors can have a central place for tracking/fixing issues, and this thread won't become bloated with every possible topic/question related to the tree-sitter. VS Code contributors can also use it to see what VS Code issues are upstream/blocking the tree sitter.

If there are any major breakthroughs in any of the extensions (or core), we'll make sure to post on this thread. Right now the largest challenge is getting generic long-term support for themes, with the secondary issue of fast colorization. There will be lots of internal progress this month, but it will likely take a month before the next major announcement.

atomiks commented 5 years ago

Incomplete list of features of a perfect syntax highlighter for me

Aside from the obvious like string, comment, keyword, there needs to be a level of granularity for me, and of course semantic recognition.

These are all in the context of JavaScript specifically.

These are possible in TextMate grammar, but method-call isn't added yet

// declaration
function func() {}
// call
func();
// method-call
obj.func();

Requires semantic highlighting

let outer = 0;

function func(a, b) {
  let inner = 1;
  // `a` and `b` should have a special parameter scope, both in the def 
  // and when using them
  // `inner` and `outer` are plain variables
  return a + b + inner - outer;
}

Possible in TextMate, but it was incomplete and was recently removed, I'd like to manually define these if possible

console;
window;
document;
this;
setTimeout;
requestAnimationFrame;
arr.slice();
document.querySelector();

Requires semantic highlighting

const objectLiteral = {}; // {} should not have a `punctuation` scope
const arrayLiteral = []; // [] should not have a `punctuation` scope

// But they should have `punctuation` scope here
{ /*  inside a block */ }
array[0] = 2; 

I think this is possible in TextMate, I believe this feature may exist already, but sometimes was buggy due to built-in DOM scopes/names

// `obj` should have a "top-level" object scope
// `one` and `two` should have a "sub-object" scope
// `three` should have a last-property scope
obj.one.two.three;

// Also object literal def properties have their own scope
obj = {
  prop: true // `prop` has its own scope
};

Requires semantic highlighting

class Hello {}
Hello; // should be same as the class def

// Primitives constants can (optionally) have their related scope
const number = 0;
number; // should have `number` scope along with regular `variable / constant` scope

// Constants also scoped when declaring & using them
const CONSTANT = '';

Possible in TextMate, but not every keyword has its own scope and it gets grouped with other ones in the current language grammar

// `import` and `from` should have a particular module keyword scope
import {module} from 'pkg';

// `function` and `return` should have their own scopes if they want as well
function x() {
  return true;
}

The current TextMate implementation already has a lot of these nice features, and when Atom switched to their new tree-sitter syntax, it lost parts of these features. So while it gained nice features, it also lost some.

Many people don't realize these particular granular features can be important (or realize they should exist), but they are important to me. So whatever new implementation gets added, please make sure these are all possible in their respective languages 👌

flowchartsman commented 4 years ago

As mentioned by @sean-mcmanus here and here, the dependency on the now-unmaintained Atom TextMate grammars is preventing issues from being addressed in the display of C/C++. You can add Go to the list now, too. The new %w format verb is not highlighted, and even the simple change necessary to fix this can't go in because neither side is willing to make the change. This is only going to get worse.

matter123 commented 4 years ago

The C/C++ grammar is now being maintained at https://github.com/jeff-hykin/cpp-textmate-grammar/ . If you have an issue with the syntax highlighting, please file it there.

flowchartsman commented 4 years ago

I am not talking about C/C++, I'm talking about Go and, presumably, any other language for which the highlight grammar is generated from an unmaintained, frozen source.

jeff-hykin commented 4 years ago

@flowchartsman clone the Atom repo and add the fix to it. That's the beauty of open source code. You can even make an extension that instantly applies the fix so that you don't have to wait for official VS Code support to get the benefits. I created a repo here, fully set up for publishing the Go syntax, as an example. You're welcome to create issues on it if it's unclear how it works.

If there is a better-maintained version of the TextMate syntax for Go, I'm confident the VS Code team will switch to it. Someone just needs to take the initiative to create or suggest that better-maintained repo. @matter123 and I did it for C++, which is a good example of it getting better instead of worse. Matter123 and I have also created, and are working on documenting/refining, tools that make it easier for others to do the same.

I don't think Atom or VS Code teams are being stubborn sticking to frozen code. Someone just needs to get their hands dirty and implement/publish the fix themself.

sean-mcmanus commented 4 years ago

There's a pull request at the Go extension for adding tree sitter support, but it's been sitting around since June: https://github.com/microsoft/vscode-go/pull/2555 . @ramya-rao-a What's the plan for improving syntax colorization for Go?

flowchartsman commented 4 years ago

> I don't think Atom or VS Code teams are being stubborn sticking to frozen code. Someone just needs to get their hands dirty and implement/publish the fix themself.

@jeff-hykin that kind of kicks the can on the main thrust of the issue here (multi-language dependence on stale atom code and difficult-to-maintain grammars), but I'll take it for now. Seems downright neglectful if I don't fork/PR now. I'll leave this issue to proceed on its own for now.

fcurts commented 4 years ago

Any updates?

We'd love to support VSCode again, but bringing back our TextMate grammar is not an option. Our tree-sitter grammar has proven to be far easier to maintain, and the highlighting is much better (in Atom).

Stanzilla commented 4 years ago

@fcurts https://github.com/microsoft/vscode/issues/77140

kolya-ay commented 4 years ago

If the VSCode team doesn't want to rely on Atom's original C-based tree-sitter implementation (it's easy to imagine many reasons not to use native Node modules in the core), it might be worth considering Lezer, a JS-based implementation of the idea by Marijn Haverbeke. The authorship almost guarantees great code quality and maintainability. As a benefit, VSCode (Monaco Editor?) could share grammars with the upcoming CodeMirror version; such grammars will, without any doubt, be written for every possible language.

razzeee commented 4 years ago

There are also docs for the new (reworked) tree sitter highlighting now https://tree-sitter.github.io/tree-sitter/syntax-highlighting

adrijshikhar commented 4 years ago

Hey everyone, I am a newbie to VS Code extension development, and I wanted to learn how syntax highlighting works in a VS Code extension. To learn, I started a personal project that uses a TextMate grammar to highlight JavaScript. Over time, it grew complex. From this thread, I learned about tree-sitter. So can anybody help me or suggest good documentation on how to port my existing syntax highlighting extension from TextMate grammar to tree-sitter, or if anyone has implemented tree-sitter syntax highlighting for javascript, please reply to this. I could really use some help. Thanks.

michaelblyons commented 4 years ago

> or if anyone has implemented tree-sitter syntax highlighting for javascript, please reply to this.

You seem like a nice guy, so I'll try not to make this too harsh: The homepage of Tree Sitter has a link to the implementation of JavaScript by TS's original creator. Good luck, and maybe hunt a little harder next time you have a question like this one.

jeff-hykin commented 4 years ago

@adrijshikhar This isn't really the right thread for that kind of question; a better place would be on the tree-sitter repo here.

What is relevant here is that if you port your syntax from TextMate to Tree Sitter, it won't be natively supported in VS Code, and the non-native support is still very rough. Tree Sitter is awesome, so I highly recommend learning it. However, if you want something working in VS Code, another guy and I have been working for more than a year on a library for making maintainable TextMate grammars. The library and documentation were finished recently, and I've been working on creating a tutorial. I'll likely publish the library and the tutorial this winter (2020) on Medium under "Make a TextMate grammar (without wanting to kill yourself)".

DrSensor commented 4 years ago

Has anyone on the vscode team explored using the syntect engine? Converting a TextMate grammar into a Sublime syntax definition is pretty straightforward.

jeff-hykin commented 4 years ago

~The conversion is straightforward, but the Sublime engine still has the vast majority of the same problems.~

Actually, it is notably more powerful than TextMate as per https://github.com/sublimehq/sublime_text/issues/2241 (thanks @michaelblyons for pointing that out). It still has some limitations, notably:

  1. It is much slower than TreeSitter, especially for single-line parsing
  2. It still uses the very-broken TextMate scope selectors instead of the more powerful scm queries

That said, it would be much easier to implement.