feature: extending the typescript tree-sitter grammar to enable more advanced and contextual parsing

AlbertMarashi commented 2 months ago

Did you check the tree-sitter docs?

[X] I have read all the tree-sitter docs if it relates to using the parser

Is your feature request related to a problem? Please describe.

Currently, we have a few places where language injections will fail due to the nature of how tree-sitter works.

As far as I am aware, it does not seem possible to have tree-sitter treat a specific injection as a subnode within a sublanguage.

Let's take the following example:

<script generics="T extends string">

If we treat the attribute_value inside of the generics attribute as typescript, this will error in the typescript parser, since T extends string is not a valid top-level typescript statement.

This occurs in various other locations.

For example, we will want to treat the function in here as a function definition

{#snippet foo({bar})}

or likewise here as a function call expression

{@render foo({ bar: "baz" })}

This would bring the svelte tree-sitter more inline with how the tsx treesitter extends the typescript language definition to become a more precise and accurate parser that removes some of the complications around expression parsing.

Describe the solution you'd like

Per @savetheclocktower's suggestions (https://github.com/zed-industries/zed/pull/17529#issuecomment-2347860071), there are a few ways to implement support this.

Formally extend or reference the JS and TS grammars within tree-sitter-svelte (which would vastly increase the size of the parser, I imagine, but would allow for very granular usage of arbitrary rules)

Create one-off grammars based on the JS and TS parsers (respectively) that include just the subset of each parser needed to parse solely inside of a function arguments context (less coupled than the option above, but no less wasteful)

Inject the existing JS and TS grammars into these ranges and find a way to re-scope the ranges selectively (“if you're in a svelte snippet injection, scope this as function.definition even though it's a function call,” etc.; hacky and would involve lots of compromises)

I'm not so sure whether something like 3. is possible with treesitter unfortunately, therefore I think going with 1. will be best.

Describe alternatives you've considered

No response

Additional context

savetheclocktower commented 2 months ago

I'll keep typing my way through this until I figure out a better idea, because I don't even like any of the options I proposed. Option 1 has the best chance of shipping soon, but it couples things that shouldn't be coupled.

Things that would ideally be highlighted by the editor's separate TS/JS Tree-sitter parsers:

SCRIPT tags

Things that would be highlighted by coupled, special-purpose TS/JS rules pulled into tree-sitter-svelte:

<script generic="T extends string[]">
{#snippet foo({ bar, baz })}
Possibly the condition blocks of #if, :else if, etc.

Things that are in between, but are almost certainly better off being highlighted by the editor’s separate TS/JS parsers:

Expression tags
Arbitrary JavaScript blocks in @render, @each, @await

The painful part here is:

A user will expect all the JS/TS highlights to behave similarly to one another…
…but we have to pick particular versions of tree-sitter-javascript and tree-sitter-typescript that may not agree with what the editor has installed…
…and whoever writes the Svelte integration for a given editor is in charge of making sure that the highlighting queries for (JavaScript|TypeScript)-fragments-in-Svelte agree as much as possible with that editor's highlighting queries for JavaScript and TypeScript proper

savetheclocktower commented 2 months ago

Let's take the snippet example:

{#snippet foo({ bar, baz })}
{/snippet}

How much of this should be owned by a JavaScript/TypeScript injection?

One option is foo({ bar, baz }). In isolation, this gets interpreted as a function call with two shorthand properties. But that's wrong; what we want is to treat it like the foo({ bar, baz }) in…

function foo({ bar, baz }) { /* ... */ }

…in almost every way, since it's equivalent to a function definition. We want foo to be recognized as a major document symbol and bar and baz to be treated like function parameter definitions. But there's no good way to do this with injections — unless you adopt option 2 from the ticket description, but that's still a major headache.

Another option is what this grammar currently does: it parses as much as can be inferred by tree-sitter-svelte without it having to become a parser of arbitrary JS. It parses the foo, "(", and ")" as their own nodes, and leaves only the { bar, baz } as svelte_raw_text to be parsed by the injection. (That example would be trivial to parse on its own, but it's better to leave it to a JS parser, since each parameter can have an arbitrary default value.)

This results in worse injection parsing. That's because the bare code { foo, bar } in JS is not treated like an object literal — top-level braces are interpreted as statement blocks. JS takes this to mean that you are referencing two variables and doing nothing with them. If these parameters have default values, they get highlighted a bit better, but still not quite accurately.

The upside is that more of the structure of the snippet can be seen by tree-sitter-svelte. I went with this option because it's the only practical way to get snippet names to be recognized as document symbols. I don't think this is implemented in Zed yet, but in tags.scm this is expressed as…

(snippet_statement
  (snippet_start
    (snippet_name) @name)
) @definition.snippet

…and allows foo to show up in a symbol list in Pulsar:

But you're not obligated to inject only into svelte_raw_text! A third option is to keep the structure of option 2, but write the injection in such a way that it also includes the ( and ) adjacent to the svelte_raw_text. This at least makes the JS recognize bar and baz as shorthand object literal properties. (Still wrong, but less so.)

This is such an awkward problem because it breaks some assumptions of injections. A typical injection is saying, “hey, come parse this thing, because you know how to parse it and I don't.” All you know about the thing is that it's Language X, and you don't know how to parse it.

This is us saying “hey, come parse this thing, but parse it as though you're in the following situation…” — and any way of describing that situation involves knowing implementation details of the parser. My ideal feature here would be a way of desginating an arbitrary “entrance rule” for an injection — in this case, it'd be formal_parameters instead of program — but now there's a translucent window in what used to be a black box.

I'm honestly curious about how the Svelte folks want editors to solve this. I hope they've thought about it at least a little bit.

AlbertMarashi commented 2 months ago

You're right, this isn't a simple issue to solve, even if we go with option 1. we will still need to replicate the typescript highlights.scm, which I'd ideally like to avoid for the reasons related to maintinance overhead.

Perhaps my issue in tree-sitter may address it

https://github.com/tree-sitter/tree-sitter/issues/3625

clason commented 2 months ago

I said it before, I'll say it again: Tree-sitter is simply a poor fit for templating languages, and you'll be fighting the tool all the way. Since you're coming from Zed, you may have Max's ear so you could try to explain (very generally!) what is lacking from tree-sitter; maybe he'll see a way around it.

(For the record, 2. seems to be the most popular option -- see glimmer-javascript etc.)

AlbertMarashi commented 2 months ago

@savetheclocktower just linking this issue comment here

Currently, typescript thinks that this is a function call expression, so the child is being highlighted as a function.

https://github.com/zed-industries/zed/issues/18033#issuecomment-2359383981

theIbraDev commented 1 month ago

So if i get this correctly, there's no way to make a query for functions, classes, comments etc inside script tags? Was just trying to make some queries for those but didn't understand why nodes under (raw_text) & (svelte_raw_text) were invalid. This issue is the first one to show up on google when looking this up.

Kinda sucks because treesitter queries is a massive part of my workflow.

AlbertMarashi commented 1 month ago

So if i get this correctly, there's no way to make a query for functions, classes, comments etc inside script tags? Was just trying to make some queries for those but didn't understand why nodes under (raw_text) & (svelte_raw_text) were invalid. This issue is the first one to show up on google when looking this up.

Kinda sucks because treesitter queries is a massive part of my workflow.

No I believe that should be possible, however you would need to target the typescript injected syntax instead of svelte.

tree-sitter-grammars / tree-sitter-svelte