Rewrite the grammar - Githubissues

amaanq commented 1 month ago

Reduces state count (tbh, I had it greatly reduced to around ~7-8k, but a few grievances grew it back to nearly what it was before, ~12k, further optimizations can come once I dig more later)
Fully implements C# 12 and some of 13, including primary constructors, ref readonly parameters, and aliasing any type.
Fixes parsing bugs related to interpolated strings when more than one $ is used by delegating this to the external scanner for statefulness by leveraging a stack of the current interpolation's info (yes, nested interpolations are handled nicely, no, it was not fun at times to implement)
Adds new bindings for other languages that upstream supports, tidied up manifests
CI is more robust using new upstream workflows
Improved queries
Updated tests, and removing duplicates/excessive tests. There's no need for one test per line of code/tree sitter rule, it just adds unnecessary higher-level nodes when several of these tests can be combined into one test to test them all together
Rewrote grammar to be more in-tune with the tree-sitter style, including getting rid of useless precedences, rules, hidden or not, inlining where appropriate, supertyping relevant rules, unhiding certain rules/supertypes, etc.
Improved preproc calls by having content be nested inside these nodes, but this does break the case where it's in the middle of an if statement for example, similar to C. I think the tradeoff is worth it, only ~80 files out of 8k+ files fail because of this, and we get much nicer parse trees in the common case, which, e.g., improves code folding. Another added benefit is we can use preproc rules in spots where only that rule itself is valid, and not everything. e.g. a preproc containing expressions is valid inside expressions, but this won't be done at the top level.
Fuzz the scanner for obvious reasons

I will update the tokens for publishing the new version later today (needs pypi).

Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/189
Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/283 (todo - quotes)
Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/310
Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/319
Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/326
Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/331
Closes https://github.com/tree-sitter/tree-sitter-c-sharp/issues/332

amaanq commented 1 month ago

Wow, those two dynamic precedence issues caught me off guard. I noticed some places where prec.dynamic was used unnecessarily, and I thought those two were as well. That was really interesting and after a little digging, it makes sense. This is probably one of few grammars that is very sensitive to dynamic precedence changes. Thanks for the context, I've added them back

Others:

A lot of the fields were unnecessary, however, if you think some of the ones i removed were necessary, let me know and I can add them back
I added this back, it's generally preferable to not assign a node to a single literal since it just "absorbs" that literal and you can't directly query for the string literal then, but for consistency with all the other literals it makes sense
I don't see the point in this, unless it resolved ambiguities/reduced state count but I don't think it did
I remember this initially reduced the state count a lot when I was starting my rewrite, but now I noticed it doesn't, so yeah I can just have it as it was before
Added back, a field called initializer could work though if we just want to tag/distinguish the identifier though

damieng commented 1 month ago

While I appreciate you've put a lot of effort in I'm also disappointed such a massive PR has landed with zero up-front discussion.

The extra plumbing around packaging and bringing in-line with other grammars is much appreciated but some of these changes would have been a lot easier to review with a bit of thought up-front - e.g. not re-ordering grammar.js where unnecessary and keeping the corpus files in their old location so we can diff (rename/move breaks after a certain amount of change).

Are there regressions with existing parsing? I can see a large list of exclusions on the CI but it's unclear if these are new.

amaanq commented 1 month ago

Well I wasn't aware I had to discuss making large improvements beforehand, nor have I ever done so in any other upstream grammar I maintain.

The moving of tests to a different dir is necessary, top-level corpus dirs are unsupported upstream now and must be in test/corpus.

The exclusion list is either a parse error where it genuinely was an error (beforehand as well) or it has funky usage of preproc ifs, e.g. in the middle of an if/else statement. I have (imo) improved them such that their contents are children of the preproc_if node, much like how C does it, however, this requires that the contents inside be somewhat correctly formed (e.g. a regular statement/expression, wherever the preproc_if is applicable). I think this is a reasonable tradeoff for much more navigable trees.

Tamás politely and graciously pointed out a couple of mistakes regarding some funky dynamic precedence, which I was a little taken aback by as I explained earlier, but those are now fixed.

amaanq commented 1 month ago

@damieng @tamasvajk the diff for tests should be much more readable now

maxbrunsfeld commented 1 month ago

Hope you're able to reduce the state count. In general, a great way to do that is avoid long sequences with many variations.

In other words, turn this:

seq(
  optional('foo'),
  optional('bar'),
  optional('baz'),
  optional('quux'),
  // ...
)

into this:

seq(
  repeat(choice('foo', 'bar', 'baz', 'quux')),
  // ...
)

It seems like there may be opportunities to do some of that in this grammar - there may be unnecessarily-specific sequences that could be modeled more generically.

amaanq commented 1 month ago

Down to 9700/3200 after reworking top level stuff + preprocs 😁

damieng commented 1 month ago

This is shaping up great! If we can get attribute_argument back in then I'm good on the Roslyn alignment/back compat from the naming side.

I think we just need to check/adjust the changes and removals of fields to make sure Semantic doesn't break and we should be good to go.

dcreager commented 1 month ago

I think we just need to check/adjust the changes and removals of fields to make sure Semantic doesn't break and we should be good to go.

Thanks for checking in, :+1: from our side! We're pinned to the current release so we won't silently upgrade, and we're using the syntax highlighting queries directly from this repo, which I see are updated as part of this PR. We have augmented the tagging queries with the ability to build up scoped names, which we'll have to update as part of bumping to the new version containing these changes. But the changes to tags.scm in this PR look manageable, so I don't consider that a blocker.

amaanq commented 1 month ago

thanks for the review/feedback @maxbrunsfeld @damieng @tamasvajk @hvitved!

hvitved commented 1 month ago

Improved preproc calls by having content be nested inside these nodes, but this does break the case where it's in the middle of an if statement for example, similar to C. I think the tradeoff is worth it, only ~80 files out of 8k+ files fail because of this, and we get much nicer parse trees in the common case, which, e.g., improves code folding. Another added benefit is we can use preproc rules in spots where only that rule itself is valid, and not everything. e.g. a preproc containing expressions is valid inside expressions, but this won't be done at the top level.

Is there any way, using extras as before this PR, to still tolerate #ifs that happen inside expressions or statements? For example, this file appears to have more parse errors now than before.

tree-sitter / tree-sitter-c-sharp

Rewrite the grammar #333