tree-sitter / tree-sitter-c-sharp

C# Grammar for tree-sitter
MIT License
177 stars 47 forks source link

fix(perf): make optional semicolons external to avoid intensive build times #309

Closed brandonspark closed 1 year ago

brandonspark commented 1 year ago

What:

This PR makes the optional semicolons in the grammar external symbols, to avoid extremely intensive build times.

Why:

On my machine, building the parser as it currently stands takes on the order of 10 minutes, and at maximum, around 40G of memory on my machine.

From bisecting and seeing what has caused this massive memory spike, it seems that adding the optional semicolons in this PR: https://github.com/tree-sitter/tree-sitter-c-sharp/commit/7bfc87dba66906bbeb2722d8e0dab6800de53a18 is what originally caused the giant spike in time and memory. Before this PR, building the C# parser took around 15 seconds.

Removing all the optional semicolons from the grammar, and re-building, takes a palatable 1 minute and 12G on my machine.

Test plan:

tree-sitter test succeeds.

@maxbrunsfeld do you have any idea why this might happen? I'm afraid I don't understand enough of the tree-sitter algorithm to be able to say. I just know that this fixes the issue, or more accurately, sidesteps it. It would be nice to know whether this memory blowup is expected, or if there's some kind of underlying thing going on.

damieng commented 1 year ago

Thanks for this, it does seem to move the needle down a bit (the file size checks are one indicator) and I'm also unsure why but it might be a pattern we can use elsewhere.

https://github.com/tree-sitter/tree-sitter-c-sharp/commit/168390ca8fd6415dcbd976af8c6df833912683f6 is a fix that almost doubled the size of the parser so we might want to consider a similar treatment there if possible.

aryx commented 1 year ago

cc @mjambon about the memory issue and tree-sitter

aryx commented 1 year ago

@maxbrunsfeld if you have any insight why those semicolons are generating such as huge spike in time and memory, that would be really helpful. We actually started to reach some CI limits to run tree-sitter in ocaml-tree-sitter because of memory issue too.

maxbrunsfeld commented 1 year ago

~It looks like this grammar doesn't use the word feature, which makes lexical analysis of keywords more efficient in terms of code size. That would be the first thing I'd look into.~

Oh, sorry, the grammar already uses that. Scratch that idea.

damieng commented 1 year ago

I can cut the grammar size almost in half by replacing:

identifier: $ => choice($._identifier_token, $._contextual_keywords),

with

identifier: $ => $.identifier_token,

The problem is then you can't use contextual keywords like from and file as identifiers where you should be able to, e.g.

var file = "test";

Do you have any alternative ideas for how to tackle these contextual keywords Max?

maxbrunsfeld commented 1 year ago

What if you mark that _contextual_keywords rule as inline? I wonder if that would help.

damieng commented 1 year ago

Doing that causes conflicts so I have to change the _contextual_keywords conflict rules to be identifier... which reduces the parser.c but only by 0.5MB (48.9MB to 48.4MB) but also causes a regression in that global as a parameter name is no longer recognized as such and becomes the keyword again (Lambda parameter named global corpus test)

tamasvajk commented 1 year ago

@damieng I tried the above too. I could get rid of the regression by first removing the global rule: https://github.com/tree-sitter/tree-sitter-c-sharp/pull/313

However, the size impact is still small: 47.9MB vs 48.9MB.