lezer-parser/lezer

Dev utils and issues for the Lezer core packages

generator runs out of memory on large grammar #44

Closed: nchen63 closed this issue 4 months ago

nchen63 commented 11 months ago

I have a large grammar that causes lezer-generator to run out of memory, even when running Node with --max-old-space-size set to 24 GB.

I have tried to reduce the number of states by breaking rules with many optional parts out into separate rules, but that hasn't solved the problem. One rule has ~600 or-ed tokens; removing it did let the grammar build finish, but I don't know how that rule could be optimized. Breaking it up didn't seem to help either.

Can you offer guidance as to how the memory footprint could be reduced (either by modifying the grammar or with a PR)?
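For illustration, a grammar with this shape might look like the following hypothetical fragment (all names invented; the actual grammar was not posted in the issue):

```
// Hypothetical fragment illustrating the shape described above:
// one rule or-ing hundreds of distinct tokens, combined with
// rules that stack several optional parts.
command {
  kwAlpha | kwBeta | kwGamma // ... ~600 alternatives in total
}

// Each independent optional doubles the number of expanded
// alternatives: five optionals alone yield 2^5 = 32 rule bodies,
// each multiplied against the ~600-way token union above.
Statement { command Arg? Flag? Modifier? Suffix? }
```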

marijnh commented 11 months ago

The parser-generator algorithm used by this tool definitely has limits, and you'll want to keep your (expanded) grammar size within reasonable bounds. A rule with 600 or-ed tokens sounds like a bad idea. Do these tokens actually serve different purposes in the grammar, or could you collapse them into a single, more generic token type? Optionals and or-expressions are compiled by expanding them into multiple rules (e.g. `(a | b) c?` becomes `a`, `b`, `a c`, `b c`), and if you combine a bunch of them you can get a serious blowup in the number of rules.
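A minimal sketch of that collapse, assuming the or-ed tokens are keyword-like words that one generic identifier token can match (all rule names hypothetical; `@specialize` is Lezer's documented mechanism for giving specific identifier values their own token where they genuinely matter):

```
@top Program { Command* }

// Instead of or-ing hundreds of keyword tokens, match one generic
// identifier token, so this position contributes a single
// alternative to the expanded grammar.
Command { identifier Arg? }

// Only keywords that truly need a distinct grammar role are
// specialized back out of the generic identifier token.
kw<term> { @specialize[@name={term}]<identifier, term> }

Arg { kw<"with"> identifier }

@tokens {
  identifier { $[a-zA-Z]+ }
  space { $[ \t\n]+ }
}

@skip { space }
```

With this shape the rule contributes one alternative instead of ~600, so the optional parts no longer multiply against a huge token union.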

marijnh commented 4 months ago

Recent versions of lezer-generator should emit warnings for rules that produce an enormous number of alternatives, which should help guide users toward fixing that. Beyond that, the LR parser generator algorithm simply works this way: it needs to generate the entire state space for the grammar, and I think there will always be a point of complexity where it breaks down, in both memory use and processing time.