goccmack / gocc

Parser / Scanner Generator
Other
603 stars 48 forks source link

Add support for Pager's practical general method for constructing LR (k) parsers #34

Open mewmew opened 7 years ago

mewmew commented 7 years ago

As motivated by Marius in https://github.com/goccmack/gocc/issues/13#issuecomment-181827331

When I have the time again I would like to add Pager's PGM again because it generates smaller tables for full LR(1) grammars -- similar in size to LALR for the same grammar. I also think that the original paper is beautiful and that it is a pity that LALR became popular instead of this technique. The PGM parse table generator was working in gocc3 but not extensively tested.

mewmew commented 7 years ago

For grammars containing many production rules the shift/reduce tables tend to blow up in size. As an example, the LLVM IR grammar contains about 750 production rules (the full language may contain approximately 1000 production rules).

The generated action and goto tables of the parser are huge if uncompressed (i.e. not using -zip).

u@x220 ~/D/g/s/g/l/l/a/i/parser> du -hs actiontable.go gototable.go 
91M actiontable.go
57M gototable.go
u@x220 ~/D/g/s/g/l/l/a/i/parser> wc -l actiontable.go gototable.go 
  3108914 actiontable.go
  2860201 gototable.go
  5969115 total

The source code of these tables is almost 150 MB in total and contains almost 6 million lines of code.

For much larger grammars than this, using uncompressed tables starts to be infeasible.

https://github.com/goccmack/gocc/issues/28 tried to mitigate some of these issues by compressing the tables. While this works great, both for compilation speed and file size, some issues remain. First of, the runtime memory use of the uncompressed tables is still large. And secondly, it is not possible to track code coverage of production rules, when these tables are compressed, as a test case giving coverage for a single production rule, will give code coverage to all production rules (since the tables occupy a single line in the source code).