Closed: sgraf812 closed this 4 months ago
Could we pick the entry size at code generation time? Use 8, 16, 32, or 64 bits depending on how many are needed.
But that would mean more CPP when we want less. I honestly don't see the appeal in that: when all offsets fit in 16 bits, the parser is so small that it doesn't matter whether the tables are twice as big; and when offsets need 64 bits, the table would have to be at least 16GB I think, which is unrealistic as well.
Do also note that .rodata (where the tables are put) is just 116KB for GHC's parser, whereas .text (which contains the reduction actions) is 1.14MB. Point being: the tables aren't that large compared to the substantial amount of Haskell code we generate for all the data types, reduction actions etc.
As for issue93: I implemented the change to 32 bit (in https://github.com/haskell/happy/pull/272) and compiled the generated Haskell file with optimisations:
$ nix run nixpkgs#bloaty -- tests/issue93.o
    FILE SIZE        VM SIZE
 --------------  --------------
  22.1%   460Ki  48.7%   460Ki    .text
  16.7%   347Ki  36.7%   347Ki    .rodata
  14.5%   301Ki   0.0%       0    .strtab
  14.5%   301Ki   0.0%       0    .rela.text
  13.3%   277Ki   0.0%       0    .rela.data
  12.0%   249Ki   0.0%       0    .symtab
   6.5%   135Ki  14.3%   135Ki    .data
   0.3%  5.29Ki   0.0%       0    .rela.rodata
   0.2%  3.16Ki   0.3%  3.10Ki    .rodata.str
   0.0%     192   0.0%       0    [ELF Headers]
   0.0%     168   0.0%       0    .shstrtab
   0.0%     139   0.0%       0    .comment
   0.0%       9   0.0%       0    [Unmapped]
 100.0%  2.03Mi 100.0%   946Ki    TOTAL
Still only 347KB of .rodata compared to 460KB .text (reductions, data type stuff, etc.), vs. 266KB .rodata with happy-1.20. It appears that the amount of code we generate still surpasses the size of the table.
Alright. Only 16 and 32 bit are realistic scenarios. I threw in 8 and 64 just for good measure.
I honestly don't see the appeal in that: when all offsets fit in 16 bits, the parser is so small that it doesn't matter whether the tables are twice as big.
My understanding is that doubling the size of the table will make cache behavior worse. We're talking about an additional 250KB in case of GHC:
That's at most 250KB of tables (\xFF encodes one byte); doubling that to 500KB won't hurt.
On my machine, cat /proc/cpuinfo reports cache size: 512 KB, so now the table will barely fit and leave no room for anything else.
Correct me if I'm wrong, I haven't done any hard measurements.
But that would mean more CPP when we want less.
I don't think we need to emit CPP. Couldn't we determine the appropriate entry size in the code generator itself? Or did you mean any sort of conditional whatsoever?
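A minimal sketch of what I have in mind, purely illustrative (EntryWidth, entryWidth and haskellType are made-up names, not part of happy's actual code generator): scan the table entries once at generation time, pick the smallest width that fits, and print the corresponding fixed-width type into the output, so no CPP is needed.

```haskell
-- Illustrative sketch only: choose the entry width from the data at
-- code generation time instead of emitting CPP into the parser.
data EntryWidth = W8 | W16 | W32 | W64
  deriving (Show, Eq, Ord)

-- Hypothetical helper: the smallest signed width that can represent
-- every table entry (offsets can be negative, hence symmetric bounds).
entryWidth :: [Int] -> EntryWidth
entryWidth entries
  | fits 7    = W8
  | fits 15   = W16
  | fits 31   = W32
  | otherwise = W64
  where
    fits bits = all (\e -> e >= negate (2 ^ bits) && e < 2 ^ bits) entries

-- The generator would then print the matching Haskell type exactly once
-- into the generated module.
haskellType :: EntryWidth -> String
haskellType W8  = "Int8"
haskellType W16 = "Int16"
haskellType W32 = "Int32"
haskellType W64 = "Int64"
```

For instance, entryWidth [-5, 40000] evaluates to W32.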
I would not worry too much about caches; after all, we are still ultimately writing Haskell, where we have to allocate a lot. Plus, reduction actions are far more costly than a few fetches from RAM, with all those thunk evals and allocating a syntax tree.
As for GHC's parser, the more accurate .rodata size is 116KB (with 16 bit offsets), not my initial estimate of 250KB. For my PoC, that increases to 237KB, but .text (which includes the reduction actions) is still 1.25MB:
$ nix run nixpkgs#bloaty -- _quick/stage1/compiler/build/GHC/Parser.o
    FILE SIZE        VM SIZE
 --------------  --------------
  39.3%  1.25Mi  77.2%  1.25Mi    .text
  37.4%  1.19Mi   0.0%       0    .rela.text
   7.9%   258Ki   0.0%       0    .rela.data
   7.3%   237Ki  14.4%   237Ki    .rodata
   3.8%   123Ki   7.5%   123Ki    .data
   2.5%  80.3Ki   0.0%       0    .strtab
   1.1%  37.3Ki   0.0%       0    .symtab
   0.5%  16.8Ki   1.0%  16.8Ki    .rodata.str
   0.1%  4.16Ki   0.0%       0    .rela.rodata
   0.0%     192   0.0%       0    [ELF Headers]
   0.0%     187   0.0%       0    .shstrtab
   0.0%     146   0.0%       0    .comment
   0.0%     112   0.0%      48    .note.gnu.property
   0.0%      17   0.0%       0    [Unmapped]
 100.0%  3.17Mi 100.0%  1.61Mi    TOTAL
I would be really surprised if table accesses were the bottleneck in realistic uses of happy. Furthermore, the parser is hardly a bottleneck in a compiler or LSP at all (unless there's a bug in how the parser or its actions are used, as you recently found out); otherwise we'd be using https://github.com/knothed/happy-rad rather than a table-based backend. Alas, although a code-based parser is faster, it takes quite a bit longer to compile and the executable gets a bit larger; see Tables 5.5-5.8 here.
Other than that, I haven't done any perf measurements either, but I'm convinced that it doesn't matter much.
I'm convinced that it doesn't matter much.
Alright. Perhaps someone someday will measure the actual effect of going from 16-bit to 32-bit, but it won't be me 😄 Correctness is more important anyway.
There are multiple issues related to running into the 16 bit limit of the tables encoded by happy (e.g. #93). I don't see the appeal in having 16 bit array entries; let's double it to 32 bit.
"#
, I count about 1M characters. That's at most 250KB of tables (\xFF
encodes one byte); doubling that to 500KB won't hurt.More seriously, I tried
bloaty
on GHC's parser:and then on the repro for #93 (Edit: it turns out that the linked executable contains the whole RTS of course; and the tables are all in .rodata contributing just 380KB):
So actually it's a bit larger than anticipated; I wonder why that is but I'm not going to investigate.
Anyway, I think it's far more important to guarantee that large parsers can be generated correctly rather than to generate them incorrectly in the smallest space possible.
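To make the size arithmetic above concrete, here is a hedged sketch of reading entries out of such a byte-packed table string (illustrative only; happy's real templates, byte order, and escaping may differ):

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (ord)

-- Illustration only, not happy's actual decoder. The point is merely that
-- an n-bit entry occupies n/8 bytes of the packed string, so going from
-- 16-bit to 32-bit entries doubles the byte count of the tables.

-- Read the i-th 16-bit entry: 2 bytes, little-endian, sign-extended.
readEntry16 :: String -> Int -> Int
readEntry16 table i = signExtend 16 (byte 0 .|. (byte 1 `shiftL` 8))
  where byte k = ord (table !! (2 * i + k))

-- Read the i-th 32-bit entry: same table layout, but 4 bytes per entry.
readEntry32 :: String -> Int -> Int
readEntry32 table i =
    signExtend 32 (foldr (\k acc -> (acc `shiftL` 8) .|. byte k) 0 [0 .. 3])
  where byte k = ord (table !! (4 * i + k))

-- Interpret the low `bits` bits of an unsigned value as a signed number.
signExtend :: Int -> Int -> Int
signExtend bits u
  | u >= 2 ^ (bits - 1) = u - 2 ^ bits
  | otherwise           = u
```

For example, readEntry16 "\x34\x12" 0 evaluates to 0x1234; storing the same value as a 32-bit entry takes the four bytes "\x34\x12\x00\x00".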