haskell / happy

The Happy parser generator for Haskell

Encode 32 bit integers in arrays #266

Open sgraf812 opened 5 months ago

sgraf812 commented 5 months ago

There are multiple issues related to running into the 16-bit limit of the tables encoded by happy.

I don't see the appeal in having 16-bit array entries; let's double them to 32 bits.
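To make the trade-off concrete: the tables land in the generated module as strings of \xNN bytes, and the entry width just decides how many of those bytes a single lookup reads. Below is a minimal sketch of the difference, assuming a plain little-endian byte string and ignoring signedness; this is not Happy's actual template code, and the names are made up:

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Char (ord)

-- 16-bit entries: two bytes per entry (little-endian here); offsets that
-- grow past the 16-bit range no longer fit.
readEntry16 :: String -> Int -> Int
readEntry16 tbl i = lo .|. (hi `shiftL` 8)
  where
    lo = ord (tbl !! (2 * i))
    hi = ord (tbl !! (2 * i + 1))

-- 32-bit entries: four bytes per entry, so the encoded table doubles in
-- size, but there is plenty of headroom for large parsers.
readEntry32 :: String -> Int -> Int
readEntry32 tbl i =
  foldr (\b acc -> ord b .|. (acc `shiftL` 8)) 0
        (take 4 (drop (4 * i) tbl))
```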

More seriously, I tried bloaty on GHC's parser:

nix run nixpkgs#bloaty -- _validate/stage1/compiler/build/GHC/Parser.o
    FILE SIZE        VM SIZE
 --------------  --------------
  38.4%  1.14Mi  79.0%  1.14Mi    .text
  32.6%   991Ki   0.0%       0    .rela.text
  11.8%   360Ki   0.0%       0    .rela.data
   5.9%   179Ki  12.1%   179Ki    .data
   4.7%   142Ki   0.0%       0    .strtab
   3.8%   116Ki   7.9%   116Ki    .rodata
   2.0%  62.0Ki   0.0%       0    .symtab
   0.5%  14.3Ki   1.0%  14.2Ki    .rodata.str
   0.2%  6.95Ki   0.0%       0    .rela.rodata
   0.0%     192   0.0%       0    [ELF Headers]
   0.0%     187   0.0%       0    .shstrtab
   0.0%     146   0.0%       0    .comment
   0.0%     112   0.0%      48    .note.gnu.property
   0.0%      23   0.0%       0    [Unmapped]
 100.0%  2.97Mi 100.0%  1.44Mi    TOTAL

and then on the repro for #93 (Edit: it turns out that the linked executable contains the whole RTS, of course, and the tables are all in .rodata, contributing just 380KB):

nix run nixpkgs#bloaty -- tests/issue93
    FILE SIZE        VM SIZE
 --------------  --------------
  31.6%  4.24Mi  69.4%  4.24Mi    .text
  17.4%  2.33Mi   0.0%       0    .strtab
  15.5%  2.07Mi   0.0%       0    .debug_info
  10.4%  1.40Mi  22.9%  1.40Mi    .data
   8.2%  1.10Mi   0.0%       0    .debug_loc
   6.0%   827Ki   0.0%       0    .symtab
   3.2%   444Ki   0.0%       0    .debug_line
   2.8%   381Ki   6.1%   380Ki    .rodata
   2.7%   376Ki   0.0%       0    .debug_ranges
   0.9%   119Ki   0.0%       0    .debug_abbrev
   0.6%  76.1Ki   0.0%       0    .debug_str
   0.4%  49.2Ki   0.8%  49.2Ki    .eh_frame
   0.0%       0   0.3%  18.4Ki    .bss
   0.1%  10.2Ki   0.2%  10.1Ki    .eh_frame_hdr
   0.1%  7.49Ki   0.1%  5.27Ki    [23 Others]
   0.0%  5.76Ki   0.1%  5.70Ki    .dynsym
   0.0%  5.52Ki   0.1%  5.46Ki    .rela.plt
   0.0%  4.91Ki   0.0%       0    .debug_aranges
   0.0%  4.12Ki   0.0%       0    .debug-ghc-link-info
   0.0%  3.72Ki   0.1%  3.66Ki    .plt
   0.0%  2.68Ki   0.0%  2.62Ki    .dynstr
 100.0%  13.4Mi 100.0%  6.11Mi    TOTAL

So actually it's a bit larger than anticipated; I wonder why that is but I'm not going to investigate.

Anyway, I think it's far more important to guarantee that large parsers can be generated correctly rather than to generate them incorrectly in the smallest space possible.

int-index commented 5 months ago

Could we pick the entry size at code generation time? Use 8, 16, 32, or 64 bits depending on how many are needed.
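A minimal sketch of what that selection could look like inside the generator, assuming it has all table entries at hand as plain signed Ints (the helper is hypothetical, not existing happy code):

```haskell
-- Hypothetical helper: pick the narrowest signed width that can hold
-- every table entry the generator is about to emit.
data EntryWidth = W8 | W16 | W32 | W64
  deriving (Eq, Show)

pickEntryWidth :: [Int] -> EntryWidth
pickEntryWidth entries
  | fitsAll 8  = W8
  | fitsAll 16 = W16
  | fitsAll 32 = W32
  | otherwise  = W64
  where
    fitsAll bits =
      all (\x -> x >= negate (2 ^ (bits - 1)) && x < 2 ^ (bits - 1)) entries
```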

sgraf812 commented 5 months ago

But that would mean more CPP when we want less. I honestly don't see the appeal in that; when all offsets fit in 16 bits, the parser is so small that it doesn't matter whether it is twice as big. When offsets need 64 bits, the table will be at least 16GB I think (more than 2^31 entries at 8 bytes each), which is unrealistic as well.

Do also note that .rodata (where the tables are put) is just 116KB for GHC's parser, whereas .text (which contains the reduction actions) is 1.14MB. Point being: The tables aren't that large compared to the substantial amount of Haskell code we generate for all the data types, reduction actions etc.

As for issue93: I implemented the change to 32-bit entries (in https://github.com/haskell/happy/pull/272) and compiled the generated Haskell file with optimisations:

nix run nixpkgs#bloaty -- tests/issue93.o
    FILE SIZE        VM SIZE
 --------------  --------------
  22.1%   460Ki  48.7%   460Ki    .text
  16.7%   347Ki  36.7%   347Ki    .rodata
  14.5%   301Ki   0.0%       0    .strtab
  14.5%   301Ki   0.0%       0    .rela.text
  13.3%   277Ki   0.0%       0    .rela.data
  12.0%   249Ki   0.0%       0    .symtab
   6.5%   135Ki  14.3%   135Ki    .data
   0.3%  5.29Ki   0.0%       0    .rela.rodata
   0.2%  3.16Ki   0.3%  3.10Ki    .rodata.str
   0.0%     192   0.0%       0    [ELF Headers]
   0.0%     168   0.0%       0    .shstrtab
   0.0%     139   0.0%       0    .comment
   0.0%       9   0.0%       0    [Unmapped]
 100.0%  2.03Mi 100.0%   946Ki    TOTAL

Still only 347KB of .rodata compared to 460KB .text (reductions, data type stuff, etc.), vs. 266KB .rodata with happy-1.20. It appears that the amount of code we generate still surpasses the size of the table.

int-index commented 5 months ago

Alright. Only 16 and 32 bits are realistic scenarios. I threw in 8 and 64 just for good measure.

> I honestly don't see the appeal in that; when all offsets fit in 16 bits, the parser is so small that it doesn't matter whether it is twice as big

My understanding is that doubling the size of the table will make cache behavior worse. We're talking about an additional 250KB in case of GHC:

> That's at most 250KB of tables (\xFF encodes one byte); doubling that to 500KB won't hurt.

On my machine cat /proc/cpuinfo reports cache size: 512 KB, so now the table will barely fit and leave no room for anything else.

Correct me if I'm wrong, I haven't done any hard measurements.

> But that would mean more CPP when we want less.

I don't think we need to emit CPP. Couldn't we determine the appropriate entry size in the code generator itself? Or did you mean any sort of conditional whatsoever?
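For illustration, one way this could stay CPP-free (all names hypothetical, not happy's actual internals): the generator picks the width once and splices the concrete element type into the source it emits, so the generated module itself carries no conditionals.

```haskell
-- Same width type as in the earlier sketch, now on the emission side:
-- the generator bakes the chosen element type into the source it
-- writes, so the generated module needs no #if at all.
data EntryWidth = W8 | W16 | W32 | W64

entryElemType :: EntryWidth -> String
entryElemType W8  = "Word8"
entryElemType W16 = "Word16"
entryElemType W32 = "Word32"
entryElemType W64 = "Word64"

emitTableSig :: EntryWidth -> String -> String
emitTableSig w name = name ++ " :: UArray Int " ++ entryElemType w
```

How the tables are actually represented in the generated code is a separate question; the point is only that the decision can be made before any Haskell is written out.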

sgraf812 commented 5 months ago

I would not worry too much about caches; after all, we are still ultimately writing Haskell, where we have to allocate a lot. Plus, reduction actions are far more costly than a few fetches from RAM, with all those thunk evals and allocating a syntax tree.

As for GHC's parser, the more accurate .rodata size is 116KB (with 16-bit offsets), not my initial estimate of 250KB. For my PoC, that increases to 237KB, but .text (which includes the reduction actions) is still 1.25MB:

$ nix run nixpkgs#bloaty -- _quick/stage1/compiler/build/GHC/Parser.o
    FILE SIZE        VM SIZE
 --------------  --------------
  39.3%  1.25Mi  77.2%  1.25Mi    .text
  37.4%  1.19Mi   0.0%       0    .rela.text
   7.9%   258Ki   0.0%       0    .rela.data
   7.3%   237Ki  14.4%   237Ki    .rodata
   3.8%   123Ki   7.5%   123Ki    .data
   2.5%  80.3Ki   0.0%       0    .strtab
   1.1%  37.3Ki   0.0%       0    .symtab
   0.5%  16.8Ki   1.0%  16.8Ki    .rodata.str
   0.1%  4.16Ki   0.0%       0    .rela.rodata
   0.0%     192   0.0%       0    [ELF Headers]
   0.0%     187   0.0%       0    .shstrtab
   0.0%     146   0.0%       0    .comment
   0.0%     112   0.0%      48    .note.gnu.property
   0.0%      17   0.0%       0    [Unmapped]
 100.0%  3.17Mi 100.0%  1.61Mi    TOTAL

I would be really surprised if table accesses were the bottleneck in realistic uses of happy. Furthermore, the parser is hardly a bottleneck in a compiler or LSP at all (unless there's a bug in using the parser or its actions, as you recently found out); otherwise we'd be using https://github.com/knothed/happy-rad rather than a table-based backend. Alas, although a code-based parser is faster, it takes quite a bit longer to compile and the executable gets a bit larger; see Tables 5.5-5.8 here.

Other than that, I haven't done any perf measurements either, but I'm convinced that it doesn't matter much.

int-index commented 5 months ago

> I'm convinced that it doesn't matter much.

Alright. Perhaps someone someday will measure the actual effect of going from 16-bit to 32-bit, but it won't be me 😄 Correctness is more important anyway.