StgState allocations dominate

sgraf812 commented 2 years ago

Here's a profile of a simplified benchmark case of NoFib's bernoulli after #8 has been fixed:

COST CENTRE                          MODULE                        SRC                                             %time %alloc

lookupEnvSO                          Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(631,1)-(649,21)      6.1    3.4
evalStackContinuation.\              Stg.Interpreter               lib/Stg/Interpreter.hs:(355,74)-(394,35)          5.6    9.1
builtinStgEval                       Stg.Interpreter               lib/Stg/Interpreter.hs:(154,1)-(201,103)          5.1    4.5
evalExpr.\                           Stg.Interpreter               lib/Stg/Interpreter.hs:(497,45)-(502,23)          4.8    5.3
evalExpr                             Stg.Interpreter               lib/Stg/Interpreter.hs:(423,1)-(533,93)           3.9    1.0
compare                              Stg.Syntax                    lib/Stg/Syntax.hs:(30,3)-(32,12)                  3.8    0.0
evalExpr.\                           Stg.Interpreter               lib/Stg/Interpreter.hs:(504,37)-(510,27)          3.0    1.9
addInterClosureCallGraphEdge.addEdge Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:820:7-127             2.5    0.8
setInsert                            Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(793,1)-(795,36)      2.5    0.0
decodeStgbin'                        Stg.IO                        lib/Stg/IO.hs:52:1-22                             2.5    4.6
readHeap                             Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(655,1)-(660,71)      2.2    0.9
addIntraClosureCallGraphEdge.addEdge Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:831:7-127             2.1    0.8
lookupEnv                            Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:652:1-53              2.0    1.6
addBinderToEnv                       Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:621:1-49              2.0    1.6
lookup#                              Data.HashMap.Base             Data/HashMap/Base.hs:509:1-80                     1.9    0.5
compare                              Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:1224:17-19            1.5    0.0
matchFirstLit                        Stg.Interpreter               lib/Stg/Interpreter.hs:(537,1)-(544,112)          1.5    3.0
==                                   Stg.Syntax                    lib/Stg/Syntax.hs:75:13-14                        1.4    0.0
stackPop.\                           Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:560:57-166            1.3    0.7
evalStackMachine.\                   Stg.Interpreter               lib/Stg/Interpreter.hs:339:24-82                  1.3    2.5
setProgramPoint                      Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:841:1-80              1.3    9.8
stackPop                             Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(558,1)-(563,19)      1.2    4.9
builtinStgApply                      Stg.Interpreter               lib/Stg/Interpreter.hs:(204,1)-(237,69)           1.1    1.2
addZippedBindersToEnv.\              Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:624:60-86             1.1    1.2
matchFirstCon                        Stg.Interpreter               lib/Stg/Interpreter.hs:(564,1)-(569,31)           1.1    1.9
tryNextDebugCommand                  Stg.Interpreter.Debugger      lib/Stg/Interpreter/Debugger.hs:(28,1)-(34,12)    1.0    0.4
store                                Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(579,1)-(589,106)     0.9    2.6
store.\                              Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:580:32-70             0.7    1.7
freshHeapAddress                     Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(568,1)-(570,87)      0.7    2.4
declareBinding.\                     Stg.Interpreter               lib/Stg/Interpreter.hs:(579,22)-(584,58)          0.6    1.0
stackPush                            Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(553,1)-(555,96)      0.5    4.6
>>=.\.\                              Data.Conduit.Internal.Conduit src/Data/Conduit/Internal/Conduit.hs:152:51-68    0.5    4.0
store.\                              Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:589:38-106            0.5    1.7
addIntraClosureCallGraphEdge         Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(830,1)-(838,5)       0.3    1.3
addInterClosureCallGraphEdge         Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:(819,1)-(827,5)       0.3    1.3
freshHeapAddress.\                   Stg.Interpreter.Base          lib/Stg/Interpreter/Base.hs:570:30-87             0.2    2.1

Most of the functions there are related to stack or heap manipulation. Looking at the code and the fact that setProgramPoint (which does only one thing: modify the StgState's ssCurrentProgramPoint) contributes almost 10% of all allocations, I think the lovely simple design of a single StgState which contains the whole interpreter state in a huge immutable record might be the next bottleneck.
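To make the cost concrete, here is a minimal sketch of why a one-field update allocates in proportion to the record's width. The record below is a hypothetical, cut-down stand-in: only `ssCurrentProgramPoint` and `setProgramPoint` are names from this thread, the other fields are invented for illustration, and the real `setProgramPoint` presumably runs inside the interpreter monad rather than as a pure function.

```haskell
-- Hypothetical, cut-down stand-in for the interpreter state. The real
-- StgState has many more fields, which makes each copy more expensive.
data StgState = StgState
  { ssHeap                :: [(Int, String)]  -- placeholder field
  , ssStack               :: [String]         -- placeholder field
  , ssNextHeapAddr        :: Int              -- placeholder field
  , ssCurrentProgramPoint :: Int
  }

-- Record update syntax: looks like an in-place change, but it allocates
-- a fresh StgState constructor.
setProgramPoint :: Int -> StgState -> StgState
setProgramPoint pp s = s { ssCurrentProgramPoint = pp }

-- What the update amounts to: every other field is copied (by pointer)
-- into the new constructor, so the allocation grows with the field count.
setProgramPoint' :: Int -> StgState -> StgState
setProgramPoint' pp (StgState heap stack next _) =
  StgState heap stack next pp
```

With n fields, each such update allocates roughly n + 1 heap words, regardless of how many fields actually changed.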

Unfortunately, we don't have mutable fields (yet) in GHC Haskell. So here are other suggestions:

1. Make all fields of StgState STRefs or MVars. Probably the most performant option.
2. Segregate StgState into two (or more) records StgStateHot/StgStateCold. Put hot stuff like ssCurrentProgramPoint in StgStateHot. Bonus points for a record pattern synonym that keeps the old interface (but then call sites must be absolutely sure to inline away the PS)
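A minimal sketch of the first suggestion, assuming `IORef` as the mutable reference type (the same shape would apply to other reference types such as `MVar`). All type, field, and function names here are hypothetical and not taken from the interpreter's code.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')

-- Hypothetical state record whose fields are mutable references.
-- The record itself is allocated once; updates write through the refs.
data StgStateRefs = StgStateRefs
  { ssCurrentProgramPointRef :: IORef Int
  , ssStackRef               :: IORef [String]
  }

newStgStateRefs :: IO StgStateRefs
newStgStateRefs = StgStateRefs <$> newIORef 0 <*> newIORef []

-- Overwrites one reference; the enclosing record is not copied.
setProgramPointRef :: StgStateRefs -> Int -> IO ()
setProgramPointRef s = writeIORef (ssCurrentProgramPointRef s)

-- Strict modify, so the stack reference does not accumulate thunks.
stackPushRef :: StgStateRefs -> String -> IO ()
stackPushRef s frame = modifyIORef' (ssStackRef s) (frame :)

getProgramPointRef :: StgStateRefs -> IO Int
getProgramPointRef = readIORef . ssCurrentProgramPointRef
```

The trade-off is that the state is no longer a pure value: snapshotting or rolling it back means reading out and copying every reference by hand.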

csabahruska commented 2 years ago

Having pure state was the main goal and achievement of the interpreter. This will not be changed for sure because it would ruin readability and simplicity. Haskell simply needs a better compiler. IMO it is a seriously bad habit to make Haskell programs more imperative to gain performance. Instead improve the compiler.

csabahruska commented 2 years ago

Use staged compilation to make it faster. https://github.com/AndrasKovacs/staged

csabahruska commented 2 years ago

Please customize the interpreter for your needs. The idea is that one could specialize and refactor the interpreter easily to do experiments without worrying about the code quality, focusing instead on the creative and research domain specific parts.

sgraf812 commented 2 years ago

> Having pure state was the main goal and achievement of the interpreter

Yes, and I agree that's a big deal. From what I heard, implementing our instrumentation ideas on top of your work was quite a breeze.

> Instead improve the compiler.

A static analysis that reuses heap cells like that is non-trivial. I also live in the here and now, and at the moment we don't have such an analysis.

> Use staged compilation to make it faster.

I agree that might be a valuable path to explore, but that is not much of a short-term solution. It is also unclear to me whether that even optimises away all the StgState overhead.

What do you think about my second suggestion?

> Segregate StgState into two (or more) records StgStateHot/StgStateCold. Put hot stuff like ssCurrentProgramPoint in StgStateHot. Bonus points for a record pattern synonym that keeps the old interface (but then call sites must be absolutely sure to inline away the PS)

I think that will go a long way towards less copying of large StgStates and it won't impact customisability of the interpreter at all.
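A rough sketch of what such a split could look like, using a record pattern synonym to keep a flat `StgState`-style interface. Apart from `StgStateHot`, `StgStateCold` and `ssCurrentProgramPoint`, every name and field below is hypothetical, and which fields count as hot is an assumption for illustration only.

```haskell
{-# LANGUAGE PatternSynonyms #-}

-- Frequently updated fields (hypothetical selection).
data StgStateHot = StgStateHot
  { hotCurrentProgramPoint :: Int
  , hotStack               :: [String]
  }

-- Rarely updated fields (hypothetical selection).
data StgStateCold = StgStateCold
  { coldDebugLog  :: [String]
  , coldCallGraph :: [(Int, Int)]
  }

data StgStateSplit = StgStateSplit
  { ssHot  :: StgStateHot
  , ssCold :: StgStateCold
  }

-- Record pattern synonym presenting the split state as one flat record,
-- so existing construction, matching and field access keep working.
pattern StgState :: Int -> [String] -> [String] -> [(Int, Int)] -> StgStateSplit
pattern StgState { ssCurrentProgramPoint, ssStack, ssDebugLog, ssCallGraph } =
  StgStateSplit (StgStateHot ssCurrentProgramPoint ssStack)
                (StgStateCold ssDebugLog ssCallGraph)
{-# COMPLETE StgState #-}

-- Hot-path setter written against the split type: it rebuilds the small
-- hot record and the two-field wrapper, and shares the cold record as is.
setProgramPoint :: Int -> StgStateSplit -> StgStateSplit
setProgramPoint pp s = s { ssHot = (ssHot s) { hotCurrentProgramPoint = pp } }
```

An update that goes through the synonym, such as `s { ssCurrentProgramPoint = pp }`, still type-checks, but as written it rebuilds both halves unless the optimiser removes that work, which is the inlining caveat mentioned above.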

csabahruska commented 2 years ago

I do not plan to optimize the interpreter further. To me the interpreter should be a high level specification, which it is right now, and it should not have optimization related noise at all. The reason why I stick to this idea is that I plan to do experiments where I use the interpreter as a specification literally and generate code from it (i.e. a free monad based interpreter). So I need to keep the code simple. BTW you could optimize it for your custom research if you wish, just fork it. Do not look at the interpreter as a software product, so do not hesitate to do ad-hoc modifications on it, it's cheap. So please implement your optimization ideas by yourself in your fork.

> A static analysis that reuses heap cells like that is non-trivial. I also live in the here and now, and at the moment we don't have such an analysis.

One of the GRIN Compiler's goals is to experiment with such analyses and make them real.

grin-compiler / ghc-whole-program-compiler-project

StgState allocations dominate #9