evalPrimOp is a bottle-neck due to linear search

sgraf812 commented 2 years ago

While the implementation of, e.g., ...ByteArray.evalPrimOp is rather direct and elegant at the moment

newtype Name = BS8.ByteString

evalPrimOp :: PrimOpEval -> Name -> [Atom] -> Type -> Maybe TyCon -> M [Atom]
evalPrimOp fallback op args t tc = case (op, args) of

  -- newByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 False
    pure [MutableByteArray baIdx]

  -- newPinnedByteArray# :: Int# -> State# s -> (# State# s, MutableByteArray# s #)
  ( "newPinnedByteArray#", [IntV size, _s]) -> do
    baIdx <- newByteArray size 1 True
    pure [MutableByteArray baIdx]

It yields rather slow code. That is because GHC doesn't do the "obvious" trie-based match optimisation, which means we end up with a linear chain of comparisons

if op == "newByteArray#" then ,,,
else if op == "newPinnedByteArray#" then ...
...

in Core.

I took a profile (on nofib"s bernoulli, if that matters) and ...ByteArray.evalPrimOp takes about 10% of time and allocation. Here is an excerpt of the profile

COST CENTRE                          MODULE                            SRC                                                       %time %alloc

evalPrimOp                           Stg.Interpreter.PrimOp.ByteArray  lib/Stg/Interpreter/PrimOp/ByteArray.hs:(71,1)-(884,28)     9.4   10.9
evalPrimOp                           Stg.Interpreter.PrimOp.Addr       lib/Stg/Interpreter/PrimOp/Addr.hs:(23,1)-(357,28)          6.5    8.2
evalStackContinuation.\              Stg.Interpreter                   lib/Stg/Interpreter.hs:(355,74)-(394,35)                    4.4    5.8
lookupEnvSO                          Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(630,1)-(648,21)                4.2    2.2
builtinStgEval                       Stg.Interpreter                   lib/Stg/Interpreter.hs:(154,1)-(201,103)                    3.2    2.9
evalPrimOp                           Stg.Interpreter.PrimOp.Float      lib/Stg/Interpreter/PrimOp/Float.hs:(17,1)-(120,28)         2.8    2.9
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(497,45)-(502,23)                    2.8    3.4
evalPrimOp                           Stg.Interpreter.PrimOp.Double     lib/Stg/Interpreter/PrimOp/Double.hs:(17,1)-(127,28)        2.7    3.0
evalExpr                             Stg.Interpreter                   lib/Stg/Interpreter.hs:(423,1)-(533,93)                     2.5    0.6
compare                              Stg.Syntax                        lib/Stg/Syntax.hs:(30,3)-(32,12)                            2.4    0.0
evalExpr.\                           Stg.Interpreter                   lib/Stg/Interpreter.hs:(504,37)-(510,27)                    2.4    1.2
setInsert                            Stg.Interpreter.Base              lib/Stg/Interpreter/Base.hs:(776,1)-(778,36)                2.1    0.0
evalPrimOp                           Stg.Interpreter.PrimOp.Int        lib/Stg/Interpreter/PrimOp/Int.hs:(25,1)-(149,28)           2.0    2.1
evalPrimOp                           Stg.Interpreter.PrimOp.SmallArray lib/Stg/Interpreter/PrimOp/SmallArray.hs:(27,1)-(157,28)    1.9    1.9

I think a bit of focus on optimising evalPrimOp may well speed up the interpreter by 50%. One way to do so would perhaps be to use a HashMap or Trie to do the lookup.

Why optimise anyway? Because at the moment a single run of NoFib's bernoulli benchmark takes about half an hour when the compiled program takes just 0.1s. That's quite a deal breaker for an exhaustive benchmark run of all 11* benchmarks.

csabahruska commented 2 years ago

Hi, I've refactored the primop handling to use hashmap lookup. Try out the interpreter from this branch: https://github.com/grin-compiler/ghc-whole-program-compiler-project/tree/primop-map/external-stg-interpreter

see: https://github.com/grin-compiler/ghc-whole-program-compiler-project/commit/f5e6806a60995e491b8e333d33ec0fa07b1c4867

sgraf812 commented 2 years ago

Wow, that was fast!

And it also resulted in a great speedup, now my reduced benchmark case takes 35s instead of 55s. Great work!

Another perhaps superior suggestion might be to intern all primop names and use an IntMap/Array, but the profile suggests that *.evalPrimOp is no longer a bottle-neck, so why bother.

sgraf812 commented 2 years ago

I think I closed prematurely, after all the changes aren't on master yet. But I think https://github.com/grin-compiler/ghc-whole-program-compiler-project/commit/f5e6806a60995e491b8e333d33ec0fa07b1c4867 fixes the issue.

grin-compiler / ghc-whole-program-compiler-project

evalPrimOp is a bottle-neck due to linear search #8