`dunai`: `morphGS` is probably inefficient

turion commented 1 year ago

While investigating https://github.com/turion/rhine/issues/227, I found that replacing the current definition of arrM in terms of morphGS with a direct definition gives a considerable speedup, and I'm trying here to explain where this speedup might come from. I'd recommend replacing the definition, because arrM is ubiquitous in every dunai program.

The implementations:

https://github.com/ivanperez-keera/dunai/blob/fc0559b0658adb868d474a5421d8d6dde5bb2ca8/dunai/src/Data/MonadicStreamFunction/Core.hs#LL122C1-L130C72

arrM :: Monad m => (a -> m b) -> MSF m a b
arrM f =
  -- This implementation is equivalent to:
  -- arrM f = go
  --   where
  --     go = MSF $ \a -> do
  --            b <- f a
  --            return (b, go)
  morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) C.id

The implementation in terms of morphGS has the advantage of not using the MSF constructor, but it is also much slower in runtime.

Lengthy derivation of example program evaluation

## Example program Let's consider this simple example program which creates a random string (using a fictitious function `rand :: IO String`) and prints it: ```haskell reactimate $ arrM (const rand) >>> arrM putStrLn ``` Let us expand the definitions of the functions to see how this is evaluated and executed: ```haskell = let msf = arrM (const rand) >>> arrM putStrLn in reactimate msf = let msf = arrM (const rand) >>> arrM putStrLn in do (_, msf') <- unMSF msf () reactimate msf' = let msf = MSF $ \a -> do (b, sf1') <- unMSF (arrM (const rand)) a (c, sf2') <- unMSF (arrM putStrLn) b c `seq` return (c, sf1' >>> sf2') in do (_, msf') <- unMSF msf () reactimate msf' = do (b, sf1') <- unMSF (arrM (const rand)) () (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' ``` Now it depends on how `arrM` is defined. Let us chose the simple direct implementation first (the one that is not used in the library): ## The direct implementation ```haskell = do (b, sf1') <- let go = MSF $ \a -> do b <- const rand a return (b, go) in unMSF go () (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do (b, sf1') <- let go = MSF $ \a -> do b <- const rand a return (b, go) in do b <- const rand () return (b, go) (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do -- Let's not care too much where the let floats, GHC will figure out the best place let go = MSF $ \a -> do b <- const rand a return (b, go) b <- const rand () (b, sf1') <- return (b, go) (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do -- Actually do some IO! rand produces value r0 let go = MSF $ \a -> do b <- const rand a return (b, go) (b, sf1') <- return (r0, go) (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do let go = MSF $ \a -> do b <- const rand a return (b, go) (c, sf2') <- unMSF (arrM putStrLn) r0 (_, msf') <- c `seq` return (c, go >>> sf2') reactimate msf' = do let go = MSF $ \a -> do b <- const rand a return (b, go) (c, sf2') <- let go' = MSF $ \b -> do c <- putStrLn b return (c, go') in unMSF go' r0 (_, msf') <- c `seq` return (c, go >>> sf2') reactimate msf' = do let go = MSF $ \a -> do b <- const rand a return (b, go) (c, sf2') <- let go' = MSF $ \b -> do c <- putStrLn b return (c, go') in do c <- putStrLn r0 return (c, go') (_, msf') <- c `seq` return (c, go >>> sf2') reactimate msf' = do let go = MSF $ \a -> do b <- const rand a return (b, go) let go' = MSF $ \b -> do c <- putStrLn b return (c, go') (c, sf2') <- do c <- putStrLn r0 return (c, go') (_, msf') <- c `seq` return (c, go >>> sf2') reactimate msf' = do -- Output r0 let go = MSF $ \a -> do b <- const rand a return (b, go) let go' = MSF $ \b -> do c <- putStrLn b return (c, go') (c, sf2') <- return ((), go') (_, msf') <- c `seq` return (c, go >>> sf2') reactimate msf' = do let go = MSF $ \a -> do b <- const rand a return (b, go) let go' = MSF $ \b -> do c <- putStrLn b return (c, go') (_, msf') <- () `seq` return ((), go >>> go') reactimate msf' = do let go = MSF $ \a -> do b <- const rand a return (b, go) let go' = MSF $ \b -> do c <- putStrLn b return (c, go') reactimate (go >>> go') ``` We've created two thunks, one for each component of the whole `MSF`, and executed both their bodies. From here on, the program will go on endlessly, and no new thunks are generated. ```haskell = do let go = MSF $ \a -> do b <- const rand a return (b, go) let go' = MSF $ \b -> do c <- putStrLn b return (c, go') (_, msf') <- unMSF (go >>> go') () reactimate msf' = ... ``` Maybe GHC does further optimizations (like inlining something or optimizing the `return`), I don't know. I didn't look at the core because I don't understand that well yet. Now what happens if we define `arrM` in the current way? ## The library implementation ```haskell ... = do (b, sf1') <- unMSF (arrM (const rand)) () (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do (b, sf1') <- unMSF ((\f -> morphGS (\i a -> i a >>= \(_, c) -> f a >>= \b -> return (b, c)) C.id) (const rand)) () (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do (b, sf1') <- unMSF (morphGS (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) C.id) () (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do (b, sf1') <- let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') in unMSF (go C.id) () (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do (b, sf1') <- let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') in do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF C.id) () return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF C.id) () (b, sf1') <- return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') (b2, msf') <- let pap = unMSF C.id in pap () >>= \(_, c) -> const rand a >>= \b -> return (b, c) (b, sf1') <- return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' ``` Note that we have now created a thunk for the partial application `unMSF msf`. ```haskell = do let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') pap = unMSF C.id goId = MSF $ \a -> return (a, goId) (b2, msf') <- return ((), goId) >>= \(_, c) -> const rand a >>= \b -> return (b, c) (b, sf1') <- return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' ``` There are no references to the partial application `pap` anymore, it needs to be garbage collected! ```haskell = do let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') goId = MSF $ \a -> return (a, goId) (b2, msf') <- return ((), goId) >>= \(_, c) -> const rand a >>= \b -> return (b, c) (b, sf1') <- return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') let goId = MSF $ \a -> return (a, goId) (b2, msf') <- const rand () >>= \b -> return (b, goId) (b, sf1') <- return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do -- Actually do side effect let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') let goId = MSF $ \a -> return (a, goId) (b2, msf') <- return (r0, goId) (b, sf1') <- return (b2, go msf') (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' = do let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') let goId = MSF $ \a -> return (a, goId) (b, sf1') <- return (r0, go goId) (c, sf2') <- unMSF (arrM putStrLn) b (_, msf') <- c `seq` return (c, sf1' >>> sf2') reactimate msf' ``` What's interesting here is that we've created _two_ thunks for something where we only needed one thunk before. And worse, there was a third thunk from the partial application, which then needs to be garbage collection. So we might already suspect that (in the absence of clever optimizations), this version will use a constant factor more space, and put load on the garbage collector. For the additional thunks, the reason is not so much the definition in terms of helper functions, but `morphGS` being _higher order_ in the `MSF`! It takes an `MSF` as input (in this case a trivial identity function), and that needs to be carried around as well now. I also believe that it cannot be optimized away easily, because `morphGS` doesn't apply `unMSF` fully (like e.g. `>>>` does), but only partially, `unMSF msf`. That way the optimizer cannot inline the definition of the `MSF`. But this is speculation, I don't really understand the optimizer in detail, and haven't looked at the Core. Also, this partial application seems to trigger the garbage collector. Anyways, let's continue the evaluation. Since the other `arrM` is defined in the same way, one might think that it would analogously produce a further thunk, but I believe that `id` is shared: ```haskell = do -- In analogy to the last derivation let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') let goId = MSF $ \a -> return (a, goId) let go' msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> putStrLn a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go' msf') (_, msf') <- () `seq` return ((), go goId >>> go' goId) reactimate msf' = do -- In analogy to the last derivation let go msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> const rand a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go msf') let goId = MSF $ \a -> return (a, goId) let go' msf = MSF $ \a2 -> do (b2, msf') <- (\i a -> i a >>= \(_, c) -> putStrLn a >>= \b -> return (b, c)) (unMSF msf) a2 return (b2, go' msf') reactimate (go goId >>> go' goId) ```

Summary

I believe that with the current version, we probably use more time, space, and we introduce garbage collection in the first place although there is no internal state. The red flags are probably either higher order functions in MSF or partial application of unMSF. This is all theoretical, but it fits the observations in https://github.com/turion/rhine/issues/227.

ivanperez-keera commented 1 year ago

@turion thanks a lot for this very detailed report. I know it's a lot to ask on top of what you've already done, but would you be able to put together a small benchmark that shows the difference in terms of performance with one and the other?

turion commented 1 year ago

Yes, I managed to find a benchmark using ListT that reproduces the issue, I believe. It's much more visible there because many MSFs are launched.

Reproduce with cabal run spaceleak --enable-profiling -- +RTS -hc -l-agu -i0.01 && eventlog2html spaceleak.eventlog on branch https://github.com/turion/dunai/tree/dev_spaceleak.

Library on branch `develop`, with benchmark added

89d4125fca045bd0f332b4323729e95066791c11 Adding spaceleak

Replaced `arrM` definition with direct implementation

110e026 Write out arrM

Huge performance win (time per tick is reduced a lot)
Peak memory usage reduced from 28 to 21 MB
Space leak still visible

Heavy inlining and strictification in many places (shotgun optimization)

11ba331 Inline many functions

Again huge performance win
Peak memory usage further reduced
Space leak still visible
Unclear yet which of the many optimizations had an effect, further research necessary

turion commented 1 year ago

The branch may serve as a starting point for further future optimizations, of which arrM is only one.

turion commented 1 year ago

Further bisecting of all the possible inlines suggests that inlining arr is also a huge performance win.

turion commented 1 year ago

My current strategy to identify the minimal set of necessary changes is:

Bisect on the profiling result until a commit is identified that improves performance
Cherry-pick that commit onto a branch improvements
Rebase the whole branch with all inlinings onto improvements
Repeat until the performance improving commit is already on improvements

ivanperez-keera commented 11 months ago

Just a note: Issue #375 addressed the introduction of a benchmark. It will require discussion before making changes, but it may be worth seeing if those benchmarks, or how they are used, should be improved.

turion commented 11 months ago

Trying the benchmarks briefly seems to show that these changes here have a significant impact on the benchmarks. I think it would be very helpful if we could plot benchmarks (+ error bars) for several commits to compare them.

ivanperez-keera / dunai