Closed: isovector closed this 5 years ago
This is part of a bigger project to improve freer's performance. I want to record my attempts here, because it seems like as good a place as any:
After inlining this stuff, the major cost centers become `Eff`'s `fmap` and `(<*>)`. The latter, I think, is because `(<*>)` calls `fmap`.
```
COST CENTRE  MODULE                        SRC                                                    %time  %alloc
oneGet       Main                          bench/Core.hs:24:1-68                                   34.8    23.2
countDown    Main                          bench/Core.hs:(30,1)-(31,71)                            33.5    27.5
fmap         Control.Monad.Freer.Internal  src/Control/Monad/Freer/Internal.hs:(140,3)-(141,39)    14.4    31.9
<*>          Control.Monad.Freer.Internal  src/Control/Monad/Freer/Internal.hs:(148,3)-(150,41)    12.2    14.5
>>=          Control.Monad.Freer.Internal  src/Control/Monad/Freer/Internal.hs:(154,3)-(155,28)     2.9     0.0
```
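For context, the code in question looks roughly like this (my own simplified sketch of the internals, not the library's exact source): both `fmap` and `(<*>)` snoc a continuation onto the type-aligned `FTCQueue`, which is where those allocations come from.

```haskell
{-# LANGUAGE GADTs #-}

-- Simplified sketch of a freer-style Eff and its FTCQueue. Names and
-- shapes approximate the real library; details are elided.

-- A type-aligned queue of continuations from a to b in effect monad m.
data FTCQueue m a b where
  Leaf :: (a -> m b) -> FTCQueue m a b
  Node :: FTCQueue m a x -> FTCQueue m x b -> FTCQueue m a b

-- Snoc: append one continuation (allocates a Node every time).
(|>) :: FTCQueue m a x -> (x -> m b) -> FTCQueue m a b
q |> f = Node q (Leaf f)

-- Eff is either a pure value or a pending effect plus its continuation queue.
data Eff e a where
  Val :: a -> Eff e a
  E   :: e x -> FTCQueue (Eff e) x a -> Eff e a

instance Functor (Eff e) where
  fmap f (Val a) = Val (f a)
  fmap f (E u q) = E u (q |> (Val . f))   -- snoc: one new Node per fmap

instance Applicative (Eff e) where
  pure = Val
  Val f <*> m = fmap f m                  -- (<*>) bottoms out in fmap
  E u q <*> m = E u (q |> (\f -> fmap f m))
```

So every `fmap` over a pending effect grows the queue by a node, and `(<*>)` pays that cost twice over via its `fmap` calls.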
But this is curious! `fmap` and `ap` are marked `INLINE` and aren't recursive, so I don't understand why they aren't actually being inlined.
I went down a rabbit hole on this---presumably the high allocations from `fmap` are caused by the `FTCQueue` snoc. The shape of this thing under normal circumstances seems to be a linked list---so I tried converting `FTCQueue` into a type-aligned finger tree, thinking a more balanced shape would cut down on allocations. The result was 2x slower than `FTCQueue`, so I didn't look too deeply at the allocations.
I haven't yet gone down this road, but my next idea is to implement the `FTCQueue` as a mutable vector---presumably an arena allocation strategy would outcompete whatever it's doing now. I'll post back on that if/when I have any progress.
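To make the idea concrete, here is a very rough sketch of what I mean (my own illustration, not anything from the library): the type-aligned chain of continuations lives in one growable mutable array, with intermediate types erased to `Any`, so most appends are a single array write rather than a fresh node. A real implementation would also have to deal with persistence and sharing, which this sketch deliberately ignores.

```haskell
import Data.Array.IO (IOArray, getBounds, newArray, readArray, writeArray)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import GHC.Exts (Any)
import Unsafe.Coerce (unsafeCoerce)

-- A type-aligned chain of functions stored in one growable mutable array.
-- The phantom parameters a and b track only the endpoints; the stored
-- functions are erased to Any.
data FunBuf a b = FunBuf (IORef (IOArray Int Any)) (IORef Int)

-- Start a buffer holding a single function.
newBuf :: (a -> x) -> IO (FunBuf a x)
newBuf f = do
  arr <- newArray (0, 3) (unsafeCoerce ())
  writeArray arr 0 (unsafeCoerce f)
  FunBuf <$> newIORef arr <*> newIORef 1

-- Append a function, doubling the array when full. This is the arena-style
-- part: the common case is one write into existing storage, no per-node box.
snocBuf :: FunBuf a x -> (x -> b) -> IO (FunBuf a b)
snocBuf (FunBuf bref nref) f = do
  arr <- readIORef bref
  (_, hi) <- getBounds arr
  n <- readIORef nref
  arr' <- if n > hi
    then do
      bigger <- newArray (0, 2 * hi + 1) (unsafeCoerce ())
      mapM_ (\i -> readArray arr i >>= writeArray bigger i) [0 .. hi]
      writeIORef bref bigger
      pure bigger
    else pure arr
  writeArray arr' n (unsafeCoerce f)
  writeIORef nref (n + 1)
  pure (FunBuf bref nref)

-- Run the whole chain left to right.
applyBuf :: FunBuf a b -> a -> IO b
applyBuf (FunBuf bref nref) a = do
  arr <- readIORef bref
  n <- readIORef nref
  let go i v
        | i >= n    = pure (unsafeCoerce v)
        | otherwise = do
            f <- readArray arr i
            go (i + 1) (unsafeCoerce f v)
  go 0 (unsafeCoerce a)
```

The `unsafeCoerce` is justified only because the phantom parameters guarantee adjacent functions compose; the obvious open problem is that snoc mutates shared storage, which a pure `Eff` API can't tolerate without extra machinery.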
Thank you for doing this work, it’s much appreciated.
One thing I worry about is the nature of the microbenchmarks—I wouldn’t be shocked if the inlining were to help in such tiny pieces of code (which create effects and immediately run them) but would be less helpful in real programs (where the inliner is likely to be less aggressive, and the extra inlining might not expose any additional optimizations). I realize this is harder to measure, but have you tried benchmarking your changes on a real program of your own that uses `freer-simple`? Do you see a real speedup in those programs, too?
This is certainly a concern; unfortunately I don't have any real-world `freer-simple` code on hand (it's all locked up behind closed-source old-employment doors). I'd be happy to run profiling on more realistic code if you can point me to some.
To be honest, however, I'm in the middle of writing a guide on why `Eff` is actually pretty cool, and am addressing potential concerns in it. For better or worse, one of the primary concerns is "`Eff` is slow, as indicated in these microbenchmarks."
I'm OK holding off on merging this PR---the fact that it exists is probably enough fuel for me for now; "look, here's a PR that gives you 15% improvement in microbenchmarks" should be enough to show people that this isn't fundamentally an issue, so much as an active area of research and optimization.
This project is on the back burner for me right now, so I don’t think I can spare the time to proactively find some better benchmarks, but I’m also still invested in keeping this project maintained, so I don’t want to give the appearance of complete apathy. I appreciate any and all efforts towards making this library better!
Given it’s marketing you’re talking about, I feel I’d be remiss not to mention @robrix’s fused-effects, as it seems to be extremely fast while still being, fundamentally, an extensible effects system. While it doesn’t seem obviously possible to get freer-simple’s nice API using those techniques, I’m not sure that this is fundamental. Maybe we just need some more help from GHC in a way we haven’t quite figured out yet.
A little inlining goes a long way in terms of performance---roughly 15%.
This PR adds `INLINE` pragmas to a few key `Internal` functions, as well as to the effects themselves. It also changes the TH to automatically generate `INLINE` pragmas.

Benchmark results before:

Benchmark results after:
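To illustrate the TH change, here is a hand-written approximation of the generated code (`Teletype` is a made-up example effect, and the `Eff`/`send` stand-ins below are simplified so the snippet is self-contained; only the `INLINE` placement reflects this PR).

```haskell
{-# LANGUAGE GADTs #-}
{-# LANGUAGE RankNTypes #-}

-- A made-up example effect for illustration.
data Teletype a where
  ReadLine  :: Teletype String
  WriteLine :: String -> Teletype ()

-- Drastically simplified stand-in for the real Eff and send, just so the
-- generated senders below type-check on their own.
newtype Eff e a = Eff { runEff :: (forall x. e x -> x) -> a }

send :: e a -> Eff e a
send op = Eff (\handle -> handle op)
{-# INLINE send #-}

-- The senders the TH generates, each now carrying an INLINE pragma so GHC
-- can see through the thin wrapper around `send`:
readLine :: Eff Teletype String
readLine = send ReadLine
{-# INLINE readLine #-}

writeLine :: String -> Eff Teletype ()
writeLine s = send (WriteLine s)
{-# INLINE writeLine #-}
```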