lexi-lambda / freer-simple

A friendly effect system for Haskell
https://hackage.haskell.org/package/freer-simple
BSD 3-Clause "New" or "Revised" License
228 stars 20 forks source link

inline pragmas for free #21

Closed isovector closed 5 years ago

isovector commented 5 years ago

A little inlining goes a long way in terms of performance---roughly 15%.

This PR adds INLINE pragmas to a few key Internal functions, as well as to the effects themselves. It also changes the TH to automatically generate INLINE pragmas.

Benchmark results before:

Benchmark core: RUNNING...
benchmarking State/freer.get
time                 29.44 ns   (29.32 ns .. 29.62 ns)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 30.45 ns   (29.93 ns .. 31.37 ns)
std dev              2.297 ns   (1.538 ns .. 3.283 ns)
variance introduced by outliers: 86% (severely inflated)

benchmarking State/mtl.get
time                 5.234 ns   (5.174 ns .. 5.293 ns)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 5.254 ns   (5.198 ns .. 5.424 ns)
std dev              269.0 ps   (129.8 ps .. 527.4 ps)
variance introduced by outliers: 76% (severely inflated)

benchmarking State/ee.get
time                 30.36 ns   (29.98 ns .. 30.71 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 30.16 ns   (29.99 ns .. 30.43 ns)
std dev              711.9 ps   (586.6 ps .. 875.8 ps)
variance introduced by outliers: 36% (moderately inflated)

benchmarking Countdown Bench/freer.State
time                 739.5 μs   (727.6 μs .. 752.8 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 752.1 μs   (742.3 μs .. 764.8 μs)
std dev              37.75 μs   (30.53 μs .. 59.97 μs)
variance introduced by outliers: 42% (moderately inflated)

benchmarking Countdown Bench/mtl.State
time                 11.29 μs   (11.19 μs .. 11.38 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 11.21 μs   (11.11 μs .. 11.36 μs)
std dev              403.8 ns   (272.5 ns .. 707.6 ns)
variance introduced by outliers: 44% (moderately inflated)

benchmarking Countdown Bench/ee.State
time                 758.2 μs   (737.7 μs .. 779.1 μs)
                     0.997 R²   (0.996 R² .. 0.999 R²)
mean                 738.2 μs   (733.9 μs .. 746.8 μs)
std dev              19.72 μs   (13.63 μs .. 31.79 μs)
variance introduced by outliers: 16% (moderately inflated)

benchmarking Countdown+Except Bench/freer.ExcState
time                 720.1 μs   (702.1 μs .. 747.1 μs)
                     0.994 R²   (0.990 R² .. 0.998 R²)
mean                 759.2 μs   (742.8 μs .. 779.7 μs)
std dev              59.04 μs   (49.76 μs .. 72.72 μs)
variance introduced by outliers: 64% (severely inflated)

benchmarking Countdown+Except Bench/mtl.ExceptState
time                 7.406 μs   (7.367 μs .. 7.440 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 7.372 μs   (7.343 μs .. 7.419 μs)
std dev              120.9 ns   (88.10 ns .. 183.6 ns)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Countdown+Except Bench/ee.ExcState
time                 733.2 μs   (727.0 μs .. 739.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 742.4 μs   (735.3 μs .. 767.1 μs)
std dev              39.60 μs   (12.59 μs .. 86.60 μs)
variance introduced by outliers: 45% (moderately inflated)

benchmarking HTTP Simple DSL/freer
time                 171.2 ns   (169.4 ns .. 173.2 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 172.4 ns   (170.7 ns .. 174.3 ns)
std dev              5.948 ns   (4.913 ns .. 7.862 ns)
variance introduced by outliers: 52% (severely inflated)

benchmarking HTTP Simple DSL/free
time                 126.0 ns   (124.4 ns .. 128.0 ns)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 124.1 ns   (123.1 ns .. 125.8 ns)
std dev              4.456 ns   (3.123 ns .. 6.568 ns)
variance introduced by outliers: 55% (severely inflated)

benchmarking HTTP Simple DSL/freerN
time                 149.8 μs   (147.3 μs .. 152.1 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 151.9 μs   (150.4 μs .. 154.0 μs)
std dev              6.078 μs   (4.564 μs .. 9.037 μs)
variance introduced by outliers: 39% (moderately inflated)

benchmarking HTTP Simple DSL/freeN
time                 38.33 ms   (36.98 ms .. 39.63 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 40.20 ms   (39.52 ms .. 41.32 ms)
std dev              1.624 ms   (1.206 ms .. 2.241 ms)
variance introduced by outliers: 12% (moderately inflated)

Benchmark core: FINISH

Benchmark results after:

Benchmark core: RUNNING...
benchmarking State/freer.get
time                 23.34 ns   (22.79 ns .. 24.00 ns)
                     0.995 R²   (0.993 R² .. 0.997 R²)
mean                 22.86 ns   (22.35 ns .. 23.57 ns)
std dev              1.952 ns   (1.627 ns .. 2.694 ns)
variance introduced by outliers: 89% (severely inflated)

benchmarking State/mtl.get
time                 5.800 ns   (5.610 ns .. 6.062 ns)
                     0.990 R²   (0.987 R² .. 0.994 R²)
mean                 5.780 ns   (5.651 ns .. 5.970 ns)
std dev              517.8 ps   (420.2 ps .. 621.2 ps)
variance introduced by outliers: 91% (severely inflated)

benchmarking State/ee.get
time                 30.70 ns   (29.16 ns .. 31.88 ns)
                     0.991 R²   (0.988 R² .. 0.996 R²)
mean                 29.44 ns   (28.87 ns .. 30.28 ns)
std dev              2.271 ns   (1.734 ns .. 2.868 ns)
variance introduced by outliers: 87% (severely inflated)

benchmarking Countdown Bench/freer.State
time                 708.1 μs   (698.8 μs .. 719.1 μs)
                     0.996 R²   (0.991 R² .. 0.999 R²)
mean                 734.2 μs   (719.9 μs .. 761.6 μs)
std dev              72.36 μs   (42.16 μs .. 116.2 μs)
variance introduced by outliers: 74% (severely inflated)

benchmarking Countdown Bench/mtl.State
time                 11.21 μs   (11.13 μs .. 11.30 μs)
                     0.999 R²   (0.997 R² .. 0.999 R²)
mean                 11.21 μs   (11.15 μs .. 11.30 μs)
std dev              238.9 ns   (178.3 ns .. 330.7 ns)
variance introduced by outliers: 21% (moderately inflated)

benchmarking Countdown Bench/ee.State
time                 687.5 μs   (682.3 μs .. 693.0 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 708.5 μs   (695.0 μs .. 730.5 μs)
std dev              60.02 μs   (32.65 μs .. 91.12 μs)
variance introduced by outliers: 68% (severely inflated)

benchmarking Countdown+Except Bench/freer.ExcState
time                 698.1 μs   (692.0 μs .. 704.0 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 697.7 μs   (694.4 μs .. 702.6 μs)
std dev              12.86 μs   (9.798 μs .. 18.02 μs)

benchmarking Countdown+Except Bench/mtl.ExceptState
time                 7.383 μs   (7.302 μs .. 7.504 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 7.355 μs   (7.322 μs .. 7.408 μs)
std dev              136.8 ns   (86.83 ns .. 199.8 ns)
variance introduced by outliers: 18% (moderately inflated)

benchmarking Countdown+Except Bench/ee.ExcState
time                 687.2 μs   (682.5 μs .. 692.9 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 703.2 μs   (693.0 μs .. 722.7 μs)
std dev              44.07 μs   (26.06 μs .. 70.85 μs)
variance introduced by outliers: 54% (severely inflated)

benchmarking HTTP Simple DSL/freer
time                 151.1 ns   (148.0 ns .. 156.4 ns)
                     0.997 R²   (0.993 R² .. 1.000 R²)
mean                 149.1 ns   (148.2 ns .. 152.1 ns)
std dev              4.742 ns   (1.635 ns .. 9.768 ns)
variance introduced by outliers: 48% (moderately inflated)

benchmarking HTTP Simple DSL/free
time                 116.9 ns   (116.1 ns .. 117.9 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 116.5 ns   (116.0 ns .. 117.3 ns)
std dev              2.243 ns   (1.478 ns .. 3.181 ns)
variance introduced by outliers: 25% (moderately inflated)

benchmarking HTTP Simple DSL/freerN
time                 147.6 μs   (145.0 μs .. 151.3 μs)
                     0.987 R²   (0.976 R² .. 0.994 R²)
mean                 164.2 μs   (157.2 μs .. 172.9 μs)
std dev              27.40 μs   (20.79 μs .. 34.89 μs)
variance introduced by outliers: 92% (severely inflated)

benchmarking HTTP Simple DSL/freeN
time                 36.67 ms   (33.70 ms .. 41.02 ms)
                     0.979 R²   (0.962 R² .. 1.000 R²)
mean                 34.77 ms   (34.18 ms .. 36.44 ms)
std dev              1.936 ms   (862.1 μs .. 3.567 ms)
variance introduced by outliers: 18% (moderately inflated)

Benchmark core: FINISH
isovector commented 5 years ago

This is part of a bigger project to improve freer's performance. I want to record my attempts here, because it seems like as good a place as any:

After inlining this stuff, the major cost centers become Eff's fmap and (<*>). The latter, I think, is because (<*>) calls fmap.

COST CENTRE    MODULE                       SRC                                                   %time %alloc

oneGet         Main                         bench/Core.hs:24:1-68                                  34.8   23.2
countDown      Main                         bench/Core.hs:(30,1)-(31,71)                           33.5   27.5
fmap           Control.Monad.Freer.Internal src/Control/Monad/Freer/Internal.hs:(140,3)-(141,39)   14.4   31.9
<*>            Control.Monad.Freer.Internal src/Control/Monad/Freer/Internal.hs:(148,3)-(150,41)   12.2   14.5
>>=            Control.Monad.Freer.Internal src/Control/Monad/Freer/Internal.hs:(154,3)-(155,28)    2.9    0.0

But this is curious! fmap and ap are marked INLINE and aren't recursive, so I don't understand why they aren't actually being inlined.

I went down a rabbit hole on this---presumably the high allocations from fmap are caused by the FTCQueue snoc. The shape of this thing under normal circumstances seems to be a linked list---so I tried converting FTCQueue into a type-aligned finger tree, thinking a more balanced shape would cut down on allocations. The result was 2x slower than FTCQueue, so I didn't look too deeply at the allocations.

I haven't yet gone down this road, but my next idea is to implement the FTCQueue as a mutable vector---presumably an arena allocation strategy would outcompete whatever it's doing now. I'll post back on that if/when I have any progress.

lexi-lambda commented 5 years ago

Thank you for doing this work, it’s much appreciated.

One thing I worry about is the nature of the microbenchmarks—I wouldn’t be shocked if the inlining were to help in such tiny pieces of code (which create effects and immediately run them) but would be less helpful in real programs (where the inliner is likely to be less aggressive, and the extra inlining might not expose any additional optimizations). I realize this is harder to measure, but have you tried benchmarking your changes on a real program of your own that uses freer-simple? Do you see a real speedup in those programs, too?

isovector commented 5 years ago

This is certainly a concern; unfortunately I don't have any real-world freer-simple code on hand (it's all locked up behind closed-source old-employment doors). I'd be happy to run profiling on more realistic code if you can point me to some.

To be honest however; I'm in the middle of writing a guide on why Eff is actually pretty cool and am addressing potential concerns. For better or worse, one of the primary concern is "Eff is slow as indicated in these microbenchmarks."

I'm OK holding off on merging this PR---the fact that it exists is probably enough fuel for me for now; "look, here's a PR that gives you 15% improvement in microbenchmarks" should be enough to show people that this isn't fundamentally an issue, so much as an active area of research and optimization.

lexi-lambda commented 5 years ago

This project is on the back burner for me right now, so I don’t think I can spare the time to proactively find some better benchmarks, but I’m also still invested in keeping this project maintained, so I don’t want to give the appearance of complete apathy. I appreciate any and all efforts towards making this library better!

Given it’s marketing you’re talking about, I feel I’d be remiss not to mention @robrix’s fused-effects, as it seems to be extremely fast while still being, fundamentally, an extensible effects system. While it doesn’t seem obviously possible to get freer-simple’s nice API using those techniques, I’m not sure that this is fundamental. Maybe we just need some more help from GHC in a way we haven’t quite figured out yet.

isovector commented 5 years ago

24 is better