snoyberg / conduit

A streaming data library

Performance: mapM_ is very inefficient #363

Open · harendra-kumar opened this issue 6 years ago

harendra-kumar commented 6 years ago

I have been struggling to understand why conduit's performance results are out of whack in these benchmarks: https://github.com/composewell/streaming-benchmarks. I have root-caused the issue to using mapM_ to discard the stream. I picked up this usage from the benchmarks in the machines package and retained it. mapM_ is also used in other packages like pipes and streaming for the same purpose, so I did not suspect anything wrong with it, which made the issue all the harder to root-cause.

However, there is a drastic improvement in performance when sinkNull is used instead of mapM_. mapM_ is defined like this:

mapM_, mapM_C :: Monad m
              => (a -> m ())
              -> ConduitT a o m ()
mapM_C f = awaitForever $ lift . f

and used like mapM_ (\_ -> return ()) to discard the stream.

sinkNull is:

sinkNull, sinkNullC :: Monad m => ConduitT i o m ()
sinkNullC = awaitForever $ \_ -> return ()

There is not much difference between the two, except that the discarding function is defined inline in sinkNull. Is the polymorphism in mapM_ the problem, or the lift, or something else?
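
For concreteness, here is a minimal criterion sketch of the comparison (my own illustration, not part of the streaming-benchmarks suite; drainWith is a hypothetical helper, and mapM_C is the worker behind mapM_):

import Conduit
import Criterion.Main

-- Push a fixed stream of Ints into the given discarding sink.
drainWith :: ConduitT Int Void IO () -> IO ()
drainWith sink = runConduit $ yieldMany [1 :: Int .. 100000] .| sink

main :: IO ()
main = defaultMain
  [ bench "mapM_"    $ whnfIO $ drainWith $ mapM_C (\_ -> return ())
  , bench "sinkNull" $ whnfIO $ drainWith sinkNull
  ]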

harendra-kumar commented 6 years ago

I think exactly the same problem exists with mapM, which would also explain why the mapM benchmark result for conduit is significantly worse than for other libraries.

snoyberg commented 6 years ago

Thanks for the report. I'm mostly AFK this week, so haven't had a chance to look at this. These kinds of optimizations are highly sensitive to GHC inlining rules. Can you clarify exactly which stack.yaml file in your repo, and which commit, you're seeing these figures on?

harendra-kumar commented 6 years ago

The latest commit and stack.yaml should reproduce this.

./run.sh -- -m pattern conduit runs all the conduit benchmarks. You can change the benchmark file (e.g. replace sinkNull with mapM_) and run the benchmarks again, this time with the --append option: ./run.sh --append -- -m pattern conduit. This generates a comparison chart between the baseline and your new changes in the charts directory.

snoyberg commented 6 years ago

Alright, I've looked into this, and have a few comments:

  1. Conduit is not, and has never been, a low-CPU-overhead framework. It's about I/O-bound tasks. It is well known and understood that Conduit's implementation puts GC pressure on tight inner loops, which is what your test suite is covering.
  2. Still, speed is nice, so Conduit has a stream fusion framework in it. However, immediately after implementing it, it became clear that changes in GHC from version to version make it all but impossible to guarantee that its rewrite rules fire. (This point and the previous one are covered in my blog post First class stream fusion.)
  3. The question here got me curious enough to look into this in more detail. The first thing I discovered is that we can make the rewrite rules fire more reliably, which I've done in b134241f3ab545949d2cf5748de3ef012a0266b9.
  4. The setup of the benchmark suite makes it much more difficult for the rewrite rules to fire, preventing most of these optimizations. As an example, I've included a diff below that demonstrates a change that gives a 10x speedup (on my machine at least) for the transformation/map/conduit test.
  5. Furthermore, it seems possible to make that a 30x speedup if you bypass the indirection still present and construct a direct pipeline, as the Conduit guides recommend (see the sketch after the diff below). But this doesn't fit in with the benchmark framework.

In sum: it's nice to get conduit faster via rewrite rules, but it's not the intended use case of the library, and can't be relied upon. We know how to get super fast, reliable streaming code (direct stream fusion, as I laid out in my blog post), but the convenience of higher-level libraries wins out in almost all cases, which is why there doesn't seem to be pressure to make that happen.

diff --git a/Benchmarks/Conduit.hs b/Benchmarks/Conduit.hs
index 5baf369..cda392f 100644
--- a/Benchmarks/Conduit.hs
+++ b/Benchmarks/Conduit.hs
@@ -86,11 +86,13 @@ last   = eliminate $ S.last
 {-# INLINE transform #-}
 transform :: Monad m => Pipe m Int Int -> Source m () Int -> m ()
 -- mapM_ is much more costly compared to sinkNull
+transform t src = S.runConduit (src S..| t S..| S.mapM_ (\_ -> return ()))
 --transform t = runStream (t S..| S.mapM_ (\_ -> return ()))
-transform t = runStream (t S..| S.sinkNull)
+--transform t = runStream (t S..| S.sinkNull)

 scan          = transform $ S.scanl (+) 0
-map           = transform $ S.map (+1)
+--map           = transform $ S.map (+1)
+map           = \src -> S.runConduit $ src S..| S.map (+1) S..| S.mapM_ (\_ -> return ())
 mapM          = transform $ S.mapM return
 filterEven    = transform $ S.filter even
 filterAllOut  = transform $ S.filter (> maxValue)
diff --git a/stack.yaml b/stack.yaml
index 05d93fb..d0b2af2 100644
--- a/stack.yaml
+++ b/stack.yaml
@@ -18,4 +18,10 @@ extra-deps:
   - lens-4.15.4
   - free-4.12.4
   - drinkery-0.3
+
+  - git: https://github.com/snoyberg/conduit
+    commit: b134241f3ab545949d2cf5748de3ef012a0266b9
+    subdirs:
+    - conduit
+
 rebuild-ghc-options: true
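
To illustrate point 5, here is a hedged sketch of such a direct pipeline (my own construction, not the exact code behind the 30x figure): building source, transformation, and sink in a single runConduit expression keeps the whole pipeline visible to the rewrite rules.

import Conduit

-- A direct pipeline: no helper indirection between the stages.
directMap :: Monad m => m ()
directMap = runConduit
     $ yieldMany [1 :: Int .. 1000000]
    .| mapC (+ 1)
    .| sinkNull
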
harendra-kumar commented 6 years ago

I agree with your observations, and that's why I don't make any recommendation of a streaming library based on these performance numbers; they are to be used with caution. In fact, I have this caveat in the README to warn against the tendency to misuse micro-benchmarks such as these:

When choosing a streaming library, we should not be overly obsessed with the performance numbers as long as the performance is within reasonable bounds. Whether the absolute performance or the differences among libraries matter may depend on your workload. If the cost of processing the data is significantly higher than the streaming overhead, the difference may not matter at all; unless you are performing a huge number of tiny operations, the performance difference may not be significant.

The goal of these benchmarks was to measure just the cost of the operations, discounting fusion, so as not to complicate the comparison and explanation of the results. In fact, I designed them so that fusion is avoided, so it is good to know that it is not firing. I was also considering designing separate benchmarks to show the fusion capabilities of the libraries, but that's not done yet.

My main point here is that since sinkNull and mapM_ can be used interchangeably for the same job, a significant performance difference between them can be introduced inadvertently; how large the difference is depends on the use case, of course. If it is possible to fix it, that would be nice to have. Exactly the same problem is seen with pipes, where it is discard vs mapM_.
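
For reference, a sketch of the two interchangeable spellings in pipes (my own illustration; drainMapM_ and drainDiscard are hypothetical names):

import Pipes
import qualified Pipes.Prelude as P

-- Drain with mapM_ (the spelling reported as slow in this thread).
drainMapM_ :: Monad m => Producer Int m () -> m ()
drainMapM_ p = runEffect $ p >-> P.mapM_ (\_ -> return ())

-- Drain with discard (the spelling reported as fast).
drainDiscard :: Monad m => Producer Int m () -> m ()
drainDiscard p = runEffect $ for p discard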

shlok commented 5 years ago

I encountered a situation in practice where

someConduit .| awaitForever (\_ -> return ())

performs substantially worse than

someConduit .| sinkNull

I guess this is related to what you discussed above, since sinkNull is implemented under the hood exactly as awaitForever (\_ -> return ()), apart from inlining differences.

Have there been any further developments on this matter in the past year?