haskell / vector

An efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework.

Performance issues with `foldl'` and large tuples #446

Closed julmb closed 2 years ago

julmb commented 2 years ago

I am seeing some performance issues that I find very puzzling.

I derived the following minimal example from my production code which suffers from this.

import qualified Data.Vector as V
import Test.Tasty.Bench

{-# INLINE six #-}
six :: V.Vector (Double, Double) -> Double
six = g . V.foldl' f (0, 0, 0, 0, 0, 0) where
    f (a, b, c, d, e, f) (x, y) = (a + x, b + y, c + 1, d + 1, e + 1, f + 1)
    g (a, b, c, d, e, f) = a + b + c + d + e + f

main :: IO ()
main = defaultMain [bu, bv] where
    bu = bgroup "enum-enum" [bench "6" $ whnf (six . u) n]
    bv = bgroup "enum-replicate" [bench "6" $ whnf (six . v) n]
    {-# INLINE u #-}; u n = V.zip (V.enumFromN 1 n) (V.enumFromN 1 n)
    {-# INLINE v #-}; v n = V.zip (V.enumFromN 1 n) (V.replicate n 1)
    n = 2000000

There are similar tests for tuple sizes 2 through 9, the complete test setup is attached.

Using `-O2 -fllvm -optlo-O3` on ghc-8.10.7 and llvm-13.0.1, I get the following results (I also tested without LLVM, with similar results, just slightly slower overall).

enum-enum
  2: 1.72 ms ±  84 μs,   0 B  allocated,   0 B  copied, 2.0 MB peak memory
  3: 1.71 ms ± 106 μs,   0 B  allocated,   0 B  copied, 2.0 MB peak memory
  4: 1.96 ms ±  96 μs,   0 B  allocated,   0 B  copied, 2.0 MB peak memory
  5: 2.03 ms ±  95 μs,   0 B  allocated,   0 B  copied, 2.0 MB peak memory
  6: 48.3 ms ± 2.7 ms, 519 MB allocated, 284 KB copied, 2.0 MB peak memory
  7: 1.951 s ± 112 ms, 1.5 GB allocated, 2.1 GB copied, 987 MB peak memory
  8: 2.385 s ±  23 ms, 1.6 GB allocated, 2.4 GB copied, 1.1 GB peak memory
  9: 2.644 s ± 150 ms, 1.7 GB allocated, 2.8 GB copied, 1.2 GB peak memory
enum-replicate
  2: 2.15 ms ±  86 μs,   0 B  allocated,   0 B  copied, 1.2 GB peak memory
  3: 2.16 ms ± 208 μs,   0 B  allocated,   0 B  copied, 1.2 GB peak memory
  4: 2.01 ms ±  97 μs,   0 B  allocated,   0 B  copied, 1.2 GB peak memory
  5: 1.71 ms ±  90 μs,   0 B  allocated,   0 B  copied, 1.2 GB peak memory
  6: 1.79 ms ±  90 μs,   0 B  allocated,   0 B  copied, 1.2 GB peak memory
  7: 1.717 s ±  67 ms, 1.3 GB allocated, 1.9 GB copied, 1.2 GB peak memory
  8: 2.124 s ±  99 ms, 1.5 GB allocated, 2.2 GB copied, 1.2 GB peak memory
  9: 2.282 s ± 179 ms, 1.6 GB allocated, 2.3 GB copied, 1.2 GB peak memory

In the enum-enum case, everything is very fast up until 5-tuples, then something seems to change, causing massive slowdown and significant memory allocation. In the enum-replicate case (where one of the zipped vectors is now replicate instead of enumFromN), 6-tuples are still fast, but a similar thing to the enum-enum case happens with larger tuples.

I experimented with various strictness annotations, which made almost no difference. Removing the inlining annotations for u and v changes the performance as follows.

enum-enum
  2: 153  ms ± 2.0 ms, 122 MB allocated, 288 MB copied, 179 MB peak memory
  3: 153  ms ± 2.9 ms, 122 MB allocated, 286 MB copied, 179 MB peak memory
  4: 163  ms ±  14 ms, 122 MB allocated, 287 MB copied, 179 MB peak memory
  5: 153  ms ± 2.9 ms, 122 MB allocated, 286 MB copied, 179 MB peak memory
  6: 154  ms ± 5.2 ms, 122 MB allocated, 286 MB copied, 179 MB peak memory
  7: 154  ms ± 2.0 ms, 122 MB allocated, 286 MB copied, 179 MB peak memory
  8: 154  ms ± 3.1 ms, 122 MB allocated, 286 MB copied, 179 MB peak memory
  9: 162  ms ± 4.8 ms, 183 MB allocated, 287 MB copied, 179 MB peak memory
enum-replicate
  2: 127  ms ± 3.0 ms,  91 MB allocated, 210 MB copied, 208 MB peak memory
  3: 128  ms ± 2.8 ms,  91 MB allocated, 210 MB copied, 208 MB peak memory
  4: 128  ms ± 3.5 ms,  91 MB allocated, 210 MB copied, 208 MB peak memory
  5: 128  ms ± 3.6 ms,  91 MB allocated, 210 MB copied, 208 MB peak memory
  6: 129  ms ± 2.8 ms,  91 MB allocated, 210 MB copied, 208 MB peak memory
  7: 129  ms ± 2.1 ms,  91 MB allocated, 211 MB copied, 208 MB peak memory
  8: 129  ms ± 3.7 ms,  91 MB allocated, 211 MB copied, 208 MB peak memory
  9: 135  ms ± 1.6 ms, 153 MB allocated, 211 MB copied, 208 MB peak memory

All cases (except for 9-tuples) now show basically uniform performance. This represents a significant slowdown for the small cases (2-6), and a significant speedup for the large cases (7-9). I am at a complete loss as to how the lack of inlining can cause such a massive improvement in the case of large tuples.

What is going on here? Is this a stream fusion issue? Why does enumFromN behave differently from replicate? From my understanding, both enumFromN and replicate should be readily fusable with foldl'. Or maybe there are some optimizations that only work up to 5-tuples? Then again, that still would not explain why in enum-replicate, 6-tuples are still fast.

In the end, I am very confused and would appreciate any pointers on what is happening here and how I can get the cases with large tuples to run as fast as the ones with small tuples. In principle, there should be no reason for them to be significantly slower, right?

Main.hs.txt

lehins commented 2 years ago

I am more than sure this is not a bug in the library. I can say this with certainty because you are using boxed vectors, and in vector, boxed vectors do not have any special treatment for tuples of any size; the implementation is completely polymorphic in the type of its element.
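To illustrate the distinction (a sketch for this thread, not a fix for the benchmark above): `Data.Vector` stores each element behind a pointer, whatever its type, while `Data.Vector.Unboxed` has `Unbox` instances for small tuples and stores their fields in separate flat arrays:

```haskell
import qualified Data.Vector as V          -- boxed: one pointer per element
import qualified Data.Vector.Unboxed as U  -- unboxed: a (Double, Double) vector
                                           -- is stored as two flat Double arrays

-- Sum the components of a boxed vector of pairs; the elements are
-- heap objects that the fold has to follow pointers to.
sumBoxed :: V.Vector (Double, Double) -> (Double, Double)
sumBoxed = V.foldl' (\(a, b) (x, y) -> (a + x, b + y)) (0, 0)

-- The same fold over an unboxed vector; U.Vector requires an Unbox
-- instance for its element type, which vector provides for small tuples.
sumUnboxed :: U.Vector (Double, Double) -> (Double, Double)
sumUnboxed = U.foldl' (\(a, b) (x, y) -> (a + x, b + y)) (0, 0)
```

Both folds go through the same stream-fusion machinery; only the element representation differs.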

A couple of pointers I can give you:

I can also recommend asking this question on StackOverflow instead. The issue tracker should in general be reserved for bugs and feature requests, not questions like "why is my code slow?". I'll close this ticket for now, but if you are pretty confident that it is an actual bug in vector, feel free to reopen it and we can dig deeper into the problem.

julmb commented 2 years ago

First off, sorry for creating this issue, I should have really posted this on StackOverflow instead. I was so puzzled by this strange behavior with different tuple sizes that I jumped to conclusions, convincing myself that this has to be some weird stream fusion related bug. Thank you for taking the time to give some helpful pointers regardless.

I spent the entire day trying to figure this out and I think I finally got it. I will describe my findings here in case anyone with a similar problem finds this.

It turns out that just like @lehins suspected, this has nothing to do with vector. It also turns out that it has nothing to do with tuples, or even sequences at all. The issue can already be reproduced with a function as simple as this.

ten :: Int -> Double
ten = go 0 0 0 0 0 0 0 0 0 0 where
    go a b c d e f g h i j 0 = a + b + c + d + e + f + g + h + i + j
    go a b c d e f g h i j k = go (a + 1) (b + 1) (c + 1) (d + 1) (e + 1) (f + 1) (g + 1) (h + 1) (i + 1) (j + 1) (k - 1)

Apparently, GHC has a limit on the number of arguments for the worker functions that let it deal with unboxed values. If the number of arguments exceeds that limit, no worker is created and boxed values are used. This can be configured with the option -fmax-worker-args (https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html#ghc-flag--fmax-worker-args=%E2%9F%A8n%E2%9F%A9). With it set to 16 (instead of the default 10), no heap allocation takes place and everything is very fast. No strictness annotations are required for the tuple components either; GHC can apparently figure it all out on its own.
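As a sketch of the fix (the module layout and `main` here are illustrative), the flag can also be set per-module with an OPTIONS_GHC pragma, so the `ten` example above gets a full worker without changing the build configuration:

```haskell
{-# OPTIONS_GHC -O2 -fmax-worker-args=16 #-}
-- Raising the worker-argument limit above the default of 10 lets GHC
-- create a worker that passes all ten accumulators plus the counter
-- as unboxed values instead of heap-allocated boxes.

ten :: Int -> Double
ten = go 0 0 0 0 0 0 0 0 0 0 where
    go a b c d e f g h i j 0 = a + b + c + d + e + f + g + h + i + j
    go a b c d e f g h i j k = go (a + 1) (b + 1) (c + 1) (d + 1) (e + 1)
                                  (f + 1) (g + 1) (h + 1) (i + 1) (j + 1) (k - 1)

main :: IO ()
main = print (ten 1000000)  -- each accumulator ends at 1e6, so this is 1.0e7
```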

My theory is then that zip of replicate and enumFromN ends up with a different number of arguments in the fully inlined and stream-fused function, causing the limit to fire at different tuple sizes for each. This is probably also the reason why in my production code, this only happened with sequences of pairs instead of single values.
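To make that theory concrete, here is a hedged, hand-written approximation of the loops stream fusion might produce (the state layout is simplified and the names are mine; GHC's actual Core will differ, so the exact argument counts are a guess):

```haskell
-- enum-enum: zip (enumFromN 1 n) (enumFromN 1 n) folded with a 6-tuple.
-- The worker carries 6 accumulators plus a counter for *each* enumFromN
-- stream plus the remaining length, so it crosses the argument limit
-- at a smaller tuple size.
sixEnumEnum :: Int -> Double
sixEnumEnum n = go 0 0 0 0 0 0 1 1 n where
    go a b c d e f x y k
        | k <= 0    = a + b + c + d + e + f
        | otherwise = go (a + x) (b + y) (c + 1) (d + 1) (e + 1) (f + 1)
                         (x + 1) (y + 1) (k - 1)

-- enum-replicate: the replicated value is a constant, so the worker needs
-- one argument fewer, which would explain why 6-tuples still stay fast.
sixEnumRep :: Int -> Double
sixEnumRep n = go 0 0 0 0 0 0 1 n where
    go a b c d e f x k
        | k <= 0    = a + b + c + d + e + f
        | otherwise = go (a + x) (b + 1) (c + 1) (d + 1) (e + 1) (f + 1)
                         (x + 1) (k - 1)
```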

Haskell performance really is frustratingly unpredictable and finicky...

lehins commented 2 years ago

@julmb Thank you for posting the solution, I am sure someone will find it helpful in the future. It is not the first time I have seen the default value for -fmax-worker-args being too low and causing performance issues. I didn't think about it since you were dealing with tuples, but in the end I guess tuples are also a special kind of function.

> Haskell performance really is frustratingly unpredictable and finicky...

I totally agree. I am glad you were able to figure it out :+1: