Closed julmb closed 2 years ago
I am more than sure this is not a bug in the library. I can say this with certainty because you are using boxed vectors, and in `vector`, boxed vectors do not have any special treatment for tuples of any size; it is completely polymorphic in the type of its element.
A couple of pointers I can give you:

- `foldl'` is strict to weak head normal form (WHNF) only, so your accumulator is still accumulating thunks instead of actual values. My guess is that GHC can figure out, up to a tuple of size 5 or so, that it is better to keep the tuple values strict, but gives up for larger tuples.
- Use an `Unbox`ed vector instead of a boxed one for values like `Double` and tuples thereof.

I can also recommend asking this question on StackOverflow instead. The issue tracker in general should be reserved for issues and feature requests, not questions like "why is my code slow?". I'll close this ticket for now, but if you are pretty confident that it is an actual bug in `vector`, then feel free to reopen the ticket and we can dig deeper into the problem.
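To illustrate the first pointer, here is a minimal sketch (the names are mine, not from the attached code): with a tuple accumulator, `foldl'` only forces the outer constructor on each step, so the components stay as thunks unless you force them yourself.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- foldl' forces the accumulator only to WHNF; for a pair that is just
-- the (,) constructor, so the sums inside remain unevaluated thunks.
sumPairLazy :: [Double] -> (Double, Double)
sumPairLazy = foldl' (\(a, b) x -> (a + x, b + 2 * x)) (0, 0)

-- Bang patterns force both components on every step, which lets GHC
-- keep them evaluated (and typically unboxed) inside the loop.
sumPairStrict :: [Double] -> (Double, Double)
sumPairStrict = foldl' (\(!a, !b) x -> (a + x, b + 2 * x)) (0, 0)

main :: IO ()
main = do
  print (sumPairLazy [1 .. 10])   -- (55.0,110.0), but built from thunks
  print (sumPairStrict [1 .. 10]) -- (55.0,110.0), computed strictly
```

Both versions return the same result; the difference shows up in heap allocation and in whether the inner loop can run on unboxed doubles.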
First off, sorry for creating this issue, I should have really posted this on StackOverflow instead. I was so puzzled by this strange behavior with different tuple sizes that I jumped to conclusions, convincing myself that this has to be some weird stream fusion related bug. Thank you for taking the time to give some helpful pointers regardless.
I spent the entire day trying to figure this out and I think I finally got it. I will describe my findings here in case anyone with a similar problem finds this.
It turns out that, just like @lehins suspected, this has nothing to do with `vector`. It also turns out that it has nothing to do with tuples, or even sequences at all. The issue can already be reproduced with a function as simple as this:
```haskell
ten :: Int -> Double
ten = go 0 0 0 0 0 0 0 0 0 0 where
  go a b c d e f g h i j 0 = a + b + c + d + e + f + g + h + i + j
  go a b c d e f g h i j k =
    go (a + 1) (b + 1) (c + 1) (d + 1) (e + 1) (f + 1) (g + 1) (h + 1) (i + 1) (j + 1) (k - 1)
```
Apparently, GHC has a limit on the number of arguments for the worker functions that enable dealing with unboxed values. If the number of arguments exceeds that limit, no worker is created and boxed values are used. This can be configured with the option `-fmax-worker-args` (https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html#ghc-flag--fmax-worker-args=%E2%9F%A8n%E2%9F%A9). With this set to 16 (instead of the default 10), no heap allocation takes place and everything is very fast. No strictness annotations are required for the tuple items either; GHC can apparently figure it all out on its own.
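For reference, the flag can also be set per module with an `OPTIONS_GHC` pragma; here is a self-contained sketch using the function above (the value 16 is simply the one reported to work here, not a general recommendation):

```haskell
{-# OPTIONS_GHC -O2 -fmax-worker-args=16 #-}
-- Raising the limit from the default 10 allows GHC to create an
-- unboxed worker for the ten-argument loop below.

ten :: Int -> Double
ten = go 0 0 0 0 0 0 0 0 0 0 where
  go a b c d e f g h i j 0 = a + b + c + d + e + f + g + h + i + j
  go a b c d e f g h i j k =
    go (a + 1) (b + 1) (c + 1) (d + 1) (e + 1) (f + 1)
       (g + 1) (h + 1) (i + 1) (j + 1) (k - 1)

main :: IO ()
main = print (ten 1000) -- each of the ten accumulators counts up to 1000
```

The same flag can of course be passed project-wide via `ghc-options` instead of a per-module pragma.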
My theory is then that `zip` of `replicate` and `enumFromN` ends up with a different number of arguments in the fully inlined and stream-fused function, causing the limit to fire at different tuple sizes for each. This is probably also the reason why, in my production code, this only happened with sequences of pairs instead of single values.
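As an aside, the maintainer's second pointer (unboxed vectors) sidesteps the worker-argument question for the element values entirely; a hedged sketch (the function name and sizes are mine):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Unboxed as U

-- An unboxed vector of pairs of Doubles is stored as two flat Double
-- arrays (structure-of-arrays), so no per-element boxes or tuple
-- constructors are allocated on the heap.
sumZipped :: Int -> (Double, Double)
sumZipped n =
  U.foldl' (\(!a, !b) (x, y) -> (a + x, b + y)) (0, 0)
           (U.zip (U.replicate n 1) (U.enumFromN 0 n))

main :: IO ()
main = print (sumZipped 10) -- (10.0,45.0)
```

This requires an `Unbox` instance for the element type, which `vector` provides for `Double` and for tuples of unboxable components.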
Haskell performance really is frustratingly unpredictable and finicky...
@julmb Thank you for posting the solution, I am sure someone will find it helpful in the future. It is not the first time I have seen the default value for `-fmax-worker-args` being too low and causing performance issues. I didn't think about it since you were dealing with tuples, but in the end I guess tuples are also a special kind of function.
> Haskell performance really is frustratingly unpredictable and finicky...
I totally agree. I am glad you were able to figure it out :+1:
I am seeing some performance issues that I find very puzzling.
I derived the following minimal example from my production code which suffers from this.
There are similar tests for tuple sizes 2 through 9, the complete test setup is attached.
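Since the attachment itself is not inlined here, the following is a hypothetical reconstruction of the shape of such a test for the 3-tuple case (the names `u` and `v`, the `INLINE` pragmas, and the fold are assumptions based on the description in this thread; the actual Main.hs may differ):

```haskell
import qualified Data.Vector as V

-- Hypothetical benchmark shape: zip boxed vectors into a vector of
-- 3-tuples, then reduce with a strict left fold.
u :: Int -> V.Vector Double
u n = V.enumFromN 0 n
{-# INLINE u #-}

v :: Int -> V.Vector Double
v n = V.replicate n 1
{-# INLINE v #-}

test3 :: Int -> Double
test3 n =
  V.foldl' (\acc (a, b, c) -> acc + a + b + c) 0
           (V.zip3 (u n) (u n) (v n))

main :: IO ()
main = print (test3 10) -- 2 * (0+1+...+9) + 10 = 100.0
```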
Using `-O2 -fllvm -optlo-O3` on `ghc-8.10.7` and `llvm-13.0.1`, I get the following results (I also tested without LLVM, with similar results, just slightly slower overall).

In the `enum-enum` case, everything is very fast up until 5-tuples, then something seems to change, causing massive slowdown and significant memory allocation. In the `enum-replicate` case (where one of the zipped vectors is now `replicate` instead of `enumFromN`), 6-tuples are still fast, but a similar thing to the `enum-enum` case happens with larger tuples.

I experimented with various strictness annotations, which made almost no difference. Removing the inlining annotations for `u` and `v` changes the performance as follows.

All cases (except for 9-tuples) now show basically uniform performance. This represents a significant slowdown for the small cases (2-6) and a significant speedup for the large cases (7-9). I am at a complete loss as to how a lack of inlining can cause such a massive improvement in the case of large tuples.
What is going on here? Is this a stream fusion issue? Why does `enumFromN` behave differently from `replicate`? From my understanding, both `enumFromN` and `replicate` should be readily fusible with `foldl'`. Or maybe there are some optimizations that only work up to 5-tuples? Then again, that still would not explain why, in `enum-replicate`, 6-tuples are still fast.

In the end, I am very confused and would appreciate any pointers on what is happening here and how I can get the cases with large tuples to run as fast as the ones with small tuples. In principle, there should be no reason for them to be significantly slower, right?
Main.hs.txt