newhoggy opened this issue 5 years ago
well, let's work it out and measure!
https://hackage.haskell.org/package/vector-0.12.0.1/docs/Data-Vector-Fusion-Stream-Monadic.html
Write a mapAccumL for the Stream type, and then it's super easy to have that be the internal implementation for the rest :)
(Modulo a few subtleties: I think you need to make sure your stream step function is non-recursive, but that should be fine. Plus there's some other stuff, but I'm here to help (and to trick someone into learning stream fusion :) ).)
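A minimal, base-only sketch of what such a Stream-level mapAccumL can look like. The Stream type below is a simplified, pure stand-in for Data.Vector.Fusion.Stream.Monadic; all names are illustrative, not vector's actual API:

```haskell
{-# LANGUAGE ExistentialQuantification #-}
module Main where

data Step s a = Yield a s | Skip s | Done

data Stream a = forall s. Stream (s -> Step s a) s

-- The point made above: the step function is non-recursive, and the
-- accumulator is threaded through the stream state rather than
-- through a monad transformer.
mapAccumLS :: (acc -> a -> (acc, b)) -> acc -> Stream a -> Stream b
mapAccumLS f z (Stream step s0) = Stream step' (z, s0)
  where
    step' (acc, s) = case step s of
      Yield x s' -> let (acc', y) = f acc x in Yield y (acc', s')
      Skip s'    -> Skip (acc, s')
      Done       -> Done
{-# INLINE mapAccumLS #-}

fromList :: [a] -> Stream a
fromList = Stream step
  where
    step []       = Done
    step (x : xs) = Yield x xs

toList :: Stream a -> [a]
toList (Stream step s0) = go s0
  where
    go s = case step s of
      Yield x s' -> x : go s'
      Skip s'    -> go s'
      Done       -> []

main :: IO ()
main = print (toList (mapAccumLS (\acc x -> (acc + x, acc + x)) 0 (fromList [1 .. 5 :: Int])))
-- running sums: [1,3,6,10,15]
```

Because step' is itself non-recursive, a consumer's loop over the fused stream can be unrolled and simplified by GHC, which is what the fusion framework relies on.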
I've updated my PR with an implementation of mapAccumL for streams. I couldn't see how I could do it without StateT. As soon as I use StateT, I pay a hefty 30x performance penalty.
See https://github.com/haskell-works/hw-prim/pull/38, specifically these benchmark results, where mapAccumLFusable is my unsuccessful attempt.
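For reference, a list-level sketch of the State-based shape described here (not the PR's actual vector code): mapAccumL as a traversal whose accumulator lives in StateT. The transformers package ships with GHC:

```haskell
import Control.Monad.Trans.State.Strict (runStateT, state)
import Data.Functor.Identity (runIdentity)

mapAccumLViaState :: (acc -> a -> (acc, b)) -> acc -> [a] -> (acc, [b])
mapAccumLViaState f z xs = (acc, ys)
  where
    -- traverse threads the accumulator left to right through StateT
    (ys, acc) = runIdentity (runStateT (traverse step xs) z)
    step x = state (\a -> let (a', y) = f a x in (y, a'))

main :: IO ()
main = print (mapAccumLViaState (\a x -> (a + x, a + x)) 0 [1 .. 5 :: Int])
-- (15,[1,3,6,10,15])
```

One plausible source of the overhead discussed below is that the transformer's chain of binds is harder for GHC to optimise away than a plain accumulating parameter in the stream state.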
$ ./project.sh bench
hw-prim-0.6.2.19: benchmarks
Running 1 benchmarks...
Benchmark bench: RUNNING...
benchmarking medium.csv/mapAccumL for DVS.Vector Word64
time 2.084 ms (2.042 ms .. 2.149 ms)
0.997 R² (0.994 R² .. 0.999 R²)
mean 2.133 ms (2.113 ms .. 2.160 ms)
std dev 79.03 μs (60.33 μs .. 100.7 μs)
variance introduced by outliers: 22% (moderately inflated)
benchmarking medium.csv/mapAccumLViaStrictState for DVS.Vector Word64
time 98.55 ms (97.85 ms .. 99.32 ms)
1.000 R² (1.000 R² .. 1.000 R²)
mean 99.12 ms (98.70 ms .. 100.4 ms)
std dev 1.053 ms (239.3 μs .. 1.709 ms)
benchmarking medium.csv/mapAccumLViaLazyState for DVS.Vector Word64
time 61.51 ms (60.62 ms .. 62.18 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 62.98 ms (62.47 ms .. 63.64 ms)
std dev 1.086 ms (797.4 μs .. 1.478 ms)
benchmarking medium.csv/mapAccumLFusable for DVS.Vector Word64
time 61.69 ms (61.28 ms .. 61.94 ms)
1.000 R² (0.999 R² .. 1.000 R²)
mean 62.22 ms (61.86 ms .. 62.65 ms)
std dev 696.0 μs (420.0 μs .. 1.077 ms)
OK, so we need to do some digging, it seems. Also perhaps some comparing of the Core generated for each of these ... hrmm.
(Also, thanks for getting the ball rolling!)
What are the best options for generating Core in this situation?
My prior attempts at generating Core have created quite an unintelligible monstrosity (at least for me).
I don't remember the precise spelling of the flags, though ghc --show-options | grep $FOO, where FOO is a word like "dump", "simpl", "suppress", or "file", should give you the correct spellings of
-ddump-simpl -dsuppress-all -ddump-to-file
however they're spelt.
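For the record, those spellings are real GHC flags: -ddump-simpl, -dsuppress-all, and -ddump-to-file. A sketch of the suggested search plus a typical invocation (MyModule.hs is a placeholder name):

```shell
# Look up the exact flag spellings, as suggested above:
ghc --show-options | grep -i dump | grep -i simpl
ghc --show-options | grep -i suppress

# Typical invocation; with -ddump-to-file the Core lands in a
# MyModule.dump-simpl file instead of on stdout:
ghc -O2 -ddump-simpl -dsuppress-all -ddump-to-file MyModule.hs
```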
also: adding those [0] and [1] annotations is usually driven by a reason!
(We could try to do some sort of rewrite rule to write back out to the optimized version when we can't fuse, but let's profile using -g3 and perf tools or something to understand what's happening for each of these, or something.)
With those annotations, I've tried to keep them the same as what the vector package does.
I've read a little bit about rewrite rules and how the ordering of inlining and rewriting matters, although I haven't got a good way to debug these.
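A minimal, self-contained illustration of the phase annotations and a rewrite rule (the same mechanism vector drives through its INLINE_FUSED/INLINE_INNER macros; the Stream here is a toy, not vector's). For debugging which rules actually fire, GHC's -ddump-rule-firings flag is useful:

```haskell
module Main where

newtype Stream a = Stream [a]  -- toy stand-in

stream :: [a] -> Stream a
stream = Stream
{-# INLINE [1] stream #-}    -- inlined only from simplifier phase 1
                             -- onwards (phases count down: 2, 1, 0)

unstream :: Stream a -> [a]
unstream (Stream xs) = xs
{-# INLINE [1] unstream #-}

-- The rule can only fire while 'stream' and 'unstream' are still
-- visible as calls, i.e. before the phase-1 inlinings replace them;
-- that is what the [0]/[1] annotations orchestrate.
{-# RULES "stream/unstream" forall s. stream (unstream s) = s #-}

main :: IO ()
main = print (unstream (stream [1, 2, 3 :: Int]))
```

Compiling with -O2 -ddump-rule-firings shows whether "stream/unstream" fired; rules that never fire usually mean an inlining happened too early.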
BTW, thanks for jumping on and helping with this 😀
Both were generated with ./project.sh bench --ghc-options "-ddump-simpl -dsuppress-all"
You will want the dump to file flag and then upload them as attachments :)
@treeowl, while I understand stream fusion isn't the flavor you're most comfy with, do you have any good suggestions John can try to help with the fusion stuff? Or something? :)
One crazy thought I just had is a hybrid version:
Let’s look at the versions where the result is stream structured but one or none of the inputs are.
I think we need to be careful ... our stream benchmarks may be just including something we assume isn’t being recomputed. Or something.
Fast hand-rolled version: MapAccumL.dump-simpl.gz
Slower lazy StateT-based version: MapAccumLFusable.dump-simpl.gz
After some discussion on the GHC channel on IRC with mpickering, I found that $s$wfoldrM_loop_sgzz produces a list which $s$wfoldlM'_loop_sgzs immediately consumes.
Data.Vector.Storable.mapM calls Data.Vector.Generic.mapM, which calls Data.Vector.Generic.unstreamM. And unstreamM is implemented as:

    unstreamM :: (Monad m, Vector v a) => MBundle m u a -> m (v a)
    {-# INLINE_FUSED unstreamM #-}
    unstreamM s = do
      xs <- MBundle.toList s
      return $ unstream $ Bundle.unsafeFromList (MBundle.size s) xs
Could conversion to list be the cause of the performance issue?
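For intuition about why that list round-trip can hurt: it materialises a cons cell per element before the unstream, whereas a direct fill into a preallocated mutable buffer does not. A base+array-only sketch of the two strategies (function names made up for illustration; this is not vector's code):

```haskell
{-# LANGUAGE RankNTypes #-}
import Control.Monad.ST (ST, runST)
import Data.Array.ST (runSTUArray, newArray, writeArray)
import Data.Array.Unboxed (UArray, elems)

-- List-shaped strategy, analogous to unstreamM's MBundle.toList
-- round trip: every element passes through a list first.
fillViaList :: Int -> (Int -> ST s Int) -> ST s [Int]
fillViaList n gen = mapM gen [0 .. n - 1]

-- Direct-fill strategy: elements go straight into a mutable buffer
-- of known size, with no intermediate cons cells.
fillDirect :: Int -> (forall s. Int -> ST s Int) -> UArray Int Int
fillDirect n gen = runSTUArray $ do
  arr <- newArray (0, n - 1) 0
  let loop i
        | i >= n = return arr
        | otherwise = do
            x <- gen i
            writeArray arr i x
            loop (i + 1)
  loop 0

main :: IO ()
main = do
  print (runST (fillViaList 5 (\i -> return (i * i))))
  print (elems (fillDirect 5 (\i -> return (i * i))))
```

Both produce [0,1,4,9,16]; the difference is purely in the intermediate allocation, which is the kind of cost a benchmark like the above would surface.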
I'm not awake yet so I've not caught up on everything. But it's definitely true that GHC is much more conservative about simplifying code it thinks is recursive, so this could be a culprit. I'll try to investigate a bit today.
I tried writing an alternate mapM function implementation that works on Bundle directly, but I'm stuck on how to deal with the effect m when trying to implement fc.
    mapM :: forall m v a b . Monad m => (a -> m b) -> Bundle m v a -> Bundle m v b
    {-# INLINE_FUSED mapM #-}
    mapM f Bundle { sElems = s, sChunks = c, sVector = v, sSize = n } = Bundle
      { sElems  = S.mapM f s
      , sChunks = S.mapM fc c
      , sVector = Nothing
      , sSize   = n
      }
      where
        fc :: Chunk v a -> m (Chunk v b)
        fc = undefined
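One way to see where fc gets stuck: if I'm reading the source right, the real Chunk (in Data.Vector.Fusion.Bundle.Monadic) carries only a size and a write-only continuation, so the elements are never observable in the bundle's monad m. In a toy reformulation where a chunk can produce its elements in m, the effectful map is immediate, which suggests the obstacle is the Chunk representation rather than the mapping itself (ChunkP and mapChunkP are made-up names):

```haskell
import Data.Functor.Identity (Identity (..))

-- Toy chunk whose elements are producible in the monad m.
data ChunkP m a = ChunkP Int (m [a])

-- With the elements observable, the monadic map is easy:
mapChunkP :: Monad m => (a -> m b) -> ChunkP m a -> ChunkP m b
mapChunkP f (ChunkP n get) = ChunkP n (get >>= mapM f)

main :: IO ()
main = do
  let ChunkP n get = mapChunkP (\x -> Identity (x * 2)) (ChunkP 3 (Identity [1, 2, 3 :: Int]))
  print (n, runIdentity get)
```

The real Chunk exposes no such producer, so a monadic fc would have to run the writer continuation just to observe the elements, interleaving two different monads.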
One question: how does producing the final seed affect fusion on the other side? Another question: how do the general problems fusing unstreamM play here, and is there any way to work around those?
this fell off the wagon but should be revisited at some point
Yes please 😁
Anything I can do to help move this forward?
I've implemented a hand-rolled version, and another two versions based on a combination of mapM and the lazy and strict versions of the State monad: https://github.com/haskell-works/hw-prim/pull/38

The benchmarks show that the hand-rolled version runs two times faster than the lazy state monad version and 16 times faster than the strict state monad version. I found the slow performance of the strict monad version most surprising.

I'm aware that the version using mapM might enable fusion; however, it is a fair bit slower than the hand-rolled version that defeats fusion. I would love to have a fusion-enabled version that runs as fast as the hand-rolled one. Would that be possible?