haskell / vector

An efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework.

Add vector version of mapAccumL that behaves like the list version #228

Open newhoggy opened 5 years ago

newhoggy commented 5 years ago

I've implemented a hand-rolled version, and another two versions based on a combination of mapM and the lazy and strict variants of the State monad.

https://github.com/haskell-works/hw-prim/pull/38

The benchmarks show that the hand-rolled version runs two times faster than the lazy State monad version and 16 times faster than the strict State monad version.

I found the slow performance of the strict State monad version the most surprising.

I'm aware that the version using mapM might enable fusion; however, it is a fair bit slower than the hand-rolled version, which defeats fusion.

I would love to have a fusion-enabled version that runs as fast as the hand-rolled version. Would that be possible?
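
For readers who want the shapes to be concrete, here is a minimal sketch of the two approaches described above. This is not the code from the PR; the storable-vector flavour and all function names are assumptions made for illustration.

import Control.Monad.ST (runST)
import Control.Monad.Trans.State.Strict (runState, state)
import qualified Data.Vector.Storable as DVS
import qualified Data.Vector.Storable.Mutable as DVSM

-- Hypothetical hand-rolled mapAccumL: allocate the result, walk the indices,
-- and thread the accumulator through the loop by hand.
mapAccumL :: (DVS.Storable a, DVS.Storable b)
          => (acc -> a -> (acc, b)) -> acc -> DVS.Vector a -> (acc, DVS.Vector b)
mapAccumL f acc0 xs = runST $ do
  let n = DVS.length xs
  ys <- DVSM.unsafeNew n
  let go i acc
        | i >= n    = return acc
        | otherwise = do
            let (acc', y) = f acc (DVS.unsafeIndex xs i)
            DVSM.unsafeWrite ys i y
            go (i + 1) acc'
  accN   <- go 0 acc0
  frozen <- DVS.unsafeFreeze ys
  return (accN, frozen)

-- Hypothetical mapM-plus-State variant: each element becomes a State action
-- that consumes and updates the accumulator, and DVS.mapM sequences them.
mapAccumLViaState :: (DVS.Storable a, DVS.Storable b)
                  => (acc -> a -> (acc, b)) -> acc -> DVS.Vector a -> (acc, DVS.Vector b)
mapAccumLViaState f acc0 xs =
  let (ys, accN) = runState (DVS.mapM step xs) acc0
  in  (accN, ys)
  where
    step x = state (\acc -> let (acc', y) = f acc x in (y, acc'))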

cartazio commented 5 years ago

Well, let's work it out and measure!

https://hackage.haskell.org/package/vector-0.12.0.1/docs/Data-Vector-Fusion-Stream-Monadic.html

Write a mapAccumL for the Stream type, and then it's super easy to have that be the internal implementation for the rest :)

cartazio commented 5 years ago

(Modulo a few subtleties, like I think you need to make sure your stream step function is non-recursive, but that should be fine, plus there's some other stuff; but I'm here to help (and trick someone into learning stream fusion :) ).)
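
To make that suggestion concrete, here is a minimal sketch of what a Stream-level mapAccumL with a non-recursive step function might look like. The types follow Data.Vector.Fusion.Stream.Monadic, but this is only a sketch, and the question of how to return the final accumulator is deliberately ignored.

import Data.Vector.Fusion.Stream.Monadic (Step (..), Stream (..))

-- Sketch only: the accumulator is carried in the stream state and threaded
-- through by a non-recursive step function. Note that the final accumulator
-- is simply dropped here, which is one of the open design questions.
mapAccumL :: Monad m => (acc -> a -> (acc, b)) -> acc -> Stream m a -> Stream m b
mapAccumL f acc0 (Stream step s0) = Stream step' (acc0, s0)
  where
    {-# INLINE step' #-}
    step' (acc, s) = do
      r <- step s
      return $ case r of
        Yield x s' -> let (acc', y) = f acc x in Yield y (acc', s')
        Skip    s' -> Skip (acc, s')
        Done       -> Done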

newhoggy commented 5 years ago

I've updated my PR with an implementation of mapAccumL for streams. I couldn't see how I could do it without StateT. As soon as I use StateT, I pay a hefty 30x performance penalty.

newhoggy commented 5 years ago

See https://github.com/haskell-works/hw-prim/pull/38

Specifically, these benchmark results, where mapAccumLFusable is my unsuccessful attempt:

$ ./project.sh bench
hw-prim-0.6.2.19: benchmarks
Running 1 benchmarks...
Benchmark bench: RUNNING...
benchmarking medium.csv/mapAccumL               for DVS.Vector Word64
time                 2.084 ms   (2.042 ms .. 2.149 ms)
                     0.997 R²   (0.994 R² .. 0.999 R²)
mean                 2.133 ms   (2.113 ms .. 2.160 ms)
std dev              79.03 μs   (60.33 μs .. 100.7 μs)
variance introduced by outliers: 22% (moderately inflated)

benchmarking medium.csv/mapAccumLViaStrictState for DVS.Vector Word64
time                 98.55 ms   (97.85 ms .. 99.32 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 99.12 ms   (98.70 ms .. 100.4 ms)
std dev              1.053 ms   (239.3 μs .. 1.709 ms)

benchmarking medium.csv/mapAccumLViaLazyState   for DVS.Vector Word64
time                 61.51 ms   (60.62 ms .. 62.18 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 62.98 ms   (62.47 ms .. 63.64 ms)
std dev              1.086 ms   (797.4 μs .. 1.478 ms)

benchmarking medium.csv/mapAccumLFusable        for DVS.Vector Word64
time                 61.69 ms   (61.28 ms .. 61.94 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 62.22 ms   (61.86 ms .. 62.65 ms)
std dev              696.0 μs   (420.0 μs .. 1.077 ms)

cartazio commented 5 years ago

OK, so we need to do some digging, it seems.

Also, perhaps some comparing of the Core generated for each of these... hrmm.

cartazio commented 5 years ago

(Also, thanks for getting the ball rolling!)

newhoggy commented 5 years ago

What are the best options for generating Core in this situation?

My prior attempts at generating Core have produced quite an unintelligible monstrosity (at least for me).

cartazio commented 5 years ago

I don't remember the precise spelling of the flags, but ghc --show-options | grep $FOO, where FOO is a word like "dump", "simpl", "suppress", or "file", should give you the correct spellings of -ddump-simpl, -dsuppress-all, and -ddump-to-file, however they're spelt.

cartazio commented 5 years ago

Also: adding those [0] and [1] annotations is usually driven by a reason!

(We could try some sort of rewrite rule that falls back to the optimised hand-rolled version when we can't fuse, but let's profile using -g3 and perf tools or something first, to understand what's happening for each of these.)

newhoggy commented 5 years ago

As for those annotations, I've tried to keep them the same as what the vector package uses.

I've read a little bit about rewrite rules and how the ordering of inlining and rewriting matters, although I haven't got a good way to debug these.
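
For anyone following along, here is a stripped-down illustration of the pattern those phase annotations support. This is a paraphrase of the general stream-fusion scheme rather than vector's actual source, and myMap is a made-up name.

import qualified Data.Vector.Fusion.Bundle as Bundle
import qualified Data.Vector.Generic as G

-- Illustration only. Combinators in vector are written as
-- unstream . <bundle transformer> . stream, with phase-controlled pragmas
-- (INLINE_FUSED is roughly INLINE [1], INLINE_INNER roughly INLINE [0]) so
-- that stream and unstream survive long enough for the library's
-- "stream/unstream" rewrite rule to cancel adjacent pairs and eliminate the
-- intermediate bundle; only after that are the inner step functions inlined.
myMap :: (G.Vector v a, G.Vector v b) => (a -> b) -> v a -> v b
{-# INLINE myMap #-}
myMap f = G.unstream . Bundle.map f . G.stream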

newhoggy commented 5 years ago

BTW, thanks for jumping on and helping with this 😀

newhoggy commented 5 years ago

Both were generated with ./project.sh bench --ghc-options "-ddump-simpl -dsuppress-all"

cartazio commented 5 years ago

You will want the dump-to-file flag, and then upload the dumps as attachments :)

@treeowl, while I understand stream fusion isn't the flavor you're most comfy with, do you have any good suggestions John can try to help with the fusion stuff? Or something? :)

cartazio commented 5 years ago

One crazy thought I just had is a hybrid version:

Let's look at the versions where the result is stream-structured but one or none of the inputs are.

I think we need to be careful... our stream benchmarks may just be including something we assume isn't being recomputed. Or something.

newhoggy commented 5 years ago

Fast hand-rolled version: MapAccumL.dump-simpl.gz

Slower lazy StateT-based version: MapAccumLFusable.dump-simpl.gz

newhoggy commented 5 years ago

After some discussion with mpickering on the GHC channel on IRC, I found that:

$s$wfoldrM_loop_sgzz produces a list
which $s$wfoldlM'_loop_sgzs immediately consumes



Data.Vector.Storable.mapM calls Data.Vector.Generic.mapM, which in turn calls Data.Vector.Generic.unstreamM.

And unstreamM is implemented as:

unstreamM :: (Monad m, Vector v a) => MBundle m u a -> m (v a)
{-# INLINE_FUSED unstreamM #-}
unstreamM s = do
                xs <- MBundle.toList s
                return $ unstream $ Bundle.unsafeFromList (MBundle.size s) xs
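-- i.e. the monadic effects are run by draining the bundle into an ordinary
-- Haskell list with MBundle.toList, and the result vector is then rebuilt
-- from that list.
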
newhoggy commented 5 years ago

Could the conversion to a list be the cause of the performance issue?

cartazio commented 5 years ago

I'm not awake yet, so I've not caught up on everything. But it's definitely true that GHC is much more conservative about simplifying code it thinks is recursive, so this could be a culprit. I'll try to investigate a bit today.

newhoggy commented 5 years ago

I tried writing an alternate mapM implementation that works on Bundle directly, but I'm stuck on how to deal with the effect m when trying to implement fc.

mapM :: forall m v a b . Monad m => (a -> m b) -> Bundle m v a -> Bundle m v b
{-# INLINE_FUSED mapM #-}
mapM f Bundle { sElems = s, sChunks = c, sVector = v, sSize = n} = Bundle
  { sElems  = S.mapM f s
  , sChunks = S.mapM fc c
  , sVector = Nothing
  , sSize   = n
  }
  where fc :: Chunk v a -> m (Chunk v b)
        fc = undefined
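
For context, the difficulty with fc seems to come from the shape of the Chunk type, which from memory looks roughly like the following (paraphrased from Data.Vector.Fusion.Bundle.Monadic; the exact definition may differ):

data Chunk v a = Chunk Int (forall m. (PrimMonad m, Vector v a) => Mutable v (PrimState m) a -> m ())

-- The writer inside a Chunk is universally quantified over an arbitrary
-- PrimMonad, so the bundle's own effect monad m never appears in it, and
-- there is nowhere to run the effects of f when mapping over a Chunk.
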
treeowl commented 5 years ago

One question: how does producing the final seed affect fusion on the other side? Another question: how do the general problems with fusing unstreamM come into play here, and is there any way to work around those?

cartazio commented 4 years ago

This fell off the wagon but should be revisited at some point.

newhoggy commented 4 years ago

Yes please 😁

newhoggy commented 4 years ago

Anything I can do to help move this forward?