haskell / vector

An efficient implementation of Int-indexed arrays (both mutable and immutable), with a powerful loop optimisation framework .
Other
365 stars 139 forks source link

Consider design change in light of GHC join point work #156

Open treeowl opened 7 years ago

treeowl commented 7 years ago

The join points paper suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

cartazio commented 7 years ago

... filter can't fuse without skips can it?

On Tue, Feb 21, 2017 at 2:49 PM David Feuer notifications@github.com wrote:

The join points paper https://www.microsoft.com/en-us/research/publication/compiling-without-continuations/ suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwtSu6MWDlo1qGWdrgLswl4Muet-aks5rez-1gaJpZM4MHx21 .

treeowl commented 7 years ago

I believe the join point paper claims the new approach would allow it to do so, yes. Whether that is actually the case as implemented, I don't know.

On Feb 23, 2017 6:02 AM, "Carter Tazio Schonwald" notifications@github.com wrote:

... filter can't fuse without skips can it?

On Tue, Feb 21, 2017 at 2:49 PM David Feuer notifications@github.com wrote:

The join points paper https://www.microsoft.com/en-us/research/publication/compiling-without- continuations/ suggests that it may now be possible to avoid the need for Skip altogether. It would be worth checking if that's actually the case. If so, it might be possible to simplify the implementation considerably. Furthermore, I seem to recall that the presence of skips (which allow vectors of the same length to be represented by streams of different lengths) prevents one or more optimizations. I vaguely remember trying to do something a bit clever with replicate and finding it impossible because of skips, but I don't remember any details.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156, or mute the thread https://github.com/notifications/unsubscribe-auth/ AAAQwtSu6MWDlo1qGWdrgLswl4Muet-aks5rez-1gaJpZM4MHx21 .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-281962713, or mute the thread https://github.com/notifications/unsubscribe-auth/ABzi_d0j-U5DDGe2FmQYYxi0Dk9RF-_Vks5rfWdHgaJpZM4MHx21 .

dolio commented 7 years ago

The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip is no longer needed.

However, making this change would break (fusion-wise) the functions that require Skip on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.

cartazio commented 7 years ago

And we still need to measure impact on compile time/ code side / performance for vector.

That would change how loop breakers etc are chosen sometimes right? On Fri, Feb 24, 2017 at 12:50 AM dolio notifications@github.com wrote:

The paper says the new join point handling makes it so that recursive stepper functions are no longer a problem. So Skip is no longer needed.

However, making this change would break (fusion-wise) the functions that require Skip on every compiler version that doesn't have the new join point handling. I assume that's everything before 8.2, or at best 8.0. So I'm not sure such a change can be justified for quite some time.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-282210427, or mute the thread https://github.com/notifications/unsubscribe-auth/AAAQwr2jc6OW3lYepHs9qySRvlRAcjpGks5rfm-lgaJpZM4MHx21 .

dolio commented 7 years ago

The one way I can think of to accelerate this kind of thing is to decouple the fusion from vectors. Have modules that provide plain vectors, and other modules that somehow add in fusion. Then give a choice of which fusion framework to use.

However, I'm not very sure how to accomplish something like this. The most obvious thing to say is, "this is where you want ML modules," because fusion is sort of a thing that transforms one implementation of an API into another. But backpack is probably more of a limiting factor than the join point stuff. Perhaps a newtype could work:

newtype Fused v a = F (v a)

Where you use Fused Vector a if you want fusion to happen for your choice of Vector.

Anyhow, this is a research project on top of a research project. :)

cartazio commented 5 years ago

@treeowl whats the best way to explore the join point cleanup in your mind?

GeorgeCo commented 3 years ago

Any thoughts on this recently? From the paper it sound very interesting:

Result: simpler code, less of it, and faster to execute. It’s a straight win

Would it be worthwhile to do a small prototype to see e.g. how much it improves runtime and/or compiler performance? Perhaps some of the zip or zipWith functions ? I picked those based on the following from the paper, but I'm not sure if it applies to zip for vectors:

Now the stepper function can say to update the state and call again, obviating the need for a loop of its own. This makes filter fusible, but it complicates everything else! Everything gets three cases instead of two, leading to more code and more runtime tests; and functions like zip that consume two lists become more complicated and less efficient.

lehins commented 3 years ago

It is interesting, but it is also an enormous amount of work. @GeorgeCo You feel like working on this and submitting a PR with a prototype? If I were to work on this I would start with implementing and verifying performance of filter, because that is exactly what we need Skip for.

We already have a plan for putting all the streaming functionality that uses Stream and Step types into a separate package #355 So if this approach with relying on join points proves itself as feasible replacement for newer ghc we could then have two stream packages: one for older ghc with Skip and another package for newer ghc versions that doesn't have to rely on Skip for performance.

noughtmare commented 2 years ago

I've just done a quick scan over the code. Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. Take for example the drop function. I've removed the Step constructor and replaced its usage by a recursive call:

drop :: Monad m => Int -> Stream m a -> Stream m a
{-# INLINE_FUSED drop #-}
drop n (Stream step t) = Stream step' (t, Just n)
  where
    {-# INLINE_INNER step' #-}
    step' (s, Just i) | i > 0 = do
                                  r <- step s
                                  case r of
                                     Yield _ s' -> step' (s', Just (i-1))
                                     Done       -> return Done

    step' (s, _) = liftM (\r ->
                     case r of
                       Yield x s' -> Yield x (s', Nothing)
                       Done       -> Done
                   ) (step s)

But here the step' (s', Just (i - 1)) recursive call is not guaranteed to be a tail call (that depends on what the Monad m is), so GHC can't mark step' as a join point in general (if I understand it correctly).

Or is there another way to write this so that it can use a join point?

cartazio commented 2 years ago

This is a very cool observation, would having a flavor of monads that guarantee they respect tail calls, in a way ghc understands, solve that?

I’ve hit a similar technical issue in some of my own prjects over time, though with different motivations. I’ll ask around

On Mon, Sep 26, 2022 at 11:07 AM Jaro Reinders @.***> wrote:

I've just done a quick scan over the code. It seems to me like it is impossible to remove Skip for monadic streams. Take for example the drop function. I've removed the Step constructor and replaced its usage by a recursive call:

drop :: Monad m => Int -> Stream m a -> Stream m a {-# INLINE_FUSED drop #-}drop n (Stream step t) = Stream step' (t, Just n) where {-# INLINEINNER step' #-} step' (s, Just i) | i > 0 = do r <- step s case r of Yield s' -> step' (s', Just (i-1)) Done -> return Done

step' (s, _) = liftM (\r ->
                 case r of
                   Yield x s' -> Yield x (s', Nothing)
                   Done       -> Done
               ) (step s)

But here the step' (s', Just (i - 1)) recursive call is not guaranteed to be a tail call (that depends on what the Monad m is), so GHC can't mark step' as a join point in general (if I understand it correctly).

— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1258181207, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQXHQZAQDOBSHR2XCUTWAG33LANCNFSM4DA7DW2Q . You are receiving this because you commented.Message ID: @.***>

noughtmare commented 2 years ago

would having a flavor of monads that guarantee they respect tail calls, in a way ghc understands, solve that?

That's exactly what I was thinking!

I wonder what that looks like. I know somebody asked SPJ once after one of his talks if join points could be made explicit in the surface language and SPJ considered the possibility but didn't see an immediate use case.

I also think this is related to Alexis King's "Effects for Less" talk which mentions that it is really important that the monadic bind operator gets inlined.

Perhaps if we use her Eff monad instead of arbitrary monads in the streams then we would not have this issue.

GeorgeCo commented 2 years ago

It might be worthwhile to draw SPJ and the other author of the paper into this conversation.

noughtmare commented 2 years ago

ping @simonpj @pdownen @lukemaurer (I couldn't find Zena Ariola's github account if she even has one)

simonpj commented 2 years ago

Now I'm stuck and it seems to me like it is impossible to remove Skip for monadic streams. T

We are taking about fusion here, correct? That's only going to happen if you inine the function (in this case drop at the call site, else there is nothing to fuse with. And once you inline, you likely have fixed the monad.

I rather doubt that you'll achieve fusion for an arbitrary monad.

A good way to move this converation on would be to provide a concrete example of

Being concrete aids comprehension, and saves misinderstandings.

noughtmare commented 2 years ago

Oh, that sounds right. Now I feel silly for missing that.

To be much more concrete. Take for example the Quickhull benchmark. Before my changes it takes 55.7 ms and after removing Skip it takes 72.2 ms. Looking at the core reveals a bunch of bindings like the following which I think are to blame for this discrepancy. These are not present in the core dumps from before my changes.

letrec {
  $s$wstep'_sebJ :: Int# -> Int# ->
    Id (Step (Int, Int, Maybe (Double, Double)) ((Double, Double), Double))
  $s$wstep'_sebJ
    = \ (sc_sebF :: Int#) (sc1_sebE :: Int#) ->
        case >=# sc1_sebE ww_sdOU of {
          __DEFAULT ->
            case indexDoubleArray# ww2_sdP0 (+# ww1_sdOY sc1_sebE) of wild2_a6SW { __DEFAULT ->
            case indexDoubleArray# ww4_sdP7 (+# ww3_sdP5 sc1_sebE) of wild1_Xe { __DEFAULT ->
            case >=# sc_sebF ww7_sdOz of {
              __DEFAULT ->
                case indexDoubleArray# ipv3_adv1 sc_sebF of wild3_Xh { __DEFAULT ->
                case >## wild3_Xh 0.0## of {
                  __DEFAULT -> $s$wstep'_sebJ (+# sc_sebF 1#) (+# sc1_sebE 1#); -- <<< recursive call here
                  1# ->
                    (Yield
                       ((D# wild2_a6SW, D# wild1_Xe), D# wild3_Xh)
                       (I# (+# sc1_sebE 1#), I# (+# sc_sebF 1#), Nothing))
                    `cast` <Co:15>
                }};
              1# -> Done `cast` <Co:15>
            }}};
          1# -> Done `cast` <Co:15>
        }; } in

Here are the full pre and post dumps of all the benchmarks: bench-dumps.zip

And the particular benchmarks that got noticeably worse are: listrank, awshcc, hybcc, quickhull, findIndexR, findIndexR_manual. Especially awshcc got more than 3x slower and hybcc about 2x slower.

noughtmare commented 2 years ago

Maybe it is some kind of cross-function case of case optimization that is lacking. After reducing an example to just a V.map (* 2) . V.filter (> 10) I get this core:

letrec {
  $wstep'_s3Uf :: Int# -> Id (Step Int Double)
  $wstep'_s3Uf
    = \ (ww2_s3Ud :: Int#) ->
        case >=# ww2_s3Ud ipv1_s3Po of {
          __DEFAULT ->
            case indexDoubleArray# ipv2_s3Pp (+# ipv_s3Pn ww2_s3Ud)
            of wild2_i3d4
            { __DEFAULT ->
            case >## wild2_i3d4 10.0## of {
              __DEFAULT -> $wstep'_s3Uf (+# ww2_s3Ud 1#);
              1# ->
                (Yield (D# wild2_i3d4) (I# (+# ww2_s3Ud 1#))) `cast` <Co:5>
            }
            };
          1# -> Done `cast` <Co:5>
        }; } in
joinrec {
  $s$wfoldlM'_loop_s3WN
    :: State# RealWorld -> Int# -> Int# -> Vector Double
  $s$wfoldlM'_loop_s3WN (sc_s3WM :: State# RealWorld)
                        (sc1_s3WK :: Int#)
                        (sc2_s3WJ :: Int#)
    = case ($wstep'_s3Uf sc1_s3WK) `cast` <Co:4> of { -- <<< $wstep' can't be a join point
        Yield x_a3P4 s'_a3P5 ->
          case x_a3P4 of { D# x1_a2wq ->
          case writeDoubleArray#
                 ipv4_i3En
                 sc2_s3WJ
                 (+## x1_a2wq x1_a2wq)
                 (sc_s3WM `cast` <Co:5>)
          of s'#_i3S3
          { __DEFAULT ->
          case s'_a3P5 of { I# ww3_X4 ->
          jump $s$wfoldlM'_loop_s3WN
            (s'#_i3S3 `cast` <Co:4>) ww3_X4 (+# sc2_s3WJ 1#)
          }
          }
          };
        Done ->
          case unsafeFreezeByteArray# ipv4_i3En (sc_s3WM `cast` <Co:5>) of
          { (# ipv5_i3ci, ipv6_i3cj #) ->
          (Vector 0# sc2_s3WJ ipv6_i3cj) `cast` <Co:5>
          }
      }; } in
jump $s$wfoldlM'_loop_s3WN (ipv3_i3Em `cast` <Co:4>) 0# 0#

Maybe I'm running into https://gitlab.haskell.org/ghc/ghc/-/issues/16335 again?

simonpj commented 2 years ago

Maybe this is too hard, but could you distil out a standalone repro case?

Or (less good but still OK) supply precise repro instructions for your "reduced example". By precise I mean "check out this repo, cd to there, execute these commands..."

Looking at the code you show, I think that FloatIn should move the binding for $wstep' inwards, into the scrutinee of the case; there it will become a join point and good things should happen. I don't know why this doesn't happen.

noughtmare commented 2 years ago

Here's a standalone reproducer:

https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746

with the interesting core here:

https://gist.github.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746#file-all-dump-simpl-L2737-L2794

Commands you can run:

wget https://gist.githubusercontent.com/noughtmare/f979c0fc9eed30abcd0849d19cf0f746/raw/858db7da16a5ba542148d794655057bed33f88b3/All.hs
ghc -package ghc-prim -package primitive -O2 -ddump-simpl -dusppress-all -dno-suppress-type-signatures -ddump-to-file All.hs
noughtmare commented 2 years ago

If I translate the produced core back to Haskell:

module T where

import Control.Monad.Primitive
import Data.Primitive.Array
import System.IO.Unsafe (unsafePerformIO)

data Step s a = Yield a s | Done

uninitialised = undefined

test :: Int -> Int -> Array Double -> (Int, Int, Array Double)
test off n oldArr = unsafePerformIO $ do
  newArr <- newArray n uninitialised
  let
    step' i
      | i >= n = Done
      | otherwise =
        let x = indexArray oldArr (off + i) in
        if x > 10
        then Yield x (i + 1)
        else step' (i + 1)
    loop i j = do
      case step' i of
        Yield x s' -> do
          writeArray newArr j (x + 1)
          loop s' (j + 1)
        Done -> do
          out <- unsafeFreezeArray newArr
          return (0, j, out)
  loop 0 0

And then compile that again I do get the nice core with two join points where the intermediate Yield and Done constructors are eliminated:

$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)
$wtest
  = \ (w :: Int) (ww :: Int#) (w1 :: Array Double) ->
      runRW#
        (\ (s :: State# RealWorld) ->
           case noDuplicate# s of s' { __DEFAULT ->
           case newArray# ww uninitialised (s' `cast` <Co:3>) of
           { (# ipv, ipv1 #) ->
           joinrec {
             $s$wloop
               :: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double)
             $s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#)
               = join {
                   $j :: (Int, Int, Array Double)
                   $j
                     = case unsafeFreezeArray# ipv1 (sc `cast` <Co:3>) of
                       { (# ipv2, ipv3 #) ->
                       lazy (test1, I# sc1, Array ipv3)
                       } } in
                 case >=# sc2 ww of {
                   __DEFAULT ->
                     case w of { I# x ->
                     case w1 of { Array ds2 ->
                     case indexArray# ds2 (+# x sc2) of { (# ipv2 #) ->
                     case ipv2 of { D# x1 ->
                     case >## x1 10.0## of {
                       __DEFAULT ->
                         joinrec {
                           $wstep' :: Int# -> (Int, Int, Array Double)
                           $wstep' (ww1 :: Int#)
                             = case >=# ww1 ww of {
                                 __DEFAULT ->
                                   case indexArray# ds2 (+# x ww1) of { (# ipv3 #) ->
                                   case ipv3 of { D# x2 ->
                                   case >## x2 10.0## of {
                                     __DEFAULT -> jump $wstep' (+# ww1 1#);
                                     1# ->
                                       case writeArray#
                                              ipv1 sc1 (D# (+## x2 1.0##)) (sc `cast` <Co:3>)
                                       of s'#
                                       { __DEFAULT ->
                                       jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# ww1 1#)
                                       }
                                   } } };
                                 1# -> jump $j
                               }; } in
                         jump $wstep' (+# sc2 1#);
                       1# ->
                         case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc `cast` <Co:3>)
                         of s'#
                         { __DEFAULT ->
                         jump $s$wloop (s'# `cast` <Co:2>) (+# sc1 1#) (+# sc2 1#)
                         }
                     } } } } };
                   1# -> jump $j
                 }; } in
           jump $s$wloop (ipv `cast` <Co:2>) 0# 0#
           } })

So is the simplifier just running out of steam?

GeorgeCo commented 2 years ago

Hi Jaro

Good progress on this! wrt

So is the simplifier just running out of steam?

Did you try any of the increasing the count of any of the simplifier options in https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html ?

Unfortunately I am very far from being an expert but the one that looked most likely to be was the following and the associated -fddump-simpl-stats

-fsimpl-tick-factor=⟨n⟩ Default: 100

GHC’s optimiser can diverge if you write rewrite rules (Rewrite rules https://downloads.haskell.org/ghc/latest/docs/users_guide/exts/rewrite_rules.html#rewrite-rules) that don’t terminate, or (less satisfactorily) if you code up recursion through data types (Bugs in GHC https://downloads.haskell.org/ghc/latest/docs/users_guide/bugs.html#bugs-ghc). To avoid making the compiler fall into an infinite loop, the optimiser carries a “tick count” and stops inlining and applying rewrite rules when this count is exceeded. The limit is set as a multiple of the program size, so bigger programs get more ticks. The -fsimpl-tick-factor flag lets you change the multiplier. The default is 100; numbers larger than 100 give more ticks, and numbers smaller than 100 give fewer.

If the tick-count expires, GHC summarises what simplifier steps it has done; you can use -fddump-simpl-stats to generate a much more detailed list. Usually that identifies the loop quite accurately, because some numbers are very large.

https://downloads.haskell.org/ghc/latest/docs/users_guide/using-optimisation.html#ghc-flag--fsimpl-tick-factor=%E2%9F%A8n%E2%9F%A9

On Tue, Sep 27, 2022 at 11:17 AM Jaro Reinders @.***> wrote:

If I translate the produced core back to Haskell:

module T where import Control.Monad.Primitiveimport Data.Primitive.Arrayimport System.IO.Unsafe (unsafePerformIO) data Step s a = Yield a s | Done

uninitialised = undefined test :: Int -> Int -> Array Double -> (Int, Int, Array Double) test off n oldArr = unsafePerformIO $ do newArr <- newArray n uninitialised let step' i | i >= n = Done | otherwise = let x = indexArray oldArr (off + i) in if x > 10 then Yield x (i + 1) else step' (i + 1) loop i sc2 = do let c = step' i case c of Yield x s' -> do writeArray newArr sc2 (x + 1) loop s' (sc2 + 1) Done -> do out <- unsafeFreezeArray newArr return (0, sc2, out) loop 0 0

And then compile that again I do get the nice core with two join points where the intermediate Yield and Done constructors are eliminated:

$wtest :: Int -> Int# -> Array Double -> (Int, Int, Array Double)$wtest = \ (w :: Int) (ww :: Int#) (w1 :: Array Double) -> runRW# (\ (s :: State# RealWorld) -> case noDuplicate# s of s' { DEFAULT -> case newArray# ww uninitialised (s' cast ) of { (# ipv, ipv1 #) -> joinrec { $s$wloop :: State# RealWorld -> Int# -> Int# -> (Int, Int, Array Double) $s$wloop (sc :: State# RealWorld) (sc1 :: Int#) (sc2 :: Int#) = join { $j :: (Int, Int, Array Double) $j = case unsafeFreezeArray# ipv1 (sc cast ) of { (# ipv2, ipv3 #) -> lazy (test1, I# sc1, Array ipv3) } } in case >=# sc2 ww of { DEFAULT -> case w of { I# x -> case w1 of { Array ds2 -> case indexArray# ds2 (+# x sc2) of { (# ipv2 #) -> case ipv2 of { D# x1 -> case >## x1 10.0## of { DEFAULT -> joinrec { $wstep' :: Int# -> (Int, Int, Array Double) $wstep' (ww1 :: Int#) = case >=# ww1 ww of { DEFAULT -> case indexArray# ds2 (+# x ww1) of { (# ipv3 #) -> case ipv3 of { D# x2 -> case >## x2 10.0## of { DEFAULT -> jump $wstep' (+# ww1 1#); 1# -> case writeArray# ipv1 sc1 (D# (+## x2 1.0##)) (sc cast ) of s'# { DEFAULT -> jump $s$wloop (s'# cast ) (+# sc1 1#) (+# ww1 1#) } } } }; 1# -> jump $j }; } in jump $wstep' (+# sc2 1#); 1# -> case writeArray# ipv1 sc1 (D# (+## x1 1.0##)) (sc cast ) of s'# { __DEFAULT -> jump $s$wloop (s'# cast ) (+# sc1 1#) (+# sc2 1#) } } } } } }; 1# -> jump $j }; } in jump $s$wloop (ipv cast ) 0# 0# } })

So is the simplifier just running out of steam?

— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1259576807, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQIJ65QW6AXKBWUDW4Q7ODWAL6XHANCNFSM4DA7DW2Q . You are receiving this because you were mentioned.Message ID: @.***>

noughtmare commented 2 years ago

I did try increasing the max simplifier iterations which did not help.

But now I've rewritten it using the lower level unboxed operations:

{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module T where

import GHC.Exts
import GHC.IO

data Step s a = Yield a s | Done

uninitialised = undefined

test :: Int# -> Int# -> Array# Double -> (# Int#, Int#, Array# Double #)
test off n oldArr = runRW# $ \s0 ->
  case newArray# n uninitialised s0
   of { (# s1, newArr #) ->
  let
    step' i
      | isTrue# (i >=# n) = Done
      | otherwise =
        let (# D# x #) = indexArray# oldArr (off +# i) in
        if isTrue# (x >## 10.0##)
        then Yield (D# x) (I# (i +# 1#))
        else step' (i +# 1#)
    loop i j s2 =
      case step' i of
        Yield x (I# s') ->
          case writeArray# newArr j (x + 1) s2
           of { s3 -> 
          loop s' (j +# 1#) s3
        }
        Done ->
          case unsafeFreezeArray# newArr s2
           of { (# s3, out #) ->
          (# 0#, j, out #)
        }
  in
  loop 0# 0# s1
  }

And compiling this program does show the optimisation failure:

test
  = \ off n oldArr ->
      runRW#
        (\ s0 ->
           case newArray# n uninitialised s0 of { (# ipv, ipv1 #) ->
           letrec {
             step'
               = \ i ->
                   case >=# i n of {
                     __DEFAULT ->
                       case indexArray# oldArr (+# off i) of { (# ipv2 #) ->
                       case ipv2 of wild { D# x ->
                       case >## x 10.0## of {
                         __DEFAULT -> step' (+# i 1#);
                         1# -> Yield wild (I# (+# i 1#))
                       }
                       }
                       };
                     1# -> Done
                   }; } in
           join {
             exit j s2
               = case unsafeFreezeArray# ipv1 s2 of { (# ipv2, ipv3 #) ->
                 (# 0#, j, ipv3 #)
                 } } in
           joinrec {
             loop i j s2
               = case step' i of {
                   Yield x ds1 ->
                     case ds1 of { I# s' ->
                     case writeArray#
                            ipv1 j (case x of { D# x1 -> D# (+## x1 1.0##) }) s2
                     of s3
                     { __DEFAULT ->
                     jump loop s' (+# j 1#) s3
                     }
                     };
                   Done -> jump exit j s2
                 }; } in
           jump loop 0# 0# ipv
           })
noughtmare commented 2 years ago

I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227

cartazio commented 2 years ago

this is great :)

On Tue, Sep 27, 2022 at 11:08 AM Jaro Reinders @.***> wrote:

I think it is time to open a proper GHC issue for this, so I've done that: https://gitlab.haskell.org/ghc/ghc/-/issues/22227

— Reply to this email directly, view it on GitHub https://github.com/haskell/vector/issues/156#issuecomment-1259647533, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQVAEPOXFMAY4EJVU4DWAMEWBANCNFSM4DA7DW2Q . You are receiving this because you commented.Message ID: @.***>

noughtmare commented 1 year ago

Good news! I have been able to manually implement the "loopification" optimization that GHC is missing (see this WIP merge request of GHC) with https://github.com/haskell/vector/pull/448/commits/cfaf0d4b0102b294784378bc3d77e6c7ceb2407f. Now almost all benchmarks are the same or faster (tridiag speeds up the most from 86ms to 19.3ms) except awshcc which is still about 2x slower (406ms to 872ms). I'll look into that now. For now, I can already say is that it is able to fuse properly so the problem must be something else.

GeorgeCo commented 1 year ago

Wow !

cartazio commented 1 year ago

Would love to see the table of timing changes

noughtmare commented 1 year ago

I've now reached the point that all benchmarks are (approximately) equal or faster, see https://github.com/haskell/vector/pull/448#issuecomment-1299110013.

lehins commented 1 year ago

@noughtmare That is all pretty great news.

If I understand correctly the proper speed up is only observed with the ghc patch

How does the regression look for all the older ghc versions. We do support all the way down to ghc-8.0

noughtmare commented 1 year ago

@lehins

If I understand correctly the proper speed up is only observed with the ghc patch

No, I've manually implemented the optimization that GHC would do with that patch.

For now I've tested with GHC 9.2.4. I'll do some more tests with other GHC versions, but I believe this only really requires join points (I don't know when those were introduced, but I think GHC 8 already had them).

lehins commented 1 year ago

Now I understand, that is great! Awesome work!

Let's make sure we don't mess up performance for anyone and I don't see a reason why we shouldn't get this change into vector.

sgraf812 commented 1 year ago

It's great to see that you can make a loopified definition work today with GHC, but do note that fusion will break across module boundaries. For example,

module A where
import Data.Stream
drop6 :: Stream a -> Stream a
drop6 xs = drop 6 xs

module B where

import A

main = print $ sum $ drop6 [1,2,3]

note that the stepper of drop6 won't be in a loopified form (because the local, non-recursive binding is probably immediately inlined).

This is kind of the same restriction that we have about foldr/build fusion at the moment, because we rely on compiler phases (too much).