haskell / core-libraries-committee

stimes for Endo #4

Closed treeowl closed 3 years ago

treeowl commented 3 years ago

Currently,

instance Semigroup (Endo a) where
    ...
    stimes = stimesMonoid

stimesMonoid produces a balanced tree. I guess this enhances sharing within the resulting function (if it's realized as a closure), but I'm wondering if it's really a good idea, especially when the argument is small.

Here's a simpler option to play with:

stimes n _ | n < 0 = error "Cannot apply a function a negative number of times."
stimes n0 e = go (fromIntegral n0)
  where
    go 0 = mempty
    go n = e <> go (n - 1 :: Word64)
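For intuition, with this definition the result is a right-nested chain, so (a hand reduction, assuming the usual Endo Semigroup and Monoid instances) the outermost application is reachable immediately:

--   appEndo (stimes 3 (Endo f)) x
-- = appEndo (Endo f <> (Endo f <> (Endo f <> mempty))) x
-- = f (f (f x))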

Another possibility might be to make a tree that only leans to the right. That would get immediate access to the leftmost (most important) nodes while maintaining sharing for the large case.

treeowl commented 3 years ago

For example, we could (sort of) borrow from Okasaki's random access lists, lazily converting the number to skew binary and tacking the bitCount extras on the front in unary.

treeowl commented 3 years ago

Heck, a balanced tree in pre-order should be better than the current leaf tree.

Bodigrim commented 3 years ago

@treeowl could you please elaborate on what exactly is suboptimal about the current approach, and what the alternatives would improve, and how?

treeowl commented 3 years ago

@Bodigrim, suppose I write appEndo (n `stimes` Endo f). This will produce something like this:

((f . f) . (f . f)) . ((f . f) . (f . f))

This is notionally a tree, but with shared subtrees. So the space usage is very good.

To actually make progress applying this to an argument, however, requires walking down the left spine of the tree to reach a leaf. This doesn't seem ideal; I would expect to reach the kth function call in O(k) time. The easiest way to achieve this is to give up on sharing. To achieve it with sharing seems considerably trickier, but achievable. So far I've only figured out how to do it (I think) with a sort of zeroless skew binary system, and I'm still working out some fenceposts. I can post the result when I'm done if you like.

treeowl commented 3 years ago

@Bodigrim, here's a "fancy" way. It will need some adjustments to support arbitrary Integral types if we decide to do something like this. (Basically, we'll need to come up with some fall-back for the case where the passed value is greater than maxBound :: Word64.)

import Data.Bits (popCount)
import Data.Semigroup (Endo(..))

stimesEndo :: Int -> Endo a -> Endo a
stimesEndo n0 x0 = case compare n0 0 of
   LT -> error "Negative multiplier."
   EQ -> mempty
   GT -> start (popCount n0) (n0 `quot` 2)
  where
    start 1 0 = x0
    start 0 1 = x0
    start 0 n = f x0 n
    start k n = x0 <> start (k - 1) n

    f acc 1 = acc
    f acc n
      | odd n = acc <> f acc' (n `quot` 2)
      | otherwise = acc <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc

    g acc 1 = x0
    g acc n
      | odd n = x0 <> f acc' (n `quot` 2)
      | otherwise = x0 <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc

Key points:

  1. The accumulator is always updated to x0 <> acc <> acc rather than to acc <> acc. This means that we never have to dig deeper to get the next function than we've "paid for" in an amortized sense.
  2. We produce function blobs even on 0 bits, for the same reason.

I would not be at all surprised if there were a simpler way; this is just what I was able to come up with.

treeowl commented 3 years ago

Ah, here's a simplification! It avoids the popCount complication by doing the bit counting along the way.

stimesEndo :: Integral n => n -> Endo a -> Endo a
stimesEndo n0 x0 = case compare n0 0 of
   LT -> error "Negative"
   EQ -> mempty
   GT
     | n0 == 1 -> x0
     | even n0 -> f x0 (n0 `quot` 2)
     | otherwise -> x0 <> f x0 (n0 `quot` 2)
  where
    f acc 1 = x0 <> acc
    f acc n
      | odd n = x0 <> acc <> f acc' (n `quot` 2)
      | otherwise = x0 <> acc <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc

    g acc 1 = x0
    g acc n
      | odd n = x0 <> f acc' (n `quot` 2)
      | otherwise = x0 <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc

treeowl commented 3 years ago

Informal tests suggest my version is around twice as fast as the current version, so it's not just a theoretical improvement.

treeowl commented 3 years ago

I've opened an MR for review.

treeowl commented 3 years ago

Oh duh... Never mind that. The simple way is obviously better. Here's a new MR.

cgibbard commented 3 years ago

It would be really interesting to see performance tests on this, but overall, this looks like a good idea to me. If we can beat the default, let's do it.

treeowl commented 3 years ago

@cgibbard , I just did some informal tests with Endo (1:) and Endo (1+) and found them pretty convincing. Give it a go!

Bodigrim commented 3 years ago

Let me write down my understanding of the proposal.

Semigroup.stimes is a short-cut for sconcat (replicate n a). The latter, implemented naively, takes O(n) time to produce a result.

stimesMonoid implements stimes n a = sconcat (replicate n a) by building a, a2 = a <> a, a4 = a2 <> a2, a8 = a4 <> a4, and concatenating intermediate values corresponding to non-zero binary digits of n. This usually gives us O(log n) complexity.
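Roughly, the strategy looks like this (a standalone sketch with a hypothetical name, not the exact base code):

-- Repeated squaring: square the accumulator at each step and multiply in a
-- copy for each set bit of n, for O(log n) uses of (<>).
stimesBySquaring :: (Integral b, Monoid m) => b -> m -> m
stimesBySquaring n0 a0
  | n0 < 0    = error "negative multiplier"
  | n0 == 0   = mempty
  | otherwise = go a0 n0
  where
    go a 1 = a
    go a n
      | even n    = go (a <> a) (n `quot` 2)
      | otherwise = a <> go (a <> a) (n `quot` 2)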

In the case of newtype Endo a = Endo { appEndo :: a -> a } we can indeed build a thunk, corresponding to stimes n f, in O(log n) time. However, we rarely wish to build such a thunk just to adorn our heap; most likely we intend to apply it to an argument. Unless f = const foo, evaluation of appEndo (stimes n f) x must invoke f repeatedly, n times. However, the binary tree built by stimesMonoid is likely to hide the exact nature of the computation from GHC and cause a high constant factor.

The proposal is to implement stimes for Endo not via stimesMonoid, but rather via a generalized version of stimesList. This has a two-fold advantage: the thunk itself is built in O(1), and its evaluation, while still O(n), is much friendlier to the GHC optimizer and runtime, likely leading to performance gains.

@treeowl, is my rendering correct? Could you please share results for Endo (1+) benchmarks?

treeowl commented 3 years ago

@Bodigrim, yes, that's correct. I don't have proper benchmarks to show; I could probably try to produce some this weekend. The current way seems to be somewhere around three-ish times slower than the simple way.

gwils commented 3 years ago

I'm interested to see these benchmarks too, but it looks fine to me. Let's grab performance wins when we can.

Bodigrim commented 3 years ago

I made a simplistic benchmark

module Main where

import Test.Tasty.Bench
import Data.Semigroup
import Data.Word

newStimes n _ | n < 0 = error "Cannot apply a function a negative number of times."
newStimes n0 e = go (fromIntegral n0)
  where
    go 0 = mempty
    go n = e <> go (n - 1 :: Word64)

main :: IO ()
main = defaultMain $ map mkGroup [1, 10, 100, 1000, 10000, 100000]

mkGroup :: Int -> Benchmark
mkGroup n = bgroup (show n)
  [ bench "current" $ nf (appEndo (stimes n (Endo (+1)))) (1 :: Int)
  , bcompare ("$NF == \"current\" && $(NF-1)==\"" ++ show n ++ "\"") $
    bench "proposed" $ nf (appEndo (newStimes n (Endo (+1)))) (1 :: Int)
  ]
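(For context: the harness is tasty-bench, presumably built with optimisations such as -O2. The bcompare pattern selects the "current" benchmark in the same group as a baseline, which is why each "proposed" timing below also reports a relative multiplier such as 0.18x.)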

Results are

All
  1
    current:  OK (1.33s)
      4.94 ns ± 448 ps
    proposed: OK (1.91s)
      7.13 ns ± 482 ps, 1.44x
  10
    current:  OK (2.04s)
      120  ns ± 9.0 ns
    proposed: OK (1.43s)
      21.1 ns ± 1.8 ns, 0.18x
  100
    current:  OK (2.79s)
      1.31 μs ±  66 ns
    proposed: OK (1.69s)
      201  ns ±  17 ns, 0.15x
  1000
    current:  OK (1.82s)
      13.8 μs ± 1.3 μs
    proposed: OK (1.66s)
      1.57 μs ± 139 ns, 0.11x
  10000
    current:  OK (1.76s)
      214  μs ±  18 μs
    proposed: OK (1.71s)
      52.0 μs ± 4.5 μs, 0.24x
  100000
    current:  OK (2.84s)
      5.56 ms ± 216 μs
    proposed: OK (2.85s)
      691  μs ±  68 μs, 0.12x

Could we improve the case of stimes 1?

cgibbard commented 3 years ago

What happens if you add a special case: go 1 = e?

treeowl commented 3 years ago

It's also worth testing with Endo (() :) and (for very large n) Endo (const ()).

Bodigrim commented 3 years ago

Feel free to benchmark, source code is all there.

cgibbard commented 3 years ago

Adding special cases to go doesn't work, but adding the branch at the top level does:

module Main where

import GHC.IO.Encoding
import Test.Tasty.Bench
import Data.Semigroup
import Data.Word

newStimes :: (Integral b, Monoid m) => b -> m -> m
newStimes n _ | n < 0 = error "Cannot apply a function a negative number of times."
newStimes 0 e = mempty
newStimes 1 e = e
newStimes n0 e = go (fromIntegral n0)
  where
    go 0 = mempty
    go n = e <> go (n - 1 :: Word64)

main :: IO ()
main = do
  setLocaleEncoding utf8
  defaultMain $ map mkGroup [0, 1, 10, 100, 1000, 10000, 100000]

mkGroup :: Int -> Benchmark
mkGroup n = bgroup (show n)
  [ bench "current" $ nf (appEndo (stimes n (Endo (+1)))) (1 :: Int)
  , bcompare ("$NF == \"current\" && $(NF-1)==\"" ++ show n ++ "\"") $
    bench "proposed" $ nf (appEndo (newStimes n (Endo (+1)))) (1 :: Int)
  ]

gives:

All
  0
    current:  OK (0.19s)
      5.5 ns ± 460 ps
    proposed: OK (0.19s)
      5.5 ns ± 434 ps, 1.00x
  1
    current:  OK (0.21s)
      6.2 ns ± 470 ps
    proposed: OK (0.21s)
      6.2 ns ± 416 ps, 1.00x
  10
    current:  OK (0.26s)
      122 ns ± 8.7 ns
    proposed: OK (0.25s)
       29 ns ± 2.2 ns, 0.24x
  100
    current:  OK (0.17s)
      1.3 μs ±  96 ns
    proposed: OK (0.25s)
      238 ns ±  11 ns, 0.19x
  1000
    current:  OK (0.24s)
       14 μs ± 688 ns
    proposed: OK (0.16s)
      2.4 μs ± 189 ns, 0.17x
  10000
    current:  OK (0.12s)
      222 μs ±  21 μs
    proposed: OK (0.11s)
       58 μs ± 5.7 μs, 0.26x
  100000
    current:  OK (0.18s)
      5.7 ms ± 516 μs
    proposed: OK (0.20s)
      799 μs ±  50 μs, 0.14x

All 14 tests passed (2.77s)

(Note that there was also a slowdown in the 0 case originally.)

treeowl commented 3 years ago

My current draft implementation is here: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863

It would probably make sense to benchmark that.

treeowl commented 3 years ago

The main special bit about my version is that stimes is guaranteed to have arity 3 even if GHC switches to -fpedantic-bottoms.

treeowl commented 3 years ago

I finally got https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863 to compile, so I'd appreciate reviews. Can anyone remind me how to add a perf test for GHC?

treeowl commented 3 years ago

@Bodigrim , since you asked, I'm specifically seeking approval for this: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863

treeowl commented 3 years ago

I just incorporated @cgibbard's adjustment.

Bodigrim commented 3 years ago

@treeowl the linked change does not compile.

treeowl commented 3 years ago

Yeah, minor error in an update; I'll fix it now.

treeowl commented 3 years ago

@Bodigrim, it should be fixed now. It was a very silly mistake.

treeowl commented 3 years ago

I just realized something: stimes for Endo is lousy when applied to strict functions. There are two ways to approach this:

  1. Use a different StrictEndo type (in base or not) with a different stimes implementation.
  2. Use rewrite rules to (try to) detect when stimes is applied to a strict function.

I'm leaning toward (1), but if we want to do (2), here's how it would work, roughly (I've actually made this sort of thing work before, but it's a bit fiddly):

Define

{-# NOINLINE knownBottom #-}
knownBottom :: a
knownBottom = errorWithoutStackTrace "This should never be reached"

Then add some rewrite rules:

{-# RULES
"stimesEndo" [~1] forall n e. stimesEndo n (Endo e) = stimesEndoMagic n e (e knownBottom)
"stimesEndoMagic" [~1] forall n e. stimesEndoMagic n e knownBottom = stimesEndo' n e
 #-}

stimesEndoMagic n e _ = stimesEndo n (Endo e)
{-# INLINE [1] stimesEndoMagic #-}

-- A version of stimesEndo for strict functions
stimesEndo' :: Int -> Endo a -> Endo a
stimesEndo' n0 (Endo e) = Endo (\a0 ->
  case n0 of
    _ | n0 < 0 -> errorWithoutStackTrace "negative"
    0 -> a0
    1 -> e a0
    n -> go n a0)
  where
    go 0 a = a
    go n a = go (n - 1) $! e a

Bodigrim commented 3 years ago

I certainly do not want to enshrine such black magic in base without a very strong justification, so I suggest we stick to the original scope.


According to https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863/diffs @treeowl seeks CLC's approval to change stimes for Endo a from stimesMonoid to

    stimes n0 (Endo e) = Endo (\a0 ->
      -- We check separately for 0 and 1 per
      -- https://github.com/haskell/core-libraries-committee/issues/4#issuecomment-955605592
      case n0 of
        _ | n0 < 0 -> stimesEndoError
        0 -> a0
        1 -> e a0
        _ -> go n0 a0)
      where
        go 0 a = a
        go n a = e (go (n - 1) a)

{-# NOINLINE stimesEndoError #-}
-- There's no reason to put this gunk in the unfolding.
stimesEndoError :: a
stimesEndoError = errorWithoutStackTrace "stimes (for Endo): negative multiplier"
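For illustration only (not part of the MR), the proposed instance is expected to behave like this:

--   appEndo (stimes (0 :: Int) (Endo (+1))) 1    == 1
--   appEndo (stimes (5 :: Int) (Endo (+1))) 0    == 5
--   appEndo (stimes (3 :: Int) (Endo ('x':))) "" == "xxx"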

(Other changes in the same MR are tracked in #8 and will require separate approval.)

@treeowl would you like to trigger vote immediately?

treeowl commented 3 years ago

@Bodigrim , yes please.

Bodigrim commented 3 years ago

Dear CLC members, please vote below on the proposal to change the implementation of stimes for Endo as per https://github.com/haskell/core-libraries-committee/issues/4#issuecomment-956596027.

CC @cgibbard @cigsender @gwils @chessai @emilypi

cgibbard commented 3 years ago

Did anyone rerun the benchmarks on the actual code? If it's an improvement or equivalent for the basic cases we tried above, you have my vote. I might be able to get around to it tonight if nobody else does.

With regard to strict functions, presumably the existing instance isn't any better? If the functions you're combining are all strict, you're basically asking for a giant pile of pattern matches to be put on the stack the moment you evaluate the application of that function. In most cases of that sort, I'd usually just switch from accumulating an Endo in some way, to accumulating the result of applying the functions instead. Using a different monoid whose operation isn't exactly ordinary composition, but builds in the $! is another option, but that's typically worse unless you need to apply the resulting function at inputs that are unknown at the outset (since you're still building a big function closure when you may not need to).
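For concreteness, such a strict-composition variant might look like this (a hypothetical StrictEndo sketch, not something in base):

-- Hypothetical strict variant: (<>) forces the intermediate result before
-- handing it to the next function.
newtype StrictEndo a = StrictEndo { appStrictEndo :: a -> a }

instance Semigroup (StrictEndo a) where
  StrictEndo f <> StrictEndo g = StrictEndo (\a -> f $! g a)

instance Monoid (StrictEndo a) where
  mempty = StrictEndo id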

treeowl commented 3 years ago

@cgibbard , it's not worse in that case; they're both bad. There's a nice way to do the accumulation: see #13.

mixphix commented 3 years ago

+1 for a well-scoped, well-argued, and well-solved proposal.

chessai commented 3 years ago

Maybe I'm missing something simple, but would we not want a bang on the Int argument in go?

I'm +1

treeowl commented 3 years ago

@chessai, we certainly could do that. Maybe we should. It won't matter once it specializes because == is always strict. If it doesn't specialize, everything will be slow anyway, but that might help a tad.
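Concretely, the suggestion amounts to something like this in the proposed go (a sketch; the surrounding module would need BangPatterns):

        go 0 a = a
        go !n a = e (go (n - 1) a)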

chessai commented 3 years ago

> @chessai, we certainly could do that. Maybe we should. It won't matter once it specializes because == is always strict. If it doesn't specialize, everything will be slow anyway, but that might help a tad.

I'd rather err on the side of caution and add a bang. It's low-cost and doesn't lose us anything, with only potential, albeit small, upside.

treeowl commented 3 years ago

Sure, I'm fine with that.

Bodigrim commented 3 years ago

Benchmarks on my machine are

All
  0
    current:  OK (8.54s)
      3.97 ns ± 104 ps
    proposed: OK (6.90s)
      6.46 ns ± 236 ps, 1.63x
  1
    current:  OK (11.50s)
      5.40 ns ± 142 ps
    proposed: OK (14.13s)
      6.65 ns ± 248 ps, 1.23x
  10
    current:  OK (8.73s)
      131  ns ± 3.6 ns
    proposed: OK (13.23s)
      24.8 ns ± 482 ps, 0.19x
  100
    current:  OK (24.93s)
      1.49 μs ± 8.2 ns
    proposed: OK (8.81s)
      132  ns ± 2.4 ns, 0.09x
  1000
    current:  OK (16.20s)
      15.5 μs ± 207 ns
    proposed: OK (4.17s)
      991  ns ±  32 ns, 0.06x
  10000
    current:  OK (7.89s)
      242  μs ± 4.6 μs
    proposed: OK (3.07s)
      46.6 μs ± 1.8 μs, 0.19x
  100000
    current:  OK (6.60s)
      6.50 ms ± 123 μs
    proposed: OK (4.91s)
      599  μs ±  16 μs, 0.09x

I think the slowdown for stimes 0 and stimes 1 is acceptable: it's a matter of nanoseconds and could very well be a side effect of the benchmark harness (https://github.com/haskell/core-libraries-committee/issues/4#issuecomment-955605592).

+1

@cgibbard what is your decision? Ping @emilypi @gwils

treeowl commented 3 years ago

@Bodigrim, when benchmarking the "current" version, I suggest you write a copy of Endo and give it a Semigroup instance with stimes = stimesMonoid. As discussed in #8, the actual current version is dealing with a jammed up optimizer along with a seemingly extraneous tree.

Bodigrim commented 3 years ago

@treeowl I'm already convinced and voted in favor, but if you wish to provide other benchmarks, feel free to share them.

treeowl commented 3 years ago

@Bodigrim, oh, my comment was confused.... I mixed up Endo with something else. I'm just waiting on approval on this and approval (or abstention) on #8 to proceed.

emilypi commented 3 years ago

Approved on my end

Bodigrim commented 3 years ago

This gives us 4 votes in favor out of 6, but I would like to give @gwils and @cgibbard an opportunity to reply until tomorrow.

gwils commented 3 years ago

+1, looks fine to me

Bodigrim commented 3 years ago

The proposal is approved by CLC, thanks @treeowl!

chshersh commented 1 year ago

I'm trying to summarise the state of this proposal as part of my volunteering effort to track the progress of all approved CLC proposals.

Field               Value
Author              @treeowl
Status              not merged
base version        Unknown
Merge Request (MR)  https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863
Blocked by          https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10113
CHANGELOG entry     missing
Migration guide     not needed

Please, let me know if you find any mistakes 🙂


@treeowl could you share a status update on the implementation and what the next steps are? Also, please do let CLC know if you need any help, so we can coordinate and prioritise approved proposals accordingly!

treeowl commented 1 year ago

Ah ... I initially implemented this in an overly-large GHC MR, which didn't get merged. @meooow has broken the larger part out of that MR; I should really submit a new one that just does the Endo bit.

Bodigrim commented 1 year ago

https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10113 has finally been merged. @treeowl could you possibly rebase https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863 just to land the changes for instance Semigroup (Endo a)?

treeowl commented 1 year ago

Many thanks to @meooow for pushing that through. Yes, I expect I can do the Endo bit this weekend.