Currently, `stimesMonoid` produces a balanced tree. I guess this enhances sharing within the resulting function (if it's realized as a closure), but I'm wondering if it's really a good idea, especially when the argument is small. Here's a simpler option to play with; see the sketch below.

Another possibility might be to make a tree that only leans to the right. That would get immediate access to the leftmost (most important) nodes while maintaining sharing for the large case.
For example, we could (sort of) borrow from Okasaki's random access lists, lazily converting the number to skew binary and tacking the `bitCount` extras on the front in unary. Heck, a balanced tree in pre-order should be better than the current leaf tree.
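The "simpler option" snippet itself isn't shown above; judging from the `newStimes` definition later in the thread, it was presumably plain right-nested repetition, along these lines (a reconstruction, with a made-up name, not the original code):

```haskell
import Data.Semigroup (Endo (..))

-- Plain right-nested repetition: no sharing, but each successive
-- application is immediately available when the result is applied.
stimesEndoSimple :: Int -> Endo a -> Endo a
stimesEndoSimple n e
  | n < 0 = error "stimesEndoSimple: negative multiplier"
  | otherwise = go n
  where
    go 0 = mempty
    go k = e <> go (k - 1)
```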
@treeowl could you please elaborate on what exactly is suboptimal about the current approach, and what would improve, and how, with the alternatives?
@Bodigrim, suppose I write ``appEndo (Endo f `stimes` n)``. This will produce something like this:

```haskell
((f . f) . (f . f)) . ((f . f) . (f . f))
```

This is notionally a tree, but with shared subtrees, so the space usage is very good. To actually make progress applying this to an argument, however, requires walking down the left spine of the tree to reach a leaf. That doesn't seem ideal; I would expect to reach the k-th function call in O(k) time. The easiest way to achieve this is to give up on sharing. To achieve it with sharing seems considerably trickier, but achievable. So far I've only figured out how to do it (I think) with a sort of zeroless skew binary system, and I'm still working out some fenceposts. I can post the result when I'm done if you like.
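To make the two shapes concrete, here is an illustrative comparison for n = 8 (the names and the fixed size are purely for exposition, not from the thread):

```haskell
import Data.Semigroup (Endo (..))

-- Doubling builds a balanced tree with shared subtrees: only O(log n)
-- distinct nodes, but applying the result means first walking down a
-- left spine of closures to reach a leaf.
eightShared :: Endo a -> Endo a
eightShared f =
  let f2 = f <> f
      f4 = f2 <> f2
  in f4 <> f4

-- A right-nested chain gives up sharing, but the k-th application is
-- reached after unwrapping only k compositions.
eightChain :: Endo a -> Endo a
eightChain f = f <> f <> f <> f <> f <> f <> f <> f
```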
@Bodigrim, here's a "fancy" way. It will need some adjustments to support arbitrary Integral
types if we decide to do something like this. (Basically, we'll need to come up with some fall-back for the case where the passed value is greater than maxBound :: Word64
.)
```haskell
import Data.Bits (popCount)

stimesEndo :: Int -> Endo a -> Endo a
stimesEndo n0 x0 = case compare n0 0 of
    LT -> error "Negative multiplier."
    EQ -> mempty
    GT -> start (popCount n0) (n0 `quot` 2)
  where
    start 1 0 = x0
    start 0 1 = x0
    start 0 n = f x0 n
    start k n = x0 <> start (k - 1) n

    f acc 1 = acc
    f acc n
      | odd n = acc <> f acc' (n `quot` 2)
      | otherwise = acc <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc

    g acc 1 = x0
    g acc n
      | odd n = x0 <> f acc' (n `quot` 2)
      | otherwise = x0 <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc
```
The key point: each doubling step appends to `x0 <> acc <> acc` rather than to `acc <> acc`. This means that we never have to dig deeper to get the next function than we've "paid for" in an amortized sense.

I would not be at all surprised if there were a simpler way; this is just what I was able to come up with.
Ah, here's a simplification! It avoids the `popCount` complications by doing that along the way.
```haskell
stimesEndo :: Integral n => n -> Endo a -> Endo a
stimesEndo n0 x0 = case compare n0 0 of
    LT -> error "Negative"
    EQ -> mempty
    GT
      | n0 == 1 -> x0
      | even n0 -> f x0 (n0 `quot` 2)
      | otherwise -> x0 <> f x0 (n0 `quot` 2)
  where
    f acc 1 = x0 <> acc
    f acc n
      | odd n = x0 <> acc <> f acc' (n `quot` 2)
      | otherwise = x0 <> acc <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc

    g acc 1 = x0
    g acc n
      | odd n = x0 <> f acc' (n `quot` 2)
      | otherwise = x0 <> acc <> g acc' (n `quot` 2)
      where
        acc' = x0 <> acc <> acc
```
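A hand-worked check of a small case shows the "pay as you go" structure (this trace is illustrative, not from the thread):

```haskell
-- stimesEndo 6 x0
--   = f x0 3                   -- 6 is even
--   = x0 <> x0 <> f acc' 1     -- where acc' = x0 <> x0 <> x0
--   = x0 <> x0 <> (x0 <> acc')
-- Six copies of x0 in total: the outermost ones are immediately
-- available, and the shared triple sits behind them.
```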
Informal tests suggest my version is around twice as fast as the current version, so it's not just a theoretical improvement.
I've opened an MR for review.
Oh duh... Never mind that. The simple way is obviously better. Here's a new MR.
It would be really interesting to see performance tests on this, but overall, this looks like a good idea to me. If we can beat the default, let's do it.
@cgibbard, I just did some informal tests with `Endo (1:)` and `Endo (1+)` and found them pretty convincing. Give it a go!
Let me write down my understanding of the proposal.

`Semigroup.stimes` is a shortcut for `sconcat (replicate n a)`. The latter, implemented naively, takes O(n) time to produce a result. `stimesMonoid` implements `stimes n a = sconcat (replicate n a)` by building `a`, `a2 = a <> a`, `a4 = a2 <> a2`, `a8 = a4 <> a4`, and so on, and concatenating the intermediate values corresponding to non-zero binary digits of `n`. This usually gives us O(log n) complexity.
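A minimal sketch of that doubling scheme (an illustration of the idea, not the exact definition in `base`):

```haskell
-- Multiply a monoid element via binary doubling: square the
-- accumulator at every bit of n, and emit it at every set bit.
stimesMonoidSketch :: (Integral b, Monoid m) => b -> m -> m
stimesMonoidSketch n0 a0
  | n0 < 0 = error "stimesMonoid: negative multiplier"
  | n0 == 0 = mempty
  | otherwise = go a0 n0
  where
    go a 1 = a
    go a n
      | even n = go (a <> a) (n `quot` 2)
      | otherwise = a <> go (a <> a) (n `quot` 2)
```

Each doubled value is shared, which is where the balanced, shared tree in the `Endo` case comes from.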
In the case of `newtype Endo a = Endo { appEndo :: a -> a }`, we can indeed build a thunk corresponding to `stimes n f` in O(log n) time. However, we rarely wish to build such a thunk just to adorn our heap; most likely we intend to apply it to an argument. Unless `f = const foo`, evaluation of `appEndo (stimes n f) x` must invoke `f` repeatedly, n times. Moreover, the binary tree built by `stimesMonoid` is likely to hide the exact nature of the computation from GHC and cause a high constant factor.

The proposal is to implement `stimes` for `Endo` not via `stimesMonoid`, but rather via a generalized version of `stimesList`. This has a two-fold advantage: the thunk itself is built in O(1), and its evaluation, while still O(n), is much friendlier to the GHC optimizer and runtime, which is likely to lead to performance gains.

@treeowl, is my rendering correct? Could you please share results for `Endo (1+)` benchmarks?
@Bodigrim, yes, that's correct. I don't have proper benchmarks to show; I could probably try to produce some this weekend. The current way seems to be somewhere around three-ish times slower than the simple way.
I'm interested to see these benchmarks too, but it looks fine to me. Let's grab performance wins when we can.
I made a simplistic benchmark:

```haskell
module Main where

import Test.Tasty.Bench
import Data.Semigroup
import Data.Word

newStimes n _ | n < 0 = error "Cannot apply a function a negative number of times."
newStimes n0 e = go (fromIntegral n0)
  where
    go 0 = mempty
    go n = e <> go (n - 1 :: Word64)

main :: IO ()
main = defaultMain $ map mkGroup [1, 10, 100, 1000, 10000, 100000]

mkGroup :: Int -> Benchmark
mkGroup n = bgroup (show n)
  [ bench "current" $ nf (appEndo (stimes n (Endo (+1)))) (1 :: Int)
  , bcompare ("$NF == \"current\" && $(NF-1)==\"" ++ show n ++ "\"") $
      bench "proposed" $ nf (appEndo (newStimes n (Endo (+1)))) (1 :: Int)
  ]
```
Results are:

```
All
  1
    current:  OK (1.33s)
      4.94 ns ± 448 ps
    proposed: OK (1.91s)
      7.13 ns ± 482 ps, 1.44x
  10
    current:  OK (2.04s)
      120 ns ± 9.0 ns
    proposed: OK (1.43s)
      21.1 ns ± 1.8 ns, 0.18x
  100
    current:  OK (2.79s)
      1.31 μs ± 66 ns
    proposed: OK (1.69s)
      201 ns ± 17 ns, 0.15x
  1000
    current:  OK (1.82s)
      13.8 μs ± 1.3 μs
    proposed: OK (1.66s)
      1.57 μs ± 139 ns, 0.11x
  10000
    current:  OK (1.76s)
      214 μs ± 18 μs
    proposed: OK (1.71s)
      52.0 μs ± 4.5 μs, 0.24x
  100000
    current:  OK (2.84s)
      5.56 ms ± 216 μs
    proposed: OK (2.85s)
      691 μs ± 68 μs, 0.12x
```
Could we improve the case of `stimes 1`?

What happens if you add a special case:

```haskell
go 1 = e
```

It's worth testing also with `Endo (() :)` and (for very large `n`) `Endo (const ())`.
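For instance, a quick probe of the `const` case might look like this (using the `newStimes` from the benchmark module above; the probe itself is only illustrative):

```haskell
-- Because the chain is right-nested, applying the composed function
-- hits the outermost `const ()` immediately; the rest of the chain
-- is never forced.
lazyProbe :: ()
lazyProbe = appEndo (newStimes (10 ^ (9 :: Int) :: Int) (Endo (const ()))) ()
```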
Feel free to benchmark, source code is all there.
Adding special cases to `go` doesn't work, but adding the branch at the top level does:

```haskell
module Main where

import GHC.IO.Encoding
import Test.Tasty.Bench
import Data.Semigroup
import Data.Word

newStimes :: (Integral b, Monoid m) => b -> m -> m
newStimes n _ | n < 0 = error "Cannot apply a function a negative number of times."
newStimes 0 _ = mempty
newStimes 1 e = e
newStimes n0 e = go (fromIntegral n0)
  where
    go 0 = mempty
    go n = e <> go (n - 1 :: Word64)

main :: IO ()
main = do
  setLocaleEncoding utf8
  defaultMain $ map mkGroup [0, 1, 10, 100, 1000, 10000, 100000]

mkGroup :: Int -> Benchmark
mkGroup n = bgroup (show n)
  [ bench "current" $ nf (appEndo (stimes n (Endo (+1)))) (1 :: Int)
  , bcompare ("$NF == \"current\" && $(NF-1)==\"" ++ show n ++ "\"") $
      bench "proposed" $ nf (appEndo (newStimes n (Endo (+1)))) (1 :: Int)
  ]
```
gives:

```
All
  0
    current:  OK (0.19s)
      5.5 ns ± 460 ps
    proposed: OK (0.19s)
      5.5 ns ± 434 ps, 1.00x
  1
    current:  OK (0.21s)
      6.2 ns ± 470 ps
    proposed: OK (0.21s)
      6.2 ns ± 416 ps, 1.00x
  10
    current:  OK (0.26s)
      122 ns ± 8.7 ns
    proposed: OK (0.25s)
      29 ns ± 2.2 ns, 0.24x
  100
    current:  OK (0.17s)
      1.3 μs ± 96 ns
    proposed: OK (0.25s)
      238 ns ± 11 ns, 0.19x
  1000
    current:  OK (0.24s)
      14 μs ± 688 ns
    proposed: OK (0.16s)
      2.4 μs ± 189 ns, 0.17x
  10000
    current:  OK (0.12s)
      222 μs ± 21 μs
    proposed: OK (0.11s)
      58 μs ± 5.7 μs, 0.26x
  100000
    current:  OK (0.18s)
      5.7 ms ± 516 μs
    proposed: OK (0.20s)
      799 μs ± 50 μs, 0.14x

All 14 tests passed (2.77s)
```

(Note that there was also a slowdown in the 0 case originally.)
My current draft implementation is here: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863. It would probably make sense to benchmark that.

The main special bit about my version is that `stimes` is guaranteed to have arity 3 even if GHC switches to `-fpedantic-bottoms`.
I finally got https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863 to compile, so I'd appreciate reviews. Can anyone remind me how to add a perf test for GHC?
@Bodigrim, since you asked, I'm specifically seeking approval for this: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863
I just incorporated @cgibbard's adjustment.
@treeowl the linked change does not compile.
Yeah, minor error in an update; I'll fix it now.
@Bodigrim, it should be fixed now. It was a very silly mistake.
I just realized something: `stimes` for `Endo` is lousy when applied to strict functions. There are two ways to approach this:

1. Add a `StrictEndo` type (in `base` or not) with a different `stimes` implementation.
2. Use rewrite rules to detect when `stimes` is applied to a strict function.

I'm leaning toward (1), but if we want to do (2), here's how it would work, roughly (I've actually made this sort of thing work before, but it's a bit fiddly):
Define:

```haskell
{-# NOINLINE knownBottom #-}
knownBottom :: a
knownBottom = errorWithoutStackTrace "This should never be reached"
```

Then add some rewrite rules:

```haskell
{-# RULES
"stimesEndo"      [~1] forall n e. stimesEndo n (Endo e) = stimesEndoMagic n e (e knownBottom)
"stimesEndoMagic" [~1] forall n e. stimesEndoMagic n e knownBottom = stimesEndo' n (Endo e)
  #-}

-- If the function is strict, (e knownBottom) reduces to knownBottom,
-- the second rule fires, and we end up in the strict version.
stimesEndoMagic :: Int -> (a -> a) -> a -> Endo a
stimesEndoMagic n e _ = stimesEndo n (Endo e)
{-# INLINE [1] stimesEndoMagic #-}

-- A version of stimesEndo for strict functions
stimesEndo' :: Int -> Endo a -> Endo a
stimesEndo' n0 (Endo e) = Endo $ \a0 -> case n0 of
  _ | n0 < 0 -> errorWithoutStackTrace "negative"
  0 -> a0
  1 -> e a0
  n -> go n a0
  where
    go 0 a = a
    go n a = go (n - 1) $! e a
```
I certainly do not want to enshrine such black magic in `base` without a very strong justification, so I suggest we stick to the original scope.
According to https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863/diffs, @treeowl seeks CLC's approval to change `stimes` for `Endo a` from `stimesMonoid` to:

```haskell
stimes n0 (Endo e) = Endo (\a0 ->
  -- We check separately for 0 and 1 per
  -- https://github.com/haskell/core-libraries-committee/issues/4#issuecomment-955605592
  case n0 of
    _ | n0 < 0 -> stimesEndoError
    0 -> a0
    1 -> e a0
    _ -> go n0 a0)
  where
    go 0 a = a
    go n a = e (go (n - 1) a)

{-# NOINLINE stimesEndoError #-}
-- There's no reason to put this gunk in the unfolding.
stimesEndoError :: a
stimesEndoError = errorWithoutStackTrace "stimes (for Endo): negative multiplier"
```
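As a hand-worked sanity check of the definition above (this trace is illustrative, not from the MR):

```haskell
--   appEndo (stimes 3 (Endo (+1))) 0
-- = go 3 0
-- = (+1) (go 2 0)
-- = (+1) ((+1) (go 1 0))
-- = (+1) ((+1) ((+1) (go 0 0)))
-- = (+1) ((+1) ((+1) 0))
-- = 3
```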
(Other changes in the same MR are tracked in #8 and will require a separate approval).
@treeowl would you like to trigger a vote immediately?
@Bodigrim , yes please.
Dear CLC members, please vote below on the proposal to change the implementation of `stimes` for `Endo` as per https://github.com/haskell/core-libraries-committee/issues/4#issuecomment-956596027.
CC @cgibbard @cigsender @gwils @chessai @emilypi
Did anyone rerun the benchmarks on the actual code? If it's an improvement or equivalent for the basic cases we tried above, you have my vote. I might be able to get around to it tonight if nobody else does.
With regard to strict functions, presumably the existing instance isn't any better? If the functions you're combining are all strict, you're basically asking for a giant pile of pattern matches to be put on the stack the moment you evaluate the application of that function. In most cases of that sort, I'd usually just switch from accumulating an `Endo` in some way to accumulating the result of applying the functions instead. Using a different monoid, whose operation isn't exactly ordinary composition but builds in the `$!`, is another option, but that's typically worse unless you need to apply the resulting function at inputs that are unknown at the outset (since you're still building a big function closure when you may not need to).
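For reference, the "different monoid that builds in the `$!`" could look like this (a sketch; it reuses the `StrictEndo` name floated earlier in the thread, but the instances are assumptions):

```haskell
newtype StrictEndo a = StrictEndo { appStrictEndo :: a -> a }

instance Semigroup (StrictEndo a) where
  -- Composition that forces each intermediate result.
  StrictEndo f <> StrictEndo g = StrictEndo (\a -> f $! g a)

instance Monoid (StrictEndo a) where
  mempty = StrictEndo id
```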
@cgibbard, it's not worse in that case; they're both bad. There's a nice way to do the accumulation: see #13.
+1 for a well-scoped, well-argued, and well-solved proposal.
Maybe I'm missing something simple, but would we not want a bang on the `Int` argument in `go`?
I'm +1
@chessai, we certainly could do that. Maybe we should. It won't matter once it specializes, because `==` is always strict. If it doesn't specialize, everything will be slow anyway, but that might help a tad.
I'd rather err on the side of caution and add a bang. It's low-cost and doesn't lose us anything, with only potential, albeit small, upside.
Sure, I'm fine with that.
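The tweak being agreed on is tiny. A sketch of the banged variant as a standalone function (assuming `BangPatterns`; the name is made up, and this is not the MR's exact code):

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.Semigroup (Endo (..))

stimesEndoBanged :: Integral b => b -> Endo a -> Endo a
stimesEndoBanged n0 (Endo e) = Endo (\a0 ->
  case n0 of
    _ | n0 < 0 -> error "stimes (for Endo): negative multiplier"
    0 -> a0
    1 -> e a0
    _ -> go n0 a0)
  where
    -- The bang keeps the counter strict even when the Integral
    -- dictionary isn't specialized away.
    go 0 a = a
    go !n a = e (go (n - 1) a)
```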
Benchmarks on my machine are:

```
All
  0
    current:  OK (8.54s)
      3.97 ns ± 104 ps
    proposed: OK (6.90s)
      6.46 ns ± 236 ps, 1.63x
  1
    current:  OK (11.50s)
      5.40 ns ± 142 ps
    proposed: OK (14.13s)
      6.65 ns ± 248 ps, 1.23x
  10
    current:  OK (8.73s)
      131 ns ± 3.6 ns
    proposed: OK (13.23s)
      24.8 ns ± 482 ps, 0.19x
  100
    current:  OK (24.93s)
      1.49 μs ± 8.2 ns
    proposed: OK (8.81s)
      132 ns ± 2.4 ns, 0.09x
  1000
    current:  OK (16.20s)
      15.5 μs ± 207 ns
    proposed: OK (4.17s)
      991 ns ± 32 ns, 0.06x
  10000
    current:  OK (7.89s)
      242 μs ± 4.6 μs
    proposed: OK (3.07s)
      46.6 μs ± 1.8 μs, 0.19x
  100000
    current:  OK (6.60s)
      6.50 ms ± 123 μs
    proposed: OK (4.91s)
      599 μs ± 16 μs, 0.09x
```
I think the slowdown for `stimes 0` and `stimes 1` is acceptable: it's a matter of nanoseconds, and could very well be a side effect of the benchmark harness (https://github.com/haskell/core-libraries-committee/issues/4#issuecomment-955605592).
+1
@cgibbard what is your decision? Ping @emilypi @gwils
@Bodigrim, when benchmarking the "current" version, I suggest you write a copy of `Endo` and give it a `Semigroup` instance with `stimes = stimesMonoid`. As discussed in #8, the actual current version is dealing with a jammed-up optimizer along with a seemingly extraneous tree.
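Concretely, such a benchmarking copy might look like this (a sketch; `OldEndo` is a made-up name):

```haskell
import Data.Semigroup (Semigroup (..), stimesMonoid)

-- A copy of Endo whose stimes goes through stimesMonoid, so the
-- "current" behaviour can be benchmarked against the new instance.
newtype OldEndo a = OldEndo { appOldEndo :: a -> a }

instance Semigroup (OldEndo a) where
  OldEndo f <> OldEndo g = OldEndo (f . g)
  stimes = stimesMonoid

instance Monoid (OldEndo a) where
  mempty = OldEndo id
```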
@treeowl I'm already convinced and voted in favor, but if you wish to provide other benchmarks, feel free to share them.
@Bodigrim, oh, my comment was confused... I mixed up `Endo` with something else. I'm just waiting on approval on this, and approval (or abstention) on #8, to proceed.
Approved on my end
This gives us 4 votes in favor out of 6, but I would like to give @gwils and @cgibbard an opportunity to reply until tomorrow.
+1, looks fine to me
The proposal is approved by CLC, thanks @treeowl!
I'm trying to summarise the state of this proposal as part of my volunteering effort to track the progress of all approved CLC proposals.

| Field | Value |
| --- | --- |
| Author | @treeowl |
| Status | not merged |
| `base` version | Unknown |
| Merge Request (MR) | https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863 |
| Blocked by | https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10113 |
| CHANGELOG entry | missing |
| Migration guide | not needed |

Please, let me know if you find any mistakes 🙂
@treeowl could you share a status update on the implementation and the next steps? Also, please do let CLC know if you need any help, so we can coordinate and prioritise approved proposals accordingly!
Ah ... I initially implemented this in an overly large GHC MR, which didn't get merged. @meooow has broken the larger part out of that MR; I should really submit a new one that just does the `Endo` bit.
https://gitlab.haskell.org/ghc/ghc/-/merge_requests/10113 has finally been merged. @treeowl could you possibly rebase https://gitlab.haskell.org/ghc/ghc/-/merge_requests/6863 just to land the changes for `instance Semigroup (Endo a)`?
Many thanks to @meooow for pushing that through. Yes, I expect I can do the `Endo` bit this weekend.