Open treeowl opened 1 year ago
For foldlU'
, this is probably hopeless. A decrementing (or otherwise rearranging) version will always allocate $O(n)$ space for itself, whereas the current version gets away with $O(\log n)$ space. The situation seems a bit different for foldrU
. Thanks to the lack of optimization, most interesting uses of it will allocate $O(n)$ anyway. There are some exceptions, though, such as
-- boo p = List.filter p . toListU
boo p = flip foldrU [] $ \ a r ->
if p a
then a : r
else r
So maybe this idea is no good. I'd love if GHC got better at optimizing what we do now....
Aha! I had an idea. I will benchmark now. This is structured a lot like our current thing, but with none of the typeclass stuff, so we can pull the function out.
data Ranky rk where
Oney :: Ranky (Succ Zero)
Succy :: !(Ranky rk) -> Ranky (Succ rk)
foldlU2F' :: forall a b. (b -> a -> b) -> BinomHeap a -> b -> b
foldlU2F' f = start
where
start :: BinomHeap a -> b -> b
start Nil !b = b
start (Skip ts) !b = go Oney ts b
start (Cons (BinomTree a ~Zero) ts) !b = go Oney ts (f b a)
go :: Ranky rk -> BinomForest rk a -> b -> b
go !_rky Nil b = b
go !rky (Skip ts) !b = go (Succy rky) ts b
go !rky (Cons t ts) !b = go (Succy rky) ts (goTree rky t b)
goTree :: Ranky rk -> BinomTree rk a -> b -> b
goTree rky (BinomTree a ts) !b = goTrees rky ts (f b a)
{-# INLINE goTree #-}
goTrees :: Ranky rk -> rk a -> b -> b
goTrees Oney (Succ (BinomTree a ~Zero) ~Zero) !b = f b a
goTrees (Succy rk) (Succ t ts) !b =
let !b' = goTree rk t b
in goTrees rk ts b'
If that code is too big, we can switch from Oney
to Zeroy
. The only problem there is that the pattern matches on Zeroy
won't reveal any actual elements, which seems wasteful. But maybe it's worth it to make the inlining situation better.....
The zero-based version:
data Ranky rk where
Zeroy :: Ranky Zero
Succy :: !(Ranky rk) -> Ranky (Succ rk)
foldlU2F' :: forall a b. (b -> a -> b) -> BinomHeap a -> b -> b
foldlU2F' f = go Zeroy
where
go :: Ranky rk -> BinomForest rk a -> b -> b
go !_rky Nil b = b
go !rky (Skip ts) !b = go (Succy rky) ts b
go !rky (Cons t ts) !b = go (Succy rky) ts (goTree rky t b)
goTree :: Ranky rk -> BinomTree rk a -> b -> b
goTree rky (BinomTree a ts) !b = goTrees rky ts (f b a)
{-# INLINE goTree #-}
goTrees :: Ranky rk -> rk a -> b -> b
goTrees Zeroy ~Zero b = b
goTrees (Succy rk) (Succ t ts) !b =
let !b' = goTree rk t b
in goTrees rk ts b'
{-# INLINE foldlU2F' #-}
My benchmarks suggest that for summing Int
heaps, either the zero-based or the one-based GADT is consistently a little better than the method we use now. One-based seems to be a little faster than zero-based in general, but that gets really inconsistent for large heaps for reasons I can't guess. I think for the moment I'll go for the zero-based, since it takes less code. Of course, if we go to one-based heaps in the end, the choice will go away.
Hmm.... The zero-based version might be worse than the current one when it's not INLINE
d, whereas the one-based version looks equivalent to the current one in that case. So one-based might be the way to go after all. One more thing we can do that seems to help is fake the GADT, using a Word
to represent the count and some nasty unsafeCoerce
based pattern synonyms to manufacture evidence. That makes the zero-based version just as good as the current version, and the one-based version better, even without INLINE
. Presumably that helps because it replaces pointer chasing with cheap arithmetic, but I can't say for sure.
For reference, here's the 0-based fake GADT I tried:
newtype Ranky (rk :: Type -> Type) = Ranky Word
data Ranky' rk where
FZeroy :: Ranky' Zero
FSuccy :: !(Ranky rk) -> Ranky' (Succ rk)
getRanky :: Ranky rk -> Ranky' rk
getRanky (Ranky 0) = unsafeCoerce FZeroy
getRanky (Ranky n) = unsafeCoerce (FSuccy (Ranky (n - 1)))
pattern Zeroy :: forall rk. () => rk ~ Zero => Ranky rk
pattern Zeroy <- (getRanky -> FZeroy)
where
Zeroy = Ranky 0
pattern Succy :: forall rk. () => forall rk'. rk ~ Succ rk' => Ranky rk' -> Ranky rk
pattern Succy rk <- (getRanky -> FSuccy rk)
where
Succy (Ranky n) = Ranky (n + 1)
{-# INLINE Zeroy #-}
{-# INLINE Succy #-}
{-# COMPLETE Zeroy, Succy #-}
The one-based one is essentially the same, but with zeroish things turned to oneish things.
I'm not a huge fan of the way we implement unordered folds. The compiler has no clue what they're up to, so optimization is garbage. I experimented with using an operation
which extracts the root of the first binomial tree, to implement
foldlU'
.Unfortunately, the performance was worse than what we do now. But I wonder if things might change if we work in bigger chunks. For starters, something like
Working with two elements at once should reduce the churn in the low bits, and might give this technique some hope.