Better unordered folds?

treeowl commented 1 year ago

I'm not a huge fan of the way we implement unordered folds. The compiler has no clue what they're up to, so optimization is garbage. I experimented with using an operation

viewU :: BinomHeap a -> Maybe (a, BinomHeap a)

which extracts the root of the first binomial tree, to implement foldlU'.

Unfortunately, the performance was worse than what we do now. But I wonder if things might change if we work in bigger chunks. For starters, something like

viewU2 :: BinomForest (Succ Zero) a -> Maybe (a, a, BinomForest (Succ Zero) a)

Working with two elements at once should reduce the churn in the low bits, and might give this technique some hope.

treeowl commented 1 year ago

For foldlU', this is probably hopeless. A decrementing (or otherwise rearranging) version will always allocate $O(n)$ space for itself, whereas the current version gets away with $O(\log n)$ space. The situation seems a bit different for foldrU. Thanks to the lack of optimization, most interesting uses of it will allocate $O(n)$ anyway. There are some exceptions, though, such as

-- boo p = List.filter p . toListU
boo p = flip foldrU [] $ \ a r ->
  if p a
  then a : r
  else r

So maybe this idea is no good. I'd love if GHC got better at optimizing what we do now....

treeowl commented 1 year ago

Aha! I had an idea. I will benchmark now. This is structured a lot like our current thing, but with none of the typeclass stuff, so we can pull the function out.

data Ranky rk where
  Oney :: Ranky (Succ Zero)
  Succy :: !(Ranky rk) -> Ranky (Succ rk)

foldlU2F' :: forall a b. (b -> a -> b) -> BinomHeap a -> b -> b
foldlU2F' f = start
  where
    start :: BinomHeap a -> b -> b
    start Nil !b = b
    start (Skip ts) !b = go Oney ts b
    start (Cons (BinomTree a ~Zero) ts) !b = go Oney ts (f b a)

    go :: Ranky rk -> BinomForest rk a -> b -> b
    go !_rky Nil b = b
    go !rky (Skip ts) !b = go (Succy rky) ts b
    go !rky (Cons t ts) !b = go (Succy rky) ts (goTree rky t b)

    goTree :: Ranky rk -> BinomTree rk a -> b -> b
    goTree rky (BinomTree a ts) !b = goTrees rky ts (f b a)
    {-# INLINE goTree #-}

    goTrees :: Ranky rk -> rk a -> b -> b
    goTrees Oney (Succ (BinomTree a ~Zero) ~Zero) !b = f b a
    goTrees (Succy rk) (Succ t ts) !b =
      let !b' = goTree rk t b
      in goTrees rk ts b'

treeowl commented 1 year ago

If that code is too big, we can switch from Oney to Zeroy. The only problem there is that the pattern matches on Zeroy won't reveal any actual elements, which seems wasteful. But maybe it's worth it to make the inlining situation better.....

treeowl commented 1 year ago

The zero-based version:

data Ranky rk where
  Zeroy :: Ranky Zero
  Succy :: !(Ranky rk) -> Ranky (Succ rk)

foldlU2F' :: forall a b. (b -> a -> b) -> BinomHeap a -> b -> b
foldlU2F' f = go Zeroy
  where
    go :: Ranky rk -> BinomForest rk a -> b -> b
    go !_rky Nil b = b
    go !rky (Skip ts) !b = go (Succy rky) ts b
    go !rky (Cons t ts) !b = go (Succy rky) ts (goTree rky t b)

    goTree :: Ranky rk -> BinomTree rk a -> b -> b
    goTree rky (BinomTree a ts) !b = goTrees rky ts (f b a)
    {-# INLINE goTree #-}

    goTrees :: Ranky rk -> rk a -> b -> b
    goTrees Zeroy ~Zero b = b
    goTrees (Succy rk) (Succ t ts) !b =
      let !b' = goTree rk t b
      in goTrees rk ts b'
{-# INLINE foldlU2F' #-}

My benchmarks suggest that for summing Int heaps, either the zero-based or the one-based GADT is consistently a little better than the method we use now. One-based seems to be a little faster than zero-based in general, but that gets really inconsistent for large heaps for reasons I can't guess. I think for the moment I'll go for the zero-based, since it takes less code. Of course, if we go to one-based heaps in the end, the choice will go away.

treeowl commented 1 year ago

Hmm.... The zero-based version might be worse than the current one when it's not INLINEd, whereas the one-based version looks equivalent to the current one in that case. So one-based might be the way to go after all. One more thing we can do that seems to help is fake the GADT, using a Word to represent the count and some nasty unsafeCoerce based pattern synonyms to manufacture evidence. That makes the zero-based version just as good as the current version, and the one-based version better, even without INLINE. Presumably that helps because it replaces pointer chasing with cheap arithmetic, but I can't say for sure.

treeowl commented 1 year ago

For reference, here's the 0-based fake GADT I tried:

newtype Ranky (rk :: Type -> Type) = Ranky Word

data Ranky' rk where
  FZeroy :: Ranky' Zero
  FSuccy :: !(Ranky rk) -> Ranky' (Succ rk)

getRanky :: Ranky rk -> Ranky' rk
getRanky (Ranky 0) = unsafeCoerce FZeroy
getRanky (Ranky n) = unsafeCoerce (FSuccy (Ranky (n - 1)))

pattern Zeroy :: forall rk. () => rk ~ Zero => Ranky rk
pattern Zeroy <- (getRanky -> FZeroy)
  where
    Zeroy = Ranky 0

pattern Succy :: forall rk. () => forall rk'. rk ~ Succ rk' => Ranky rk' -> Ranky rk
pattern Succy rk <- (getRanky -> FSuccy rk)
  where
    Succy (Ranky n) = Ranky (n + 1)

{-# INLINE Zeroy #-}
{-# INLINE Succy #-}
{-# COMPLETE Zeroy, Succy #-}

The one-based one is essentially the same, but with zeroish things turned to oneish things.

lspitzner / pqueue

Better unordered folds? #112