haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers
314 stars 177 forks source link

Cleanup after IntMap rewrite #698

Open treeowl opened 4 years ago

treeowl commented 4 years ago

This ticket summarizes things that should be done either before or soon after the enormous IntMap replacement in #340.

Before merge

After merge

Uncategorized

treeowl commented 4 years ago

@gereeter, @sjakobi, please let me know what should we added and what, in your opinions, should be done before and after the merge.

treeowl commented 4 years ago

We may want to revisit the recent decision to make minimum and maximum traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified. Alternatively, I guess we can probably be very clever and keep track of a minimum from the left (updated only when the next element is smaller) and a minimum from the right (updated when the previous element is not larger), and pick whichever is smallest in the end, preferring the one from the left in a tie.

product should definitely preserve the order of the elements in case the Num instance is a non-commutative ring. sum should probably preserve the order of the elements in case someone does something a bit weird, but I could be convinced to relax that.

treeowl commented 4 years ago

@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.

treeowl commented 4 years ago

I keep wondering whether making the tree go entirely in order is really the right way. The obvious alternative is for the order to alternate from level to level, so that keys go to (say) the left whenever they're closer to the bound stored in the current node, whether that's a lower bound or an upper bound.

That would (potentially) go along with alternating whether to flip the sign of the bound when storing it, and flipping the sign of the key on each step down. I don't know if that would actually work or not, but it strikes me as plausible.

Edit: the sign flipping would be done using complement, because -minBound = minBound. The idea is that k < max is the same as complement max < complement k.

sjakobi commented 4 years ago

We may want to revisit the recent decision to make minimum and maximum traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified.

My understanding is that the new IntMap doesn't suffer from the weird order reversal for negative keys that the old one had. The new haddocks on IntMap_ explain it quite clearly IMHO.

In consequence I believe that we don't lose anything by keeping minimum and maximum in-order.

treeowl commented 4 years ago

We may want to revisit the recent decision to make minimum and maximum traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified.

My understanding is that the new IntMap doesn't suffer from the weird order reversal for negative keys that the old one had. The new haddocks on IntMap_ explain it quite clearly IMHO.

In consequence I believe that we don't lose anything by keeping minimum and maximum in-order.

We don't get the weird order reversal of children, but we now have values in internal nodes, and their positions in the order depend on their level. To go strictly in order from a node whose bound is a maximum, we have to push the right subtree and the bound's associated value onto the stack before descending the left subtree. I don't know for sure if we can improve matters or not, but we might save a little by dealing with those values "as they come" to the extent we can. That surely applies to minimum and maximum, and I imagine it applies to sum and product too if we're careful....

gereeter commented 4 years ago

Generate a final benchmark comparison. I prefer to see these interleaved

My latest commits that have benchmark data just used the output of bench-cmp.pl. Is that fine?

Rewrite IntSet to match.

Tricky: I actually did this back when the work was out-of-tree, and although I'm sure with time (and newer GHC) one could press the optimizations more and make different design choices, it was slower than stock IntSet. Probably worth trying again, but there is a large space of choices to try out:

Check whether the restructuring in #658 is a good idea

And #653. I was putting it off as non-essential, but especially since fromList is one of only two regressions according to my up-to-date benchmarks (oddly, not insert even though fromList is just a loop over insert), I was keeping it in mind. But really, this can be post-merge.

Continue to add internal documentation. For example, isSubmapOf

The two big documentation blocks I was working to get done before merging were about deletions and merges, and isSubmapOf is basically a merge. Specific applications to specific functions can maybe wait until after merge, but those should probably be pre-merge.

We generally aim for 80 characters or fewer.

Aw, darn. I was wrapping my documentation at 100 characters. That at least is easy to fix. While there are definitely easy-to-wrap lines, I'm not so sure about some of the worst offenders. Maybe make an effort pre-merge but don't worry about it as a blocker?

Evaluate whether we should implement alter "by hand" as we used to. If not, then we should probably implement it using alterF @Identity.

I vaguely remember finding the trivial two-pass algorithm surprisingly faster that a specialized implementation? But it has been a while, and my memory is fuzzy. This will probably be pre-merge just because I'm curious now.

@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.

Jonathan "gereeter" S.

@gereeter, @sjakobi, please let me know what should [be] added

Pre-merge:

Post-merge:

gereeter commented 4 years ago

I don't know for sure if we can improve matters or not, but we might save a little by dealing with those values "as they come" to the extent we can.

That's why I implemented elem, although I suspect the win is greater in a case like that since you can terminate early as opposed to just having less on the stack. It is also kind of the reason why merge can be so much better than mergeA.

sjakobi commented 4 years ago

Additions for "before merge":


  • [ ] Check whether the restructuring in #658 is a good idea for the new representation and, if so, reinstate it.

Can someone estimate what kind of speedup we can expect from this? If it's, say, < 20%, I'd say it can wait.

  • [ ] Continue to add internal documentation. For example, isSubmapOf and the like are pretty confusing.

+0.5 for adding documentation before the merge – just in case @gereeter gets hit by a bus.

  • [ ] Shorten long lines when reasonable. We generally aim for 80 characters or fewer. There's flexibility for unusual situations, but this code has some very long lines that could be broken cleanly.

+0.1 for doing this before the merge in order to reduce noise in git blame. Not a blocker IMO.

  • [ ] Use liftA* in mergeA.

+0.1 as above

  • [ ] Evaluate whether we can get the same performance using foldMapDefault, fmapDefault, and equivalent tricks for keyed versions. That would cut down on source code.

This can easily be delayed IMHO

treeowl commented 4 years ago

Generate a final benchmark comparison. I prefer to see these interleaved

My latest commits that have benchmark data just used the output of bench-cmp.pl. Is that fine?

Yeah, that's fine.

Rewrite IntSet to match.

Tricky: I actually did this back when the work was out-of-tree, and although I'm sure with time (and newer GHC) one could press the optimizations more and make different design choices, it was slower than stock IntSet. Probably worth trying again, but there is a large space of choices to try out:

  • When a min or max shows up higher in the tree, should it be repeated in the bitmaps?
  • Should the bitmaps be aligned, representing series of integers with the same bit prefix, or should they start at the min/max for the node to not have waste zeros?
  • Is a singleton node represented as itself with two empty children or as a one-hot bitmap?
  • etc.

The most conservative approach, I think, is to match IntMap very directly, storing aligned bitmaps where IntMap stores values and using masked elements for keys. Did you try that version?

Check whether the restructuring in #658 is a good idea

And #653. I was putting it off as non-essential, but especially since fromList is one of only two regressions according to my up-to-date benchmarks (oddly, not insert even though fromList is just a loop over insert), I was keeping it in mind. But really, this can be post-merge.

653 isn't merged yet because it's not a pure win and I froze up trying to make a decision. Your fromList regression is rather surprising!

Continue to add internal documentation. For example, isSubmapOf

The two big documentation blocks I was working to get done before merging were about deletions and merges, and isSubmapOf is basically a merge. Specific applications to specific functions can maybe wait until after merge, but those should probably be pre-merge.

You've definitely gone a long way toward getting documentation in shape. Thanks.

We generally aim for 80 characters or fewer.

Aw, darn. I was wrapping my documentation at 100 characters. That at least is easy to fix. While there are definitely easy-to-wrap lines, I'm not so sure about some of the worst offenders. Maybe make an effort pre-merge but don't worry about it as a blocker?

Hard-to-wrap lines are definitely a low priority.

Evaluate whether we should implement alter "by hand" as we used to. If not, then we should probably implement it using alterF @Identity.

I vaguely remember finding the trivial two-pass algorithm surprisingly faster that a specialized implementation? But it has been a while, and my memory is fuzzy. This will probably be pre-merge just because I'm curious now.

The generic version will always be better in the "key absent, don't insert" case, because we don't have anything like a fast setjmp/longjmp. In other cases, specialized versions will likely be better.

@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.

Jonathan "gereeter" S.

@gereeter, @sjakobi, please let me know what should [be] added

I'll add those.

treeowl commented 4 years ago

Additions for "before merge":

  • Check that GHC can use the new IntMap without (undue) performance degradation. (I can trigger the build jobs for this)

Thanks.

  • [ ] Check whether the restructuring in #658 is a good idea for the new representation and, if so, reinstate it.

Can someone estimate what kind of speedup we can expect from this? If it's, say, < 20%, I'd say it can wait.

With the previous representation, the performance improvement was small, but the code was (in my opinion) much easier to understand.

  • [ ] Evaluate whether we can get the same performance using foldMapDefault, fmapDefault, and equivalent tricks for keyed versions. That would cut down on source code.

This can easily be delayed IMHO

Of course it can, but it shouldn't take long to check and reducing the enormous quantity of source code would be very nice.

treeowl commented 4 years ago

What if the stored bounds in nodes are all minima, and the passed in ones are all maxima, with complements being taken (of bounds and keys) as necessary to maintain this? Ignoring the top level for a moment, lookup would go like this:

We have the stored minimum, the passed maximum, and the key. We xor the key with each and compare, choosing the left child if closer to the minimum and vice versa. Now in the left child, we would naturally store a new maximum, but instead we store its complement as a new minimum, and pass down the complement of the minimum and of the key. In the right child, we store the new minimum and pass down the key and maximum unchanged.

Would that work, @gereeter? I'd really love to cut down on the code size here, but my intuition for the structure is not yet wonderful.

gereeter commented 4 years ago

Would that work, @gereeter?

Yes, technically, but I'd expect a large performance hit for needlessly complementing all over the place. And while I have the bias of understanding what is going on now, I think it would be a lot more confusing, too.

The biggest thing to do for cutting down on code size is unifying functions that do similar things (Lazy and Strict, variants of one function, etc.). With judicious INLINE, this doesn't affect things at all. If we need to deduplicate code for different types of nodes, the type class approach seems much more appropriate (with the two-parameter hack to avoid type families), and performance can be kept the same with a bunch of SPECIALIZE pragmas.

(On that note: local functions currently don't have type signatures for the most part because of the lack of ScopedTypeVariables. Is that something we can get away with using? I see that it is gated in Data.Sequence.Internal.)

my intuition for the structure is not yet wonderful.

Anything in particular? Any confusion is an opportunity for better documentation in my mind.

treeowl commented 4 years ago

Would that work, @gereeter?

Yes, technically, but I'd expect a large performance hit for needlessly complementing all over the place.

I'd be surprised. Basic operations like complement are usually extremely cheap; I'd expect their cost to be totally trivial compared to chasing pointers.

And while I have the bias of understanding what is going on now, I think it would be a lot more confusing, too.

That seems a much more likely problem. But the source duplication of the current scheme is also a potential source of errors and confusion. It seems to take some time to figure out what flops and how.

The biggest thing to do for cutting down on code size is unifying functions that do similar things (Lazy and Strict, variants of one function, etc.). With judicious INLINE, this doesn't affect things at all. If we need to deduplicate code for different types of nodes, the type class approach seems much more appropriate (with the two-parameter hack to avoid type families), and performance can be kept the same with a bunch of SPECIALIZE pragmas.

That certainly seems a promising approach to the source duplication, though it does nothing for object code size.

(On that note: local functions currently don't have type signatures for the most part because of the lack of ScopedTypeVariables. Is that something we can get away with using? I see that it is gated in Data.Sequence.Internal.)

Let's keep those gated for now … the current approach to scoped type variables is controversial, to say the least, with both Eisenberg and Kmett preferring a different sort. Again, a macro in containers.h is the way to do it.

my intuition for the structure is not yet wonderful.

Anything in particular? Any confusion is an opportunity for better documentation in my mind.

I think it could be helpful to have a few diagrams demonstrating basic operations, but those can be tough in text. Feel free to include something separate in HTML+SVG, LaTeX, or whatever, as long as it's not too big. But that can certainly wait till after the merge.

gereeter commented 4 years ago

Let's keep those gated for now

Ah, that could be a problem for reducing duplication. Without type signatures, GHC won't infer maximally polymorphic types for local definitions and we can't use SPECIALIZE pragmas, at least as far as I've figured out.

treeowl commented 4 years ago

You may need to tie it to the type family usage to avoid a nasty NoMonoLocalBinds. And you don't need SPECIALIZE in the fallback.

gereeter commented 4 years ago

Other random API "omission": isSubmapOfWithKey :: (Key -> a -> b -> Bool) -> IntMap a -> IntMap b -> Bool. I don't see a use for it, but it would match all the other WithKey functions. It is possible to emulate with mergeA and a constant Applicative, but as with a lot of cases like that, you want to short-circuit based on the uppermost values.

treeowl commented 4 years ago

Other random API "omission": isSubmapOfWithKey :: (Key -> a -> b -> Bool) -> IntMap a -> IntMap b -> Bool. I don't see a use for it, but it would match all the other WithKey functions. It is possible to emulate with mergeA and a constant Applicative, but as with a lot of cases like that, you want to short-circuit based on the uppermost values.

Yes, we should consider that, ideally with input from the libraries list. I wouldn't call it a priority myself.

treeowl commented 4 years ago

I'd like to clarify something: the reason I care about object code size is not that it takes up space on disk or in RAM, but that it takes up space in cache and involves more code in each operation. This can have important performance consequences. I played around a bit and got insert and lookup working with the alternative approach. I made a brief unsuccessful attempt at delete; I'll try again when I get a chance. With this approach, using a type family introduces constraints of the form t ~ Comp (Comp t) all over the place, so I used the two-variable approach instead.

-- @P C@ tags a value as representing itself;
-- @C P@ tags a value as representing its complement.
data P
data C

-- A key tagged to indicate whether it represents itself
-- or its complement.
newtype TKey t u = TKey {getKey :: Key}
  deriving (Eq, Ord)

-- Each node stores the minimum value of that subtree. That may actually be
-- a *maximum* when considering what's represented.
newtype IntMap a = IntMap (IntMap_ P C a) deriving (Eq, Show)
data IntMap_ t u a
  = NonEmpty {-# UNPACK #-} !(TKey t u) a !(Node u t a) | Empty deriving (Eq)
data Node t u a = Bin {-# UNPACK #-} !(TKey t u) a !(Node u t a) !(Node t u a) | Tip deriving (Eq)

deriving instance Show a => Show (IntMap_ P C a)
deriving instance Show a => Show (IntMap_ C P a)
deriving instance Show a => Show (Node P C a)
deriving instance Show a => Show (Node C P a)

-- This type is just used for convenience when defining the Show instances for TKey.
data KeyRep = PKey !Key | CKey !Key deriving Show
instance Show (TKey P C) where
  showsPrec p (TKey k) = showsPrec p (PKey k)
instance Show (TKey C P) where
  showsPrec p (TKey k) = showsPrec p (CKey (complement k))

-- | Take the complement of a key
compKey :: TKey t u -> TKey u t
compKey (TKey k) = TKey (complement k)

xor :: TKey t u -> TKey t u -> Word
xor (TKey k) (TKey b) = fromIntegral $ Bits.xor k b

lookup :: forall a. Key -> IntMap a -> Maybe a
lookup k0 (IntMap m0) = start (TKey k0) m0
  where
    start :: TKey t u -> IntMap_ t u a -> Maybe a
    start !k Empty = Nothing
    start k (NonEmpty min minV node)
        | k < min = Nothing
        | k == min = Just minV
        | otherwise = go (compKey k) (xor k min) node

    go :: TKey t u -> Word -> Node t u a -> Maybe a
    go !k !_ Tip = Nothing
    go k xorCache (Bin min minV l r)
        | k > min = if xorCache < xorCacheMin
                    then go k xorCache r
                    else go (compKey k) xorCacheMin l
        | k < min = Nothing
        | otherwise = Just minV
      where xorCacheMin = xor k min

insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v (IntMap m0) = IntMap $ start (TKey k0) m0
  where
    start :: TKey t u -> IntMap_ t u a -> IntMap_ t u a
    start k Empty = NonEmpty k v Tip
    start k (NonEmpty minK minV root)
      | k == minK = NonEmpty k v root
      | k > minK = NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) root
      | otherwise = NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root

    go :: Word -> TKey t u -> TKey t u -> Node t u a -> Node t u a
    go !xorCache !maxK !k Tip = Bin k v Tip Tip
    go xorCache maxK k (Bin minK minV l r)
      | minK < k =
         if xorCache < xorCacheMin
         then Bin minK minV l (go xorCache maxK k r)
         else Bin minK minV (go xorCacheMin (compKey minK) (compKey k) l) r
      | k < minK =
         if xor minK maxK < xorCacheMin
         then Bin k v Tip (Bin minK minV l r)
         else Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
      | otherwise = Bin minK v l r
     where
        xorCacheMin :: Word
        xorCacheMin = xor k minK

insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN !xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
  | xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
  | otherwise = Bin minK minV l (insertMaxN xorcache k v r)

The traverseWithKey function I wrote still uses alternating helpers. I think it would be possible to avoid this with an extra conditional; I don't know if that would hurt performance.

traverseWithKey
  :: forall f a b. Applicative f
  => (Key -> a -> f b) -> IntMap a -> f (IntMap b)
traverseWithKey _ (IntMap Empty) = pure empty
traverseWithKey f (IntMap (NonEmpty k v root)) =
  liftA2 (\v' root' -> IntMap (NonEmpty k v' root')) (f (getKey k) v) (goC root)
  where
    goC :: Node C P a -> f (Node C P b)
    goC Tip = pure Tip
    goC (Bin k v l r) = liftA3 (\r' l' v' -> Bin k v' l' r') (goC r) (goP l) (f (complement (getKey k)) v)

    goP :: Node P C a -> f (Node P C b)
    goP Tip = pure Tip
    goP (Bin k v l r) = liftA3 (Bin k) (f (getKey k) v) (goC l) (goP r)
treeowl commented 4 years ago

I can't figure out what LLVM is doing with this code.... I wonder if it actually expands it to yours, suggesting maybe I'm wasting my time. But benchmarking will tell, some day, maybe.

sjakobi commented 4 years ago

@treeowl Could you explain a bit how you can use complement to track minima and maxima?

Also, have you actually tested your code? I'm just stumped how it could possibly work.

treeowl commented 4 years ago

@sjakobi insert and lookup are well tested. I'll upload my validity checker later. I still don't understand how deletion works in @gereeter's code, so I couldn't get that working. I don't use complement to track minima and maxima. It's a trick that means we don't have to track them in all the same cases, and I think we need to do so in fewer overall, but that could be wrong.

The general gist is that we always store the minimum in each node, but that may ultimately represent its complement, depending on how many left branches we've taken (with the initial jump down considered a left branch).

treeowl commented 4 years ago

@sjakobi, here's my draft validity tester:

valid :: (Eq a, Show a) => IntMap a -> Property
valid m0@(IntMap m0_) =
   -- This will catch errors very reliably, but it's coarse-grained
   fromList (toList m0) === m0 .&&.
   -- This catches errors in a fine-grained way, but relies on being implemented
   -- right.
   start m0_
  where
    start :: IntMap_ P C a -> Property
    start Empty = property ()
    start (NonEmpty _minK _minV Tip) = property ()
    start (NonEmpty minK _minV (Bin maxK _maxV l r)) =
      go maxK (compKey minK) r .&&.
      go minK (compKey maxK) l

    go :: TKey t u -> TKey t u -> Node t u a -> Property
    go xmin xmax _
      | xmin >= xmax = counterexample "max not greater" False
    go _ _ Tip = property ()
    go xmin xmax (Bin minK _minV l r) =
      xor xmin minK > xor xmax minK .&&.
      counterexample "not between" (xmin < minK .&&. minK < xmax) .&&.
        go (compKey xmax) (compKey minK) l .&&.
        go minK xmax r
treeowl commented 4 years ago

Thanks to @gereeter's recent cleanups and explanations, I was finally able to adapt delete to the complementy version and get it to pass the tests.

Complementy version of compareMinBound:

{-# INLINE compareMinBound #-}
compareMinBound :: TKey t u -> TKey t u -> BoundOrdering
compareMinBound k min
    | k > min = InBound
    | k < min = OutOfBound
    | otherwise = Matched

data BoundOrdering = InBound | OutOfBound | Matched deriving (Eq)

Complementy deletion:

delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
    InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
    OutOfBound -> m
    Matched -> IntMap (nodeToCompMap root)
  where k = TKey k0

deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
    InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
            | otherwise              -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
    OutOfBound -> n
    Matched -> extractBinR l r
  where xorCacheMin = xor k min

extractBinR :: Node u t a -> Node t u a -> Node t u a
extractBinR Tip r = r
extractBinR (Bin max maxV innerL innerR) r =
    let NE min minV l = deleteMaxNode max maxV innerL innerR
    in Bin (compKey min) minV l r

deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV Tip Tip = NE min minV Tip
deleteMaxNode !min minV (Bin max maxV l r) Tip = NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
    let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
    in  NE max maxV (Bin min minV l inner)

nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
    let NE max maxV r = deleteMaxNode min minV innerL innerR
    in NonEmpty (compKey max) maxV r

data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)

That's just about half as much code.

gereeter commented 4 years ago

Quick benchmarks of the complementing code compared to not:

Benchmark     Runtime change  Original runtime
lookup hit     +13.45%        2.16e-04
lookup miss    +15.59%        2.28e-04
insert empty   +22.15%        2.39e-04
delete hit     +35.74%        1.87e-04
delete miss     +4.74%        7.49e-04
fromList        -1.39%        2.96e-04

Minimum         -1.39%        
Average        +14.43%        
Maximum        +35.74%

I just copied your code into a project, added fromList, and pruned the benchmarks that weren't implemented yet. I didn't make any huge effort to remove external interference, and of course things improve over time, so take these numbers with a bucket of salt.

sjakobi commented 4 years ago

Quick benchmarks of the complementing code compared to not:

The "original" is your code for #340, right?

gereeter commented 4 years ago

Yes.

treeowl commented 4 years ago

Do you think you could upload whatever you used to run that benchmark so I can play with it?

treeowl commented 4 years ago

I find the size of these differences you report rather shocking, especially for deletion hit.

gereeter commented 4 years ago
src/IntMap.hs ```haskell {-# LANGUAGE BangPatterns, RankNTypes, StandaloneDeriving, FlexibleInstances, ScopedTypeVariables #-} module IntMap where import Data.Bits (complement) import qualified Data.Bits as Bits import qualified Data.List as List import Control.DeepSeq (NFData(..)) type Key = Int -- @P C@ tags a value as representing itself; -- @C P@ tags a value as representing its complement. data P data C -- A key tagged to indicate whether it represents itself -- or its complement. newtype TKey t u = TKey {getKey :: Key} deriving (Eq, Ord) -- Each node stores the minimum value of that subtree. That may actually be -- a *maximum* when considering what's represented. newtype IntMap a = IntMap (IntMap_ P C a) deriving (Eq, Show) data IntMap_ t u a = NonEmpty {-# UNPACK #-} !(TKey t u) a !(Node u t a) | Empty deriving (Eq) data Node t u a = Bin {-# UNPACK #-} !(TKey t u) a !(Node u t a) !(Node t u a) | Tip deriving (Eq) data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a) deriving instance Show a => Show (IntMap_ P C a) deriving instance Show a => Show (IntMap_ C P a) deriving instance Show a => Show (Node P C a) deriving instance Show a => Show (Node C P a) instance NFData a => NFData (IntMap a) where rnf (IntMap m) = rnf m instance NFData a => NFData (IntMap_ t u a) where rnf Empty = () rnf (NonEmpty _ v root) = rnf v `seq` rnf root instance NFData a => NFData (Node t u a) where rnf Tip = () rnf (Bin _ v l r) = rnf v `seq` rnf l `seq` rnf r -- This type is just used for convenience when defining the Show instances for TKey. data KeyRep = PKey !Key | CKey !Key deriving Show instance Show (TKey P C) where showsPrec p (TKey k) = showsPrec p (PKey k) instance Show (TKey C P) where showsPrec p (TKey k) = showsPrec p (CKey (complement k)) -- | Take the complement of a key compKey :: TKey t u -> TKey u t compKey (TKey k) = TKey (complement k) xor :: TKey t u -> TKey t u -> Word xor (TKey k) (TKey b) = fromIntegral $ Bits.xor k b empty = IntMap Empty fromList :: [(Key, a)] -> IntMap a fromList = List.foldl' (flip (uncurry insert)) (IntMap Empty) fromAscList = fromList fromDistinctAscList = fromList lookup :: forall a. Key -> IntMap a -> Maybe a lookup k0 (IntMap m0) = start (TKey k0) m0 where start :: TKey t u -> IntMap_ t u a -> Maybe a start !k Empty = Nothing start k (NonEmpty min minV node) | k < min = Nothing | k == min = Just minV | otherwise = go (compKey k) (xor k min) node go :: TKey t u -> Word -> Node t u a -> Maybe a go !k !_ Tip = Nothing go k xorCache (Bin min minV l r) | k > min = if xorCache < xorCacheMin then go k xorCache r else go (compKey k) xorCacheMin l | k < min = Nothing | otherwise = Just minV where xorCacheMin = xor k min insert :: forall a. Key -> a -> IntMap a -> IntMap a insert k0 v (IntMap m0) = IntMap $ start (TKey k0) m0 where start :: TKey t u -> IntMap_ t u a -> IntMap_ t u a start k Empty = NonEmpty k v Tip start k (NonEmpty minK minV root) | k == minK = NonEmpty k v root | k > minK = NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) root | otherwise = NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root go :: Word -> TKey t u -> TKey t u -> Node t u a -> Node t u a go !xorCache !maxK !k Tip = Bin k v Tip Tip go xorCache maxK k (Bin minK minV l r) | minK < k = if xorCache < xorCacheMin then Bin minK minV l (go xorCache maxK k r) else Bin minK minV (go xorCacheMin (compKey minK) (compKey k) l) r | k < minK = if xor minK maxK < xorCacheMin then Bin k v Tip (Bin minK minV l r) else Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r | otherwise = Bin minK v l r where xorCacheMin :: Word xorCacheMin = xor k minK insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a insertMaxN !xorcache k v Tip = Bin k v Tip Tip insertMaxN xorcache k v (Bin minK minV l r) | xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip | otherwise = Bin minK minV l (insertMaxN xorcache k v r) {-# INLINE compareMinBound #-} compareMinBound :: TKey t u -> TKey t u -> BoundOrdering compareMinBound k min | k > min = InBound | k < min = OutOfBound | otherwise = Matched data BoundOrdering = InBound | OutOfBound | Matched deriving (Eq) delete :: forall a. Key -> IntMap a -> IntMap a delete !_ (IntMap Empty) = IntMap Empty delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root)) OutOfBound -> m Matched -> IntMap (nodeToCompMap root) where k = TKey k0 deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a deleteNode !_ !_ Tip = Tip deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r) | otherwise -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r OutOfBound -> n Matched -> extractBinR l r where xorCacheMin = xor k min extractBinR :: Node u t a -> Node t u a -> Node t u a extractBinR Tip r = r extractBinR (Bin max maxV innerL innerR) r = let NE min minV l = deleteMaxNode max maxV innerL innerR in Bin (compKey min) minV l r deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a deleteMaxNode !min minV Tip Tip = NE min minV Tip deleteMaxNode !min minV (Bin max maxV l r) Tip = NE (compKey max) maxV (Bin min minV r l) deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) = let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR in NE max maxV (Bin min minV l inner) nodeToCompMap :: Node t u a -> IntMap_ u t a nodeToCompMap Tip = Empty nodeToCompMap (Bin min minV innerL innerR) = let NE max maxV r = deleteMaxNode min minV innerL innerR in NonEmpty (compKey max) maxV r ```
app/Main.hs ```haskell module Main where import Control.DeepSeq (rnf) import Control.Exception (evaluate) import Gauge (bench, defaultMain, whnf) import Data.List (foldl') import qualified IntMap as M import Data.Maybe (fromMaybe) import Prelude hiding (lookup) main = do let m = M.fromAscList elems :: M.IntMap Int evaluate $ rnf [m] evaluate $ rnf missKeys defaultMain [ bench "lookup hit" $ whnf (lookup keys) m , bench "lookup miss" $ whnf (lookup missKeys) m , bench "insert empty" $ whnf (ins elems) M.empty , bench "delete hit" $ whnf (del keys) m , bench "delete miss" $ whnf (del missKeys) m , bench "fromList" $ whnf M.fromList elems , bench "fromAscList" $ whnf M.fromAscList elems , bench "fromDistinctAscList" $ whnf M.fromDistinctAscList elems ] where elems = zip keys values keys = [1,3..2^13] missKeys = [0,2..2^13] values = [1,3..2^13] sum k v1 v2 = k + v1 + v2 consPairL xs k v = (k, v) : xs consPairR k v xs = (k, v) : xs add3 :: Int -> Int -> Int -> Int add3 x y z = x + y + z {-# INLINE add3 #-} lookup :: [Int] -> M.IntMap Int -> Int lookup xs m = foldl' (\n k -> fromMaybe n (M.lookup k m)) 0 xs ins :: [(Int, Int)] -> M.IntMap Int -> M.IntMap Int ins xs m = foldl' (\m (k, v) -> M.insert k v m) m xs data PairS a b = PS !a !b del :: [Int] -> M.IntMap Int -> M.IntMap Int del xs m = foldl' (\m k -> M.delete k m) m xs maybeDel :: Int -> Maybe Int maybeDel n | n `mod` 3 == 0 = Nothing | otherwise = Just n ```
Bench.cabal ```haskell cabal-version: 1.12 name: Bench version: 0.1.0.0 build-type: Simple library exposed-modules: IntMap hs-source-dirs: src ghc-options: -O2 build-depends: base >=4.7 && <5 , deepseq default-language: Haskell2010 executable Bench-exe main-is: Main.hs hs-source-dirs: app ghc-options: -O2 build-depends: Bench , base >=4.7 && <5 , deepseq , gauge default-language: Haskell2010 ```
stack.yaml ``` resolver: ghc-8.8.1 packages: - . extra-deps: - gauge-0.2.5@sha256:8d60450bdec985c146d5632d4f5a8d60cec27d71ba6787ed1ee64d945d4c7c33,3923 - basement-0.0.11@sha256:af43e2e334e515b52ca309919b135c51b5e9411e6d4c68d0e8950d61eb5f25d1,5711 - vector-0.12.0.3@sha256:1422b0bcf4e7675116ca8d9f473bf239850c58c4518a56010e3bfebeac345ace,7171 - primitive-0.7.0.0@sha256:ee352d97cc390d8513530029a2213a95c179a66c406906b18d03702c1d9ab8e5,3416 ```
treeowl commented 4 years ago

Thanks! That's very helpful. I'm going to play with it a bit later. One thing I want to do is see how the native code generator compares to LLVM for both implementations. A very brief look earlier suggested they were doing somewhat different things with the complements, but I don't understand what LLVM was doing.

treeowl commented 4 years ago

I've improved the complementy code somewhat, but I still can't match yours, particularly for lookup. It may just fundamentally be slower. Oh well. The improvements to insertion and deletion, in case someone wants to swing around some day and try some other magic:

insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v0 (IntMap m0) = IntMap $ start (TKey k0) v0 m0
  where
    start :: TKey t u -> a -> IntMap_ t u a -> IntMap_ t u a
    start !k v Empty = NonEmpty k v Tip
    start !k v (NonEmpty minK minV root) = case compareMinBound k minK of
      Matched -> NonEmpty k v root
      InBound -> NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) v root
      OutOfBound -> NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root

    go :: Word -> TKey t u -> TKey t u -> a -> Node t u a -> Node t u a
    go !_xorCache !_maxK !k v Tip = Bin k v Tip Tip
    go xorCache maxK k v (Bin minK minV l r) = case compareMinBound k minK of
      InBound
         | xorCache < xorCacheMin
         -> Bin minK minV l (go xorCache maxK k v r)
         | otherwise
         -> Bin minK minV (go xorCacheMin (compKey minK) (compKey k) v l) r
      OutOfBound
         | xor minK maxK < xorCacheMin
         -> Bin k v Tip (Bin minK minV l r)
         | otherwise
         -> Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
      Matched -> Bin minK v l r
     where
        xorCacheMin :: Word
        xorCacheMin = xor k minK

insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
  | xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
  | otherwise = Bin minK minV l (insertMaxN xorcache k v r)

delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
    InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
    OutOfBound -> m
    Matched -> IntMap (nodeToCompMap root)
  where k = TKey k0

deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
    InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
            | otherwise              -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
    OutOfBound -> n
    Matched -> extractBin l r
  where xorCacheMin = xor k min

extractBin :: Node u t a -> Node t u a -> Node t u a
extractBin Tip r = r
extractBin (Bin max maxV innerL innerR) r =
    let NE min minV l = deleteMaxNode max maxV innerL innerR
    in Bin (compKey min) minV l r

deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV m1 Tip = case m1 of
  Tip -> NE min minV Tip
  Bin max maxV l r -> NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
    let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
    in  NE max maxV (Bin min minV l inner)

nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
    let NE max maxV r = deleteMaxNode min minV innerL innerR
    in NonEmpty (compKey max) maxV r

data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)
sjakobi commented 4 years ago

I wonder whether the smaller, complement-based code could perform better than the code from #340 in "real" applications as opposed to micro-benchmarks. In micro-benchmarks, I'd suspect that the repeated application of the same functions causes even the huge functions from #340 to stay cached. This is pure speculation though, and I'm not very experienced with low-level performance things.

Unfortunately I haven't found any applications that make heavy use of IntMap and would be easier to benchmark and profile than GHC so far. The Clash compiler relies heavily on IntMap but it also depends on the ghc library which makes building it with a modified containers tricky.

Another reason why I've been slightly worried about #340 is that the increased amount of code ultimately also offers more potential for (performance) bugs.

I'm also not a big fan of the tendency to reduce lines of code via extra function arguments that must be inlined by the compiler for proper performance. IMHO that makes the code less readable and more fickle with regards to performance and compiler changes…

treeowl commented 4 years ago

@sjakobi the extra arguments are indeed unpleasant, though I'm not terribly worried about fragility in most cases. I really don't like having any more source code duplication than necessary. What we'd really like is probably something like

lookup# :: Key -> IntMap a -> (# (##) | a #)
lookup k m = case lookup# k m of
  (# | a #) -> Just a
  _ -> Nothing
{-# INLINE lookup #-}

member k m = case lookup# k m of
  (# | _ #) -> True
  _ -> False

But the compat story is something of a mess and will surely involve double-barreled continuations anyway. We still might want to do it though.

treeowl commented 4 years ago

I just had a radically simple idea that definitely has its own trade-offs but might be worth considering if it can actually be implemented easily.

data IntMap a
  = Bin
     { _minK :: !Int
       _minV :: a
       _lft :: !(IntMap a)
       _maxK :: !Int
       _maxV :: a
       _rght :: !(IntMap a) }
  | Tip
  | Single !Int a

This is quite an enormous Bin node, which is a bit scary, but it seems kind of nice in some ways:

  1. No more mess of alternating.
  2. No more special monkey business at the root.
  3. No more implicit information flowing down.
  4. Even more compact representation.
jwaldmann commented 4 years ago

haven't found any applications that make heavy use of IntMap

I made this a while ago: https://gitlab.imn.htwk-leipzig.de/waldmann/containers-benchmark

sjakobi commented 4 years ago

@treeowl Could you demonstrate a bit how this structure works? How do minimum and maximum keys relate to those in the children?


@jwaldmann That looks very useful! We should try to get that into the containers source tree somehow.

To build this with a custom containers, I added this cabal.project:

packages: ., ../containers/containers

and ran

cabal test -O2

(The benchmark executable is the main testsuite)

Comparing 205e2a8591e57ee465bcc170b107a2463740396f (the current HEAD commit in #340) against #340's base commit f7e27e6e, the new IntMap seems slightly slower: On a quiet machine I'm measuring 3.5s vs. 3.6s (using bench to produce timings).

Profiling seems to put the blame mostly on Data.IntMap.Lazy.{union,difference}With:

image

Interestingly, allocations for unionWith seem to have decreased remarkably!

The profiles are attached: base-O2.prof.txt, gereeter-O2.prof.txt.

jwaldmann commented 4 years ago

Despite the name, this is also testing IntMap: https://github.com/jwaldmann/containers/blob/intset%3Dword/containers-tests/benchmarks/IntSet.hs#L53

sjakobi commented 4 years ago

@jwaldmann This also looks very interesting. It would be very nice if you could make a PR for your benchmark patches!

treeowl commented 4 years ago

It occurs to me that fgl may also be a good place to look for benchmarks.

sjakobi commented 4 years ago

It occurs to me that fgl may also be a good place to look for benchmarks.

Maybe. I don't see any benchmarks in the repo though.

sjakobi commented 4 years ago

It occurs to me that fgl may also be a good place to look for benchmarks.

Maybe. I don't see any benchmarks in the repo though.

rdf relies on fgl and has benchmarks… might be worth a spin…

jwaldmann commented 4 years ago

on using fgl via rdf: yes - but in their current state, rdf benchmarks take ages, because of[500,1000..100000] in https://github.com/TravisWhitaker/rdf/blob/master/bench/Main.hs#L42 . I replaced this list with map (10^) [3..5] and got

        Tue Jan 21 11:37 2020 Time and Allocation Profiling Report  (Final)

           bench-rdf-prof +RTS -p -s -h -RTS

        total time  =       59.60 secs   (59600 ticks @ 1000 us, 1 processor)
        total alloc = 75,388,764,824 bytes  (excludes profiling overheads)

COST CENTRE         MODULE                      SRC                                                  %time %alloc

encodeIRI           Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(55,1)-(63,25)         15.3   19.5
parseNQuads         Data.RDF.Parser.NQuads      src/Data/RDF/Parser/NQuads.hs:(66,1)-(68,22)          12.3   15.6
option              Data.Attoparsec.Combinator  Data/Attoparsec/Combinator.hs:90:1-25                 11.5   10.1
maybeBuilder        Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(46,1)-(49,58)          9.5    4.6
parseHost           Data.RDF.Internal           src/Data/RDF/Internal.hs:(311,1)-(312,56)              7.5    5.4
encodeEscapedIRI    Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(51,1)-(53,21)          7.0    3.4
parseLiteralBody    Data.RDF.Internal           src/Data/RDF/Internal.hs:(379,1)-(399,26)              6.2   10.5
encodeLiteral       Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(72,1)-(77,28)          4.6    3.5
parseBlankNodeLabel Data.RDF.Internal           src/Data/RDF/Internal.hs:(362,1)-(371,31)              2.8    2.5
parseScheme         Data.RDF.Internal           src/Data/RDF/Internal.hs:(288,1)-(295,33)              2.7    2.1
encodeQuad          Data.RDF.Encoder.NQuads     src/Data/RDF/Encoder/NQuads.hs:(43,1)-(53,49)          2.6    3.4
encodeBlankNode     Data.RDF.Encoder.Common     src/Data/RDF/Encoder/Common.hs:(87,1)-(89,24)          2.2    1.4
nf                  Criterion.Measurement.Types src/Criterion/Measurement/Types.hs:(272,1)-(275,27)    2.1    0.9
parseSubject        Data.RDF.Internal           src/Data/RDF/Internal.hs:(338,1)-(342,68)              1.8    0.7
eitherResult        Data.Attoparsec.Text.Lazy   Data/Attoparsec/Text/Lazy.hs:(99,1)-(101,77)           1.8    4.3
parseEscapedIRI     Data.RDF.Internal           src/Data/RDF/Internal.hs:358:1-54                      1.6    1.4
encodeRDFGraph      Data.RDF.Encoder.NQuads     src/Data/RDF/Encoder/NQuads.hs:(55,1)-(59,50)          1.1    1.4
parse               Data.Attoparsec.Text.Lazy   Data/Attoparsec/Text/Lazy.hs:(79,1)-(88,58)            0.8    1.1

I am not seeing any IntMap (but some HashMap).

gereeter commented 4 years ago

The profiles are attached:

I may be misunderstanding something about how the profiling works, but I'm very concerned to see boundKey and unbox in there. Is the profiling adding some overhead to those functions and/or stopping them from being inlined? They should literally not exist: boundKey is unwrapping a newtype (i.e. doing nothing in Core), and unbox is only applied in cases where the key came from Node or IntMap_, where the integer was unboxed anyway and there is nothing to do. I've never seen them when looking at Core.

gereeter commented 4 years ago

stopping them from being inlined

I just realized that neither of those functions has INLINE explicitly written out, and the profiling documentation says it will add cost centers to all functions not marked INLINE.

treeowl commented 4 years ago

The profiles are attached:

I may be misunderstanding something about how the profiling works, but I'm very concerned to see boundKey and unbox in there. Is the profiling adding some overhead to those functions and/or stopping them from being inlined? They should literally not exist: boundKey is unwrapping a newtype (i.e. doing nothing in Core), and unbox is only applied in cases where the key came from Node or IntMap_, where the integer was unboxed anyway and there is nothing to do. I've never seen them when looking at Core.

Overly aggressive cost-center profiling (e.g., -fprof-auto) will definitely hurt performance. -fprof-auto-exported is likely better, but make sure the box/unbox primitives are in a module not compiled like that. The safest bet is to add all the cost centers manually.

gereeter commented 4 years ago

After slapping some INLINEs and -fprof-auto-exported around (and running into the fact that you can't INLINE field accessors like boundKey), I got:

`master` branch ``` total time = 7.01 secs (7013 ticks @ 1000 us, 1 processor) total alloc = 5,308,097,088 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc split Matchbox.Chain src/Matchbox/Chain.hs:(34,1)-(36,34) 25.9 17.8 insert Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(517,1)-(525,33) 10.5 2.3 value Matchbox.Automaton src/Matchbox/Automaton.hs:(127,1)-(132,18) 8.5 0.7 unionWithKey Data.IntMap.Internal src/Data/IntMap/Internal.hs:(1092,1)-(1093,84) 7.0 15.4 differenceWithKey Data.IntMap.Internal src/Data/IntMap/Internal.hs:(1126,1)-(1127,39) 6.4 4.1 delta_insert Matchbox.Automaton src/Matchbox/Automaton.hs:(163,1)-(176,25) 6.2 17.1 unions Matchbox.Semiring src/Matchbox/Semiring.hs:26:1-25 5.6 4.1 intersectionWithKey Data.IntMap.Internal src/Data/IntMap/Internal.hs:(1289,1)-(1290,102) 4.4 2.8 rfc Matchbox.Decomp src/Matchbox/Decomp.hs:(79,1)-(94,52) 4.3 7.1 balanceR Data.Map.Internal src/Data/Map/Internal.hs:(4102,1)-(4121,50) 4.2 13.4 map Data.IntMap.Internal src/Data/IntMap/Internal.hs:(2420,1)-(2424,26) 2.9 5.6 unionWith Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(982,1)-(988,35) 2.0 0.8 filterWithKey Data.IntMap.Internal src/Data/IntMap/Internal.hs:(2599,1)-(2603,44) 2.0 1.2 ```
`direct-bounded-intmap` branch ``` total time = 7.09 secs (7093 ticks @ 1000 us, 1 processor) total alloc = 5,092,794,112 bytes (excludes profiling overheads) COST CENTRE MODULE SRC %time %alloc split Matchbox.Chain src/Matchbox/Chain.hs:(34,1)-(36,34) 25.1 18.6 insert Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(517,1)-(525,33) 9.7 2.4 value Matchbox.Automaton src/Matchbox/Automaton.hs:(127,1)-(132,18) 8.4 0.8 unionWith Data.IntMap.Lazy src/Data/IntMap/Lazy.hs:518:1-36 7.0 9.2 differenceWith Data.IntMap.Lazy src/Data/IntMap/Lazy.hs:667:1-46 6.4 4.0 delta_insert Matchbox.Automaton src/Matchbox/Automaton.hs:(163,1)-(176,25) 6.0 17.8 unions Matchbox.Semiring src/Matchbox/Semiring.hs:26:1-25 5.3 4.3 balanceR Data.Map.Internal src/Data/Map/Internal.hs:(4102,1)-(4121,50) 4.6 13.9 rfc Matchbox.Decomp src/Matchbox/Decomp.hs:(79,1)-(94,52) 4.5 7.3 intersectionWith Data.IntMap.Lazy src/Data/IntMap/Lazy.hs:877:1-50 4.2 3.6 mapLazy Data.IntMap.Internal src/Data/IntMap/Internal.hs:1261:1-44 3.0 4.9 unionWith Data.Map.Strict.Internal src/Data/Map/Strict/Internal.hs:(982,1)-(988,35) 2.0 0.9 filter Data.IntMap.Internal src/Data/IntMap/Internal.hs:2072:1-34 2.0 2.7 timesM Matchbox.Height src/Matchbox/Height.hs:(68,1)-(80,54) 1.1 0.7 map Matchbox.Map.Enum src/Matchbox/Map/Enum.hs:96:1-31 1.0 0.0 unionDisjointL Data.IntMap.Internal src/Data/IntMap/Internal.hs:(1453,1)-(1460,58) 0.4 1.7 ```

Now, I'm not sure I got everything right and I wasn't running on a particularly quiet machine, but at least I didn't see any boundKey, unbox, or xorCaches floating around. Overall, I'm not too surprised by the results:

jwaldmann commented 4 years ago

Re: -fprof-auto

I think there are two ways of using profiling info:

Manually adding cost centers seems quite heavy. Statistical profiling would work without instrumentation? But what's the status? https://gitlab.haskell.org/ghc/ghc/wikis/dwarf/status#statistical-profiling

gereeter commented 4 years ago

Diff Haddocks to make sure no functions or instances have gone missing.

I just finished this myself and patched up the last holes that I found. Could someone else double check?