Open treeowl opened 4 years ago
@gereeter, @sjakobi, please let me know what should we added and what, in your opinions, should be done before and after the merge.
We may want to revisit the recent decision to make minimum
and maximum
traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified. Alternatively, I guess we can probably be very clever and keep track of a minimum from the left (updated only when the next element is smaller) and a minimum from the right (updated when the previous element is not larger), and pick whichever is smallest in the end, preferring the one from the left in a tie.
product
should definitely preserve the order of the elements in case the Num
instance is a non-commutative ring. sum
should probably preserve the order of the elements in case someone does something a bit weird, but I could be convinced to relax that.
@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.
I keep wondering whether making the tree go entirely in order is really the right way. The obvious alternative is for the order to alternate from level to level, so that keys go to (say) the left whenever they're closer to the bound stored in the current node, whether that's a lower bound or an upper bound.
That would (potentially) go along with alternating whether to flip the sign of the bound when storing it, and flipping the sign of the key on each step down. I don't know if that would actually work or not, but it strikes me as plausible.
Edit: the sign flipping would be done using complement
, because -minBound = minBound
. The idea is that k < max
is the same as complement max < complement k
.
We may want to revisit the recent decision to make
minimum
andmaximum
traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified.
My understanding is that the new IntMap
doesn't suffer from the weird order reversal for negative keys that the old one had. The new haddocks on IntMap_
explain it quite clearly IMHO.
In consequence I believe that we don't lose anything by keeping minimum
and maximum
in-order.
We may want to revisit the recent decision to make
minimum
andmaximum
traverse strictly left to right; I believe we'll use less stack space here if we make their traversal order explicitly unspecified.My understanding is that the new
IntMap
doesn't suffer from the weird order reversal for negative keys that the old one had. The new haddocks onIntMap_
explain it quite clearly IMHO.In consequence I believe that we don't lose anything by keeping
minimum
andmaximum
in-order.
We don't get the weird order reversal of children, but we now have values in internal nodes, and their positions in the order depend on their level. To go strictly in order from a node whose bound is a maximum, we have to push the right subtree and the bound's associated value onto the stack before descending the left subtree. I don't know for sure if we can improve matters or not, but we might save a little by dealing with those values "as they come" to the extent we can. That surely applies to minimum
and maximum
, and I imagine it applies to sum
and product
too if we're careful....
Generate a final benchmark comparison. I prefer to see these interleaved
My latest commits that have benchmark data just used the output of bench-cmp.pl
. Is that fine?
Rewrite
IntSet
to match.
Tricky: I actually did this back when the work was out-of-tree, and although I'm sure with time (and newer GHC) one could press the optimizations more and make different design choices, it was slower than stock IntSet
. Probably worth trying again, but there is a large space of choices to try out:
Check whether the restructuring in #658 is a good idea
And #653. I was putting it off as non-essential, but especially since fromList
is one of only two regressions according to my up-to-date benchmarks (oddly, not insert
even though fromList
is just a loop over insert
), I was keeping it in mind. But really, this can be post-merge.
Continue to add internal documentation. For example,
isSubmapOf
The two big documentation blocks I was working to get done before merging were about deletions and merges, and isSubmapOf
is basically a merge. Specific applications to specific functions can maybe wait until after merge, but those should probably be pre-merge.
We generally aim for 80 characters or fewer.
Aw, darn. I was wrapping my documentation at 100 characters. That at least is easy to fix. While there are definitely easy-to-wrap lines, I'm not so sure about some of the worst offenders. Maybe make an effort pre-merge but don't worry about it as a blocker?
Evaluate whether we should implement
alter
"by hand" as we used to. If not, then we should probably implement it usingalterF @Identity
.
I vaguely remember finding the trivial two-pass algorithm surprisingly faster that a specialized implementation? But it has been a while, and my memory is fuzzy. This will probably be pre-merge just because I'm curious now.
@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.
Jonathan "gereeter" S.
@gereeter, @sjakobi, please let me know what should [be] added
Pre-merge:
union
. I have a few ideas and experiments to run, but it has the worst regressions of the bunch: on my machine (Linux, GHC 8.8.1) it gets near 50% slower on the worst cases (although it does get faster on most cases).runMissingAll
. As I mentioned in https://github.com/haskell/containers/pull/340#discussion_r361549976, the merge tactics can't use the main functions since they need access to all parts of the mutual recursion, but it is fairly easy to extract the top-level part of the WhenMissing
bundle.isSubmapOf
can be more thoroughly tested: the standard 100 tests don't get close to hitting all the cases in a function that is designed to short-circuit everywhere.)Post-merge:
merge
. It can be a lot better than mergeA
due to being pure (well, the optimizations probably would apply to commutative monads, too). See how different the implementation of mergeA
and union
is, for example. Post-merge since it is a fairly niche function, works, and isn't used to implement union
, difference
, and intersection
yet.fromAscList
and friends. See https://github.com/haskell/containers/pull/340#issuecomment-247811863.Bound
idea from https://github.com/haskell/containers/pull/340#issuecomment-248095218, but didn't implement Carry
. As I said, however, I don't think that it helps that much. I'd love to see some sort of linearity analysis instead, but that is way more complicated.runMissingAll :: WhenMissing f a b -> IntMap a -> f (IntMap b)
, plus maybe a pure varianttraverseMaybeWithKey :: (Key -> a -> f (Maybe b)) -> IntMap a -> f (IntMap b)
I don't know for sure if we can improve matters or not, but we might save a little by dealing with those values "as they come" to the extent we can.
That's why I implemented elem
, although I suspect the win is greater in a case like that since you can terminate early as opposed to just having less on the stack. It is also kind of the reason why merge
can be so much better than mergeA
.
Additions for "before merge":
IntMap
without (undue) performance degradation. (I can trigger the build jobs for this)
- [ ] Check whether the restructuring in #658 is a good idea for the new representation and, if so, reinstate it.
Can someone estimate what kind of speedup we can expect from this? If it's, say, < 20%, I'd say it can wait.
- [ ] Continue to add internal documentation. For example,
isSubmapOf
and the like are pretty confusing.
+0.5 for adding documentation before the merge – just in case @gereeter gets hit by a bus.
- [ ] Shorten long lines when reasonable. We generally aim for 80 characters or fewer. There's flexibility for unusual situations, but this code has some very long lines that could be broken cleanly.
+0.1 for doing this before the merge in order to reduce noise in git blame
. Not a blocker IMO.
- [ ] Use
liftA*
inmergeA
.
+0.1 as above
- [ ] Evaluate whether we can get the same performance using
foldMapDefault
,fmapDefault
, and equivalent tricks for keyed versions. That would cut down on source code.
This can easily be delayed IMHO
Generate a final benchmark comparison. I prefer to see these interleaved
My latest commits that have benchmark data just used the output of
bench-cmp.pl
. Is that fine?
Yeah, that's fine.
Rewrite
IntSet
to match.Tricky: I actually did this back when the work was out-of-tree, and although I'm sure with time (and newer GHC) one could press the optimizations more and make different design choices, it was slower than stock
IntSet
. Probably worth trying again, but there is a large space of choices to try out:
- When a min or max shows up higher in the tree, should it be repeated in the bitmaps?
- Should the bitmaps be aligned, representing series of integers with the same bit prefix, or should they start at the min/max for the node to not have waste zeros?
- Is a singleton node represented as itself with two empty children or as a one-hot bitmap?
- etc.
The most conservative approach, I think, is to match IntMap
very directly, storing aligned bitmaps where IntMap
stores values and using masked elements for keys. Did you try that version?
Check whether the restructuring in #658 is a good idea
And #653. I was putting it off as non-essential, but especially since
fromList
is one of only two regressions according to my up-to-date benchmarks (oddly, notinsert
even thoughfromList
is just a loop overinsert
), I was keeping it in mind. But really, this can be post-merge.
fromList
regression is rather surprising!Continue to add internal documentation. For example,
isSubmapOf
The two big documentation blocks I was working to get done before merging were about deletions and merges, and
isSubmapOf
is basically a merge. Specific applications to specific functions can maybe wait until after merge, but those should probably be pre-merge.
You've definitely gone a long way toward getting documentation in shape. Thanks.
We generally aim for 80 characters or fewer.
Aw, darn. I was wrapping my documentation at 100 characters. That at least is easy to fix. While there are definitely easy-to-wrap lines, I'm not so sure about some of the worst offenders. Maybe make an effort pre-merge but don't worry about it as a blocker?
Hard-to-wrap lines are definitely a low priority.
Evaluate whether we should implement
alter
"by hand" as we used to. If not, then we should probably implement it usingalterF @Identity
.I vaguely remember finding the trivial two-pass algorithm surprisingly faster that a specialized implementation? But it has been a while, and my memory is fuzzy. This will probably be pre-merge just because I'm curious now.
The generic version will always be better in the "key absent, don't insert" case, because we don't have anything like a fast setjmp
/longjmp
. In other cases, specialized versions will likely be better.
@gereeter, please also let me know how you wish to be credited in the changelog and release announcement.
Jonathan "gereeter" S.
@gereeter, @sjakobi, please let me know what should [be] added
I'll add those.
Additions for "before merge":
- Check that GHC can use the new
IntMap
without (undue) performance degradation. (I can trigger the build jobs for this)
Thanks.
- [ ] Check whether the restructuring in #658 is a good idea for the new representation and, if so, reinstate it.
Can someone estimate what kind of speedup we can expect from this? If it's, say, < 20%, I'd say it can wait.
With the previous representation, the performance improvement was small, but the code was (in my opinion) much easier to understand.
- [ ] Evaluate whether we can get the same performance using
foldMapDefault
,fmapDefault
, and equivalent tricks for keyed versions. That would cut down on source code.This can easily be delayed IMHO
Of course it can, but it shouldn't take long to check and reducing the enormous quantity of source code would be very nice.
What if the stored bounds in nodes are all minima, and the passed in ones are all maxima, with complements being taken (of bounds and keys) as necessary to maintain this? Ignoring the top level for a moment, lookup would go like this:
We have the stored minimum, the passed maximum, and the key. We xor
the key with each and compare, choosing the left child if closer to the minimum and vice versa. Now in the left child, we would naturally store a new maximum, but instead we store its complement as a new minimum, and pass down the complement of the minimum and of the key. In the right child, we store the new minimum and pass down the key and maximum unchanged.
Would that work, @gereeter? I'd really love to cut down on the code size here, but my intuition for the structure is not yet wonderful.
Would that work, @gereeter?
Yes, technically, but I'd expect a large performance hit for needlessly complementing all over the place. And while I have the bias of understanding what is going on now, I think it would be a lot more confusing, too.
The biggest thing to do for cutting down on code size is unifying functions that do similar things (Lazy
and Strict
, variants of one function, etc.). With judicious INLINE
, this doesn't affect things at all. If we need to deduplicate code for different types of nodes, the type class approach seems much more appropriate (with the two-parameter hack to avoid type families), and performance can be kept the same with a bunch of SPECIALIZE
pragmas.
(On that note: local functions currently don't have type signatures for the most part because of the lack of ScopedTypeVariables
. Is that something we can get away with using? I see that it is gated in Data.Sequence.Internal
.)
my intuition for the structure is not yet wonderful.
Anything in particular? Any confusion is an opportunity for better documentation in my mind.
Would that work, @gereeter?
Yes, technically, but I'd expect a large performance hit for needlessly complementing all over the place.
I'd be surprised. Basic operations like complement
are usually extremely cheap; I'd expect their cost to be totally trivial compared to chasing pointers.
And while I have the bias of understanding what is going on now, I think it would be a lot more confusing, too.
That seems a much more likely problem. But the source duplication of the current scheme is also a potential source of errors and confusion. It seems to take some time to figure out what flops and how.
The biggest thing to do for cutting down on code size is unifying functions that do similar things (
Lazy
andStrict
, variants of one function, etc.). With judiciousINLINE
, this doesn't affect things at all. If we need to deduplicate code for different types of nodes, the type class approach seems much more appropriate (with the two-parameter hack to avoid type families), and performance can be kept the same with a bunch ofSPECIALIZE
pragmas.
That certainly seems a promising approach to the source duplication, though it does nothing for object code size.
(On that note: local functions currently don't have type signatures for the most part because of the lack of
ScopedTypeVariables
. Is that something we can get away with using? I see that it is gated inData.Sequence.Internal
.)
Let's keep those gated for now … the current approach to scoped type variables is controversial, to say the least, with both Eisenberg and Kmett preferring a different sort. Again, a macro in containers.h
is the way to do it.
my intuition for the structure is not yet wonderful.
Anything in particular? Any confusion is an opportunity for better documentation in my mind.
I think it could be helpful to have a few diagrams demonstrating basic operations, but those can be tough in text. Feel free to include something separate in HTML+SVG, LaTeX, or whatever, as long as it's not too big. But that can certainly wait till after the merge.
Let's keep those gated for now
Ah, that could be a problem for reducing duplication. Without type signatures, GHC won't infer maximally polymorphic types for local definitions and we can't use SPECIALIZE
pragmas, at least as far as I've figured out.
You may need to tie it to the type family usage to avoid a nasty NoMonoLocalBinds
. And you don't need SPECIALIZE
in the fallback.
Other random API "omission": isSubmapOfWithKey :: (Key -> a -> b -> Bool) -> IntMap a -> IntMap b -> Bool
. I don't see a use for it, but it would match all the other WithKey
functions. It is possible to emulate with mergeA
and a constant Applicative
, but as with a lot of cases like that, you want to short-circuit based on the uppermost values.
Other random API "omission":
isSubmapOfWithKey :: (Key -> a -> b -> Bool) -> IntMap a -> IntMap b -> Bool
. I don't see a use for it, but it would match all the otherWithKey
functions. It is possible to emulate withmergeA
and a constantApplicative
, but as with a lot of cases like that, you want to short-circuit based on the uppermost values.
Yes, we should consider that, ideally with input from the libraries list. I wouldn't call it a priority myself.
I'd like to clarify something: the reason I care about object code size is not that it takes up space on disk or in RAM, but that it takes up space in cache and involves more code in each operation. This can have important performance consequences. I played around a bit and got insert
and lookup
working with the alternative approach. I made a brief unsuccessful attempt at delete
; I'll try again when I get a chance. With this approach, using a type family introduces constraints of the form t ~ Comp (Comp t)
all over the place, so I used the two-variable approach instead.
-- @P C@ tags a value as representing itself;
-- @C P@ tags a value as representing its complement.
data P
data C
-- A key tagged to indicate whether it represents itself
-- or its complement.
newtype TKey t u = TKey {getKey :: Key}
deriving (Eq, Ord)
-- Each node stores the minimum value of that subtree. That may actually be
-- a *maximum* when considering what's represented.
newtype IntMap a = IntMap (IntMap_ P C a) deriving (Eq, Show)
data IntMap_ t u a
= NonEmpty {-# UNPACK #-} !(TKey t u) a !(Node u t a) | Empty deriving (Eq)
data Node t u a = Bin {-# UNPACK #-} !(TKey t u) a !(Node u t a) !(Node t u a) | Tip deriving (Eq)
deriving instance Show a => Show (IntMap_ P C a)
deriving instance Show a => Show (IntMap_ C P a)
deriving instance Show a => Show (Node P C a)
deriving instance Show a => Show (Node C P a)
-- This type is just used for convenience when defining the Show instances for TKey.
data KeyRep = PKey !Key | CKey !Key deriving Show
instance Show (TKey P C) where
showsPrec p (TKey k) = showsPrec p (PKey k)
instance Show (TKey C P) where
showsPrec p (TKey k) = showsPrec p (CKey (complement k))
-- | Take the complement of a key
compKey :: TKey t u -> TKey u t
compKey (TKey k) = TKey (complement k)
xor :: TKey t u -> TKey t u -> Word
xor (TKey k) (TKey b) = fromIntegral $ Bits.xor k b
lookup :: forall a. Key -> IntMap a -> Maybe a
lookup k0 (IntMap m0) = start (TKey k0) m0
where
start :: TKey t u -> IntMap_ t u a -> Maybe a
start !k Empty = Nothing
start k (NonEmpty min minV node)
| k < min = Nothing
| k == min = Just minV
| otherwise = go (compKey k) (xor k min) node
go :: TKey t u -> Word -> Node t u a -> Maybe a
go !k !_ Tip = Nothing
go k xorCache (Bin min minV l r)
| k > min = if xorCache < xorCacheMin
then go k xorCache r
else go (compKey k) xorCacheMin l
| k < min = Nothing
| otherwise = Just minV
where xorCacheMin = xor k min
insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v (IntMap m0) = IntMap $ start (TKey k0) m0
where
start :: TKey t u -> IntMap_ t u a -> IntMap_ t u a
start k Empty = NonEmpty k v Tip
start k (NonEmpty minK minV root)
| k == minK = NonEmpty k v root
| k > minK = NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) root
| otherwise = NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root
go :: Word -> TKey t u -> TKey t u -> Node t u a -> Node t u a
go !xorCache !maxK !k Tip = Bin k v Tip Tip
go xorCache maxK k (Bin minK minV l r)
| minK < k =
if xorCache < xorCacheMin
then Bin minK minV l (go xorCache maxK k r)
else Bin minK minV (go xorCacheMin (compKey minK) (compKey k) l) r
| k < minK =
if xor minK maxK < xorCacheMin
then Bin k v Tip (Bin minK minV l r)
else Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
| otherwise = Bin minK v l r
where
xorCacheMin :: Word
xorCacheMin = xor k minK
insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN !xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
| xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
| otherwise = Bin minK minV l (insertMaxN xorcache k v r)
The traverseWithKey
function I wrote still uses alternating helpers. I think it would be possible to avoid this with an extra conditional; I don't know if that would hurt performance.
traverseWithKey
:: forall f a b. Applicative f
=> (Key -> a -> f b) -> IntMap a -> f (IntMap b)
traverseWithKey _ (IntMap Empty) = pure empty
traverseWithKey f (IntMap (NonEmpty k v root)) =
liftA2 (\v' root' -> IntMap (NonEmpty k v' root')) (f (getKey k) v) (goC root)
where
goC :: Node C P a -> f (Node C P b)
goC Tip = pure Tip
goC (Bin k v l r) = liftA3 (\r' l' v' -> Bin k v' l' r') (goC r) (goP l) (f (complement (getKey k)) v)
goP :: Node P C a -> f (Node P C b)
goP Tip = pure Tip
goP (Bin k v l r) = liftA3 (Bin k) (f (getKey k) v) (goC l) (goP r)
I can't figure out what LLVM is doing with this code.... I wonder if it actually expands it to yours, suggesting maybe I'm wasting my time. But benchmarking will tell, some day, maybe.
@treeowl Could you explain a bit how you can use complement
to track minima and maxima?
Also, have you actually tested your code? I'm just stumped how it could possibly work.
@sjakobi insert
and lookup
are well tested. I'll upload my validity checker later. I still don't understand how deletion works in @gereeter's code, so I couldn't get that working. I don't use complement
to track minima and maxima. It's a trick that means we don't have to track them in all the same cases, and I think we need to do so in fewer overall, but that could be wrong.
The general gist is that we always store the minimum in each node, but that may ultimately represent its complement, depending on how many left branches we've taken (with the initial jump down considered a left branch).
@sjakobi, here's my draft validity tester:
valid :: (Eq a, Show a) => IntMap a -> Property
valid m0@(IntMap m0_) =
-- This will catch errors very reliably, but it's coarse-grained
fromList (toList m0) === m0 .&&.
-- This catches errors in a fine-grained way, but relies on being implemented
-- right.
start m0_
where
start :: IntMap_ P C a -> Property
start Empty = property ()
start (NonEmpty _minK _minV Tip) = property ()
start (NonEmpty minK _minV (Bin maxK _maxV l r)) =
go maxK (compKey minK) r .&&.
go minK (compKey maxK) l
go :: TKey t u -> TKey t u -> Node t u a -> Property
go xmin xmax _
| xmin >= xmax = counterexample "max not greater" False
go _ _ Tip = property ()
go xmin xmax (Bin minK _minV l r) =
xor xmin minK > xor xmax minK .&&.
counterexample "not between" (xmin < minK .&&. minK < xmax) .&&.
go (compKey xmax) (compKey minK) l .&&.
go minK xmax r
Thanks to @gereeter's recent cleanups and explanations, I was finally able to adapt delete
to the complementy version and get it to pass the tests.
Complementy version of compareMinBound
:
{-# INLINE compareMinBound #-}
compareMinBound :: TKey t u -> TKey t u -> BoundOrdering
compareMinBound k min
| k > min = InBound
| k < min = OutOfBound
| otherwise = Matched
data BoundOrdering = InBound | OutOfBound | Matched deriving (Eq)
Complementy deletion:
delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
OutOfBound -> m
Matched -> IntMap (nodeToCompMap root)
where k = TKey k0
deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
| otherwise -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
OutOfBound -> n
Matched -> extractBinR l r
where xorCacheMin = xor k min
extractBinR :: Node u t a -> Node t u a -> Node t u a
extractBinR Tip r = r
extractBinR (Bin max maxV innerL innerR) r =
let NE min minV l = deleteMaxNode max maxV innerL innerR
in Bin (compKey min) minV l r
deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV Tip Tip = NE min minV Tip
deleteMaxNode !min minV (Bin max maxV l r) Tip = NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
in NE max maxV (Bin min minV l inner)
nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
let NE max maxV r = deleteMaxNode min minV innerL innerR
in NonEmpty (compKey max) maxV r
data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)
That's just about half as much code.
Quick benchmarks of the complementing code compared to not:
Benchmark Runtime change Original runtime
lookup hit +13.45% 2.16e-04
lookup miss +15.59% 2.28e-04
insert empty +22.15% 2.39e-04
delete hit +35.74% 1.87e-04
delete miss +4.74% 7.49e-04
fromList -1.39% 2.96e-04
Minimum -1.39%
Average +14.43%
Maximum +35.74%
I just copied your code into a project, added fromList
, and pruned the benchmarks that weren't implemented yet. I didn't make any huge effort to remove external interference, and of course things improve over time, so take these numbers with a bucket of salt.
Quick benchmarks of the complementing code compared to not:
The "original" is your code for #340, right?
Yes.
Do you think you could upload whatever you used to run that benchmark so I can play with it?
I find the size of these differences you report rather shocking, especially for deletion hit.
Thanks! That's very helpful. I'm going to play with it a bit later. One thing I want to do is see how the native code generator compares to LLVM for both implementations. A very brief look earlier suggested they were doing somewhat different things with the complements, but I don't understand what LLVM was doing.
I've improved the complementy code somewhat, but I still can't match yours, particularly for lookup
. It may just fundamentally be slower. Oh well. The improvements to insertion and deletion, in case someone wants to swing around some day and try some other magic:
insert :: forall a. Key -> a -> IntMap a -> IntMap a
insert k0 v0 (IntMap m0) = IntMap $ start (TKey k0) v0 m0
where
start :: TKey t u -> a -> IntMap_ t u a -> IntMap_ t u a
start !k v Empty = NonEmpty k v Tip
start !k v (NonEmpty minK minV root) = case compareMinBound k minK of
Matched -> NonEmpty k v root
InBound -> NonEmpty minK minV $ go (xor k minK) (compKey minK) (compKey k) v root
OutOfBound -> NonEmpty k v $ insertMaxN (xor k minK) (compKey minK) minV root
go :: Word -> TKey t u -> TKey t u -> a -> Node t u a -> Node t u a
go !_xorCache !_maxK !k v Tip = Bin k v Tip Tip
go xorCache maxK k v (Bin minK minV l r) = case compareMinBound k minK of
InBound
| xorCache < xorCacheMin
-> Bin minK minV l (go xorCache maxK k v r)
| otherwise
-> Bin minK minV (go xorCacheMin (compKey minK) (compKey k) v l) r
OutOfBound
| xor minK maxK < xorCacheMin
-> Bin k v Tip (Bin minK minV l r)
| otherwise
-> Bin k v (insertMaxN xorCacheMin (compKey minK) minV l) r
Matched -> Bin minK v l r
where
xorCacheMin :: Word
xorCacheMin = xor k minK
insertMaxN :: Word -> TKey t u -> a -> Node t u a -> Node t u a
insertMaxN xorcache k v Tip = Bin k v Tip Tip
insertMaxN xorcache k v (Bin minK minV l r)
| xor k minK < xorcache = Bin minK minV (Bin (compKey k) v r l) Tip
| otherwise = Bin minK minV l (insertMaxN xorcache k v r)
delete :: forall a. Key -> IntMap a -> IntMap a
delete !_ (IntMap Empty) = IntMap Empty
delete !k0 m@(IntMap (NonEmpty min minV root)) = case compareMinBound k min of
InBound -> IntMap (NonEmpty min minV (deleteNode (compKey k) (xor k min) root))
OutOfBound -> m
Matched -> IntMap (nodeToCompMap root)
where k = TKey k0
deleteNode :: TKey t u -> Word -> Node t u a -> Node t u a
deleteNode !_ !_ Tip = Tip
deleteNode !k !xorCache n@(Bin min minV l r) = case compareMinBound k min of
InBound | xorCache < xorCacheMin -> Bin min minV l (deleteNode k xorCache r)
| otherwise -> Bin min minV (deleteNode (compKey k) xorCacheMin l) r
OutOfBound -> n
Matched -> extractBin l r
where xorCacheMin = xor k min
extractBin :: Node u t a -> Node t u a -> Node t u a
extractBin Tip r = r
extractBin (Bin max maxV innerL innerR) r =
let NE min minV l = deleteMaxNode max maxV innerL innerR
in Bin (compKey min) minV l r
deleteMaxNode :: TKey t u -> a -> Node u t a -> Node t u a -> NonEmptyIntMap_ t u a
deleteMaxNode !min minV m1 Tip = case m1 of
Tip -> NE min minV Tip
Bin max maxV l r -> NE (compKey max) maxV (Bin min minV r l)
deleteMaxNode !min minV l (Bin innerMin innerMinV innerL innerR) =
let NE max maxV inner = deleteMaxNode innerMin innerMinV innerL innerR
in NE max maxV (Bin min minV l inner)
nodeToCompMap :: Node t u a -> IntMap_ u t a
nodeToCompMap Tip = Empty
nodeToCompMap (Bin min minV innerL innerR) =
let NE max maxV r = deleteMaxNode min minV innerL innerR
in NonEmpty (compKey max) maxV r
data NonEmptyIntMap_ t u a = NE {-# UNPACK #-} !(TKey t u) a !(Node t u a)
I wonder whether the smaller, complement
-based code could perform better than the code from #340 in "real" applications as opposed to micro-benchmarks. In micro-benchmarks, I'd suspect that the repeated application of the same functions causes even the huge functions from #340 to stay cached. This is pure speculation though, and I'm not very experienced with low-level performance things.
Unfortunately I haven't found any applications that make heavy use of IntMap
and would be easier to benchmark and profile than GHC so far. The Clash compiler relies heavily on IntMap
but it also depends on the ghc
library which makes building it with a modified containers
tricky.
Another reason why I've been slightly worried about #340 is that the increased amount of code ultimately also offers more potential for (performance) bugs.
I'm also not a big fan of the tendency to reduce lines of code via extra function arguments that must be inlined by the compiler for proper performance. IMHO that makes the code less readable and more fickle with regards to performance and compiler changes…
@sjakobi the extra arguments are indeed unpleasant, though I'm not terribly worried about fragility in most cases. I really don't like having any more source code duplication than necessary. What we'd really like is probably something like
lookup# :: Key -> IntMap a -> (# (##) | a #)
lookup k m = case lookup# k m of
(# | a #) -> Just a
_ -> Nothing
{-# INLINE lookup #-}
member k m = case lookup# k m of
(# | _ #) -> True
_ -> False
But the compat story is something of a mess and will surely involve double-barreled continuations anyway. We still might want to do it though.
I just had a radically simple idea that definitely has its own trade-offs but might be worth considering if it can actually be implemented easily.
data IntMap a
= Bin
{ _minK :: !Int
_minV :: a
_lft :: !(IntMap a)
_maxK :: !Int
_maxV :: a
_rght :: !(IntMap a) }
| Tip
| Single !Int a
This is quite an enormous Bin
node, which is a bit scary, but it seems kind of nice in some ways:
haven't found any applications that make heavy use of IntMap
I made this a while ago: https://gitlab.imn.htwk-leipzig.de/waldmann/containers-benchmark
@treeowl Could you demonstrate a bit how this structure works? How do minimum and maximum keys relate to those in the children?
@jwaldmann That looks very useful! We should try to get that into the containers
source tree somehow.
To build this with a custom containers
, I added this cabal.project
:
packages: ., ../containers/containers
and ran
cabal test -O2
(The benchmark executable is the main
testsuite)
Comparing 205e2a8591e57ee465bcc170b107a2463740396f (the current HEAD commit in #340) against #340's base commit f7e27e6e, the new IntMap
seems slightly slower: On a quiet machine I'm measuring 3.5s vs. 3.6s (using bench
to produce timings).
Profiling seems to put the blame mostly on Data.IntMap.Lazy.{union,difference}With
:
Interestingly, allocations for unionWith
seem to have decreased remarkably!
The profiles are attached: base-O2.prof.txt, gereeter-O2.prof.txt.
Despite the name, this is also testing IntMap: https://github.com/jwaldmann/containers/blob/intset%3Dword/containers-tests/benchmarks/IntSet.hs#L53
@jwaldmann This also looks very interesting. It would be very nice if you could make a PR for your benchmark patches!
It occurs to me that fgl
may also be a good place to look for benchmarks.
It occurs to me that
fgl
may also be a good place to look for benchmarks.
Maybe. I don't see any benchmarks in the repo though.
It occurs to me that
fgl
may also be a good place to look for benchmarks.Maybe. I don't see any benchmarks in the repo though.
rdf
relies on fgl
and has benchmarks… might be worth a spin…
on using fgl via rdf: yes - but in their current state, rdf benchmarks take ages, because of[500,1000..100000]
in https://github.com/TravisWhitaker/rdf/blob/master/bench/Main.hs#L42 .
I replaced this list with map (10^) [3..5]
and got
Tue Jan 21 11:37 2020 Time and Allocation Profiling Report (Final)
bench-rdf-prof +RTS -p -s -h -RTS
total time = 59.60 secs (59600 ticks @ 1000 us, 1 processor)
total alloc = 75,388,764,824 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
encodeIRI Data.RDF.Encoder.Common src/Data/RDF/Encoder/Common.hs:(55,1)-(63,25) 15.3 19.5
parseNQuads Data.RDF.Parser.NQuads src/Data/RDF/Parser/NQuads.hs:(66,1)-(68,22) 12.3 15.6
option Data.Attoparsec.Combinator Data/Attoparsec/Combinator.hs:90:1-25 11.5 10.1
maybeBuilder Data.RDF.Encoder.Common src/Data/RDF/Encoder/Common.hs:(46,1)-(49,58) 9.5 4.6
parseHost Data.RDF.Internal src/Data/RDF/Internal.hs:(311,1)-(312,56) 7.5 5.4
encodeEscapedIRI Data.RDF.Encoder.Common src/Data/RDF/Encoder/Common.hs:(51,1)-(53,21) 7.0 3.4
parseLiteralBody Data.RDF.Internal src/Data/RDF/Internal.hs:(379,1)-(399,26) 6.2 10.5
encodeLiteral Data.RDF.Encoder.Common src/Data/RDF/Encoder/Common.hs:(72,1)-(77,28) 4.6 3.5
parseBlankNodeLabel Data.RDF.Internal src/Data/RDF/Internal.hs:(362,1)-(371,31) 2.8 2.5
parseScheme Data.RDF.Internal src/Data/RDF/Internal.hs:(288,1)-(295,33) 2.7 2.1
encodeQuad Data.RDF.Encoder.NQuads src/Data/RDF/Encoder/NQuads.hs:(43,1)-(53,49) 2.6 3.4
encodeBlankNode Data.RDF.Encoder.Common src/Data/RDF/Encoder/Common.hs:(87,1)-(89,24) 2.2 1.4
nf Criterion.Measurement.Types src/Criterion/Measurement/Types.hs:(272,1)-(275,27) 2.1 0.9
parseSubject Data.RDF.Internal src/Data/RDF/Internal.hs:(338,1)-(342,68) 1.8 0.7
eitherResult Data.Attoparsec.Text.Lazy Data/Attoparsec/Text/Lazy.hs:(99,1)-(101,77) 1.8 4.3
parseEscapedIRI Data.RDF.Internal src/Data/RDF/Internal.hs:358:1-54 1.6 1.4
encodeRDFGraph Data.RDF.Encoder.NQuads src/Data/RDF/Encoder/NQuads.hs:(55,1)-(59,50) 1.1 1.4
parse Data.Attoparsec.Text.Lazy Data/Attoparsec/Text/Lazy.hs:(79,1)-(88,58) 0.8 1.1
I am not seeing any IntMap (but some HashMap).
The profiles are attached:
I may be misunderstanding something about how the profiling works, but I'm very concerned to see boundKey
and unbox
in there. Is the profiling adding some overhead to those functions and/or stopping them from being inlined? They should literally not exist: boundKey
is unwrapping a newtype (i.e. doing nothing in Core), and unbox
is only applied in cases where the key came from Node
or IntMap_
, where the integer was unboxed anyway and there is nothing to do. I've never seen them when looking at Core.
stopping them from being inlined
I just realized that neither of those functions has INLINE
explicitly written out, and the profiling documentation says it will add cost centers to all functions not marked INLINE
.
The profiles are attached:
I may be misunderstanding something about how the profiling works, but I'm very concerned to see
boundKey
andunbox
in there. Is the profiling adding some overhead to those functions and/or stopping them from being inlined? They should literally not exist:boundKey
is unwrapping a newtype (i.e. doing nothing in Core), andunbox
is only applied in cases where the key came fromNode
orIntMap_
, where the integer was unboxed anyway and there is nothing to do. I've never seen them when looking at Core.
Overly aggressive cost-center profiling (e.g., -fprof-auto
) will definitely hurt performance. -fprof-auto-exported
is likely better, but make sure the box/unbox primitives are in a module not compiled like that. The safest bet is to add all the cost centers manually.
After slapping some INLINE
s and -fprof-auto-exported
around (and running into the fact that you can't INLINE
field accessors like boundKey
), I got:
Now, I'm not sure I got everything right and I wasn't running on a particularly quiet machine, but at least I didn't see any boundKey
, unbox
, or xorCache
s floating around. Overall, I'm not too surprised by the results:
unionWith
(percentage is the same, but unionDisjoint
also shows up a bit). This is a merge-heavy workload, it seems, and union
has the last known majorly regressing cases, hence the "optimize union
more" item on the TODO list for this issue.IntMap
allocation is way down. This is as it should be, since the structure is more compact. It really makes the GHC test failures that much more confusing, though.Re: -fprof-auto
I think there are two ways of using profiling info:
rdf
use IntMap
at all). For that, anything goes?Manually adding cost centers seems quite heavy. Statistical profiling would work without instrumentation? But what's the status? https://gitlab.haskell.org/ghc/ghc/wikis/dwarf/status#statistical-profiling
Diff Haddocks to make sure no functions or instances have gone missing.
I just finished this myself and patched up the last holes that I found. Could someone else double check?
This ticket summarizes things that should be done either before or soon after the enormous
IntMap
replacement in #340.Before merge
[x] Reinstate the export list in
Data.IntMap.Internal
.[x] Add an export list to
Data.IntMap.Merge.Internal
.[ ] Diff Haddocks to make sure no functions or instances have gone missing.
[ ] Generate a final benchmark comparison. I prefer to see these interleaved or side by side rather than one whole set after another, but I won't insist on it. This should go in the commit message for the final squashed commit.
[ ] Try a bit harder to optimize
union
.[x] See what non-merge functions can be defined in terms of merge tactics and
runMissingAll
. (More would require a restructuring of the module boundaries.)[x] Improve test coverage, especially of merge tactics and
isSubmapOf
. (Side question: should we bump up the number of test cases Travis asks for?)[x] Add internal documentation for merges and deletion.
[x] Reinstate rewrite rules and the phased
INLINE
andNOINLINE
directives needed to make them work. Many of them (e.g., map fusion rules and map/coerce) are unconditional improvements. Others (relating to list conversions) are likely desirable.[ ] Check that GHC can use the new IntMap without (undue) performance degradation. (@sjakobi has offered to trigger the build jobs for this.)
[x] See if anything can be done about the object code duplication resulting from the alternating levels. Is there some clever way to make them uniform enough to share code? (Sadly, the answer appears to be no, for reasons that are not entirely obvious. Flipping over to the
complement
of the key each time we go left seems to hurt lookups significantly, although not enormously.)After merge
[ ] See if anything can be done about the source code duplication resulting from the alternating levels. Is there some clever way to make them uniform enough to share code?
[ ] Rewrite
IntSet
to match. Hopefully this is fairly mechanical, replacing values by bitmaps. Once this is done, get rid of bit twiddling utilities that are no longer used.[ ] Pick up #653 and see how well it works with the new representation.
[ ] Optimize
merge
.[ ] Resolve the strictness guarantees of
fromAscList
and friends. See #473 and https://github.com/haskell/containers/pull/340#issuecomment-247811863.[ ] Consider using more type wrappers for safety.
[ ] Evaluate using unboxed sums for intermediates. These are not always a win, but sometimes are.
[ ] Expand the benchmarks to cover more functions. Check for major regressions that the current benchmarks missed.
[ ] Consider adding as permanent APIs things that came up in the implementation:
runMissingAll :: WhenMissing f a b -> IntMap a -> f (IntMap b)
, plus maybe a pure variant. (DF: I'm not so convinced yet. What's it for?)traverseMaybeWithKey :: (Key -> a -> f (Maybe b)) -> IntMap a -> f (IntMap b)
(DF: yes, we most definitely want this by some name or other.)dropMatched :: WhenMatched f a b c
firstMatched :: WhenMatched f a b a
(andsecondMatched :: WhenMatched f a b b
)fold
variants that can ignore or be otherwise loose with ordering. (see https://github.com/haskell/containers/issues/698#issuecomment-569558952, but more generally)Uncategorized
[ ] Check whether the restructuring in #658 is a good idea for the new representation and, if so, reinstate it.
[ ] Continue to add internal documentation. For example,
isSubmapOf
and the like are pretty confusing.[ ] Shorten long lines when reasonable. We generally aim for 80 characters or fewer. There's flexibility for unusual situations, but this code has some very long lines that could be broken cleanly.
[x] Use
liftA*
inmergeA
.[ ] Evaluate whether we can get the same performance using
foldMapDefault
,fmapDefault
, and equivalent tricks for keyed versions. That would cut down on source code.[ ] Evaluate whether we should implement
alter
"by hand" as we used to. If not, then we should probably implement it usingalterF @Identity
.