Closed: meooow25 closed this 3 months ago
This is most impressive. I will do my best to review it promptly. In the meantime, I have a big favor to ask: I was struggling the other day to merge the branch for version 0.7 into master. Do you think you could figure out how to do that right?
`(prefix + mask)` is just a part of the key with an additional mask bit set; could this be abused for performance?
With a key and a compatible prefix, `xor` only yields a value at or below the level of the masking bit; whether the masking bit is set in the result determines the direction (0 is right).
```
    1800 | 0000011100001000
xor 1807 | 0000011100001111
---------+-----------------
         | 0000000000000111

    1800 | 0000011100001000
xor 1795 | 0000011100000011
---------+-----------------
         | 0000000000001011
```
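The worked examples above can be checked directly. A minimal sketch using the same values; the claim is that `xor` with a compatible key only produces bits at or below the mask bit, which is the lowest set bit of the combined prefix:

```haskell
import Data.Bits (xor, (.&.))

-- Values from the worked example above: 1800 is the combined prefix,
-- with its mask bit (the lowest set bit) at position 3, i.e. value 8.
maskBit :: Int
maskBit = 1800 .&. negate 1800

main :: IO ()
main = do
  print maskBit           -- 8
  print (1800 `xor` 1807) -- 7:  mask bit clear, one direction
  print (1800 `xor` 1795) -- 11: mask bit set, the other direction
```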
So the two comparison patterns could be reduced to the following:
```haskell
-- Single-key operation with eager failure
trimatch p k =
  let o = xor p k
  in case () of
       () | o .&. p == 0                -> goRight
          | unsafeShiftR o 1 .&. p == 0 -> goLeft
          | otherwise                   -> goFail

-- Merge
quadbranch p1 p2 =
  let o = xor p1 p2 `unsafeShiftR` 1
  in case () of
       () | p1 == p2      -> goEqual
          | o .&. p1 == 0 -> goBelowP1
          | o .&. p2 == 0 -> goBelowP2
          | otherwise     -> goDiffer
```
@treeowl I assume you mean #994. I don't know much about the CI setup but I'll take a look.
> With a key and a compatible prefix, `xor` only yields a value at or below the level of the masking bit; whether the masking bit is set in the result determines the direction (0 is right).
@BurningWitness it is unclear to me how that would work. For instance, `xor p k .&. p == 0`, which is the same as `k .&. p == p`, only tells us whether `p`'s 1-bits agree with `k`. But we also need to know whether the 0-bits agree.
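To make the objection concrete, here is a hypothetical counterexample (my numbers, not from the thread): a key that disagrees with one of `p`'s 0-bits above the mask bit still passes the proposed `o .&. p == 0` test.

```haskell
import Data.Bits (xor, (.&.))

-- p = 1800 = 0b11100001000, k = 1816 = 0b11100011000.
-- k disagrees with p at bit 4, which is a 0-bit of p above the mask bit,
-- yet the proposed test still reports a match.
falsePositive :: Bool
falsePositive = (1800 `xor` 1816) .&. 1800 == 0

main :: IO ()
main = print falsePositive -- True, even though 1816 is not under prefix 1800
```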
Probably worth thinking about, though; it would be great to find a more efficient set of bitwise ops to do the job.
Oh, true, `xor` only has that guarantee if the prefix matches the key, so I was wrong.
> Oh, true, `xor` only has that guarantee if the prefix matches the key, so I was wrong.
Ah well, if the prefix already matches, the branch into left or right is done with one comparison, so it's hard to imagine something faster than that.
What I'd like is to find a better way for the `nomatch` stuff.
I believe `nomatch` should be pretty much as fast as the old one now! (Thanks gcc and godbolt.)
Previous: ``i .&. (m `xor` (-m)) == p`` — 1 and, 1 xor, 1 negate, 1 compare
New: ``(i `xor` p) .&. (p `xor` (-p))`` — 1 and, 2 xor, 1 negate
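As a sketch of the new check under the combined-prefix encoding (my instantiation, with assumed roles: `i` is the key, `p` the prefix with its mask bit set, and the result compared against zero):

```haskell
import Data.Bits (xor, (.&.))

-- p `xor` negate p has exactly the bits strictly above p's lowest set
-- bit (the mask bit), so this tests whether i disagrees with the prefix
-- bits stored above the mask bit.
nomatch :: Int -> Int -> Bool
nomatch i p = (i `xor` p) .&. (p `xor` negate p) /= 0

main :: IO ()
main = do
  print (nomatch 1795 1800) -- False: 1795 lies under prefix 1800
  print (nomatch 1808 1800) -- True: differs at bit 4, above the mask bit
```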
Correct me if I'm wrong, I know little about things at this low a level.
Now, if only `shorter` could be faster or avoided in some manner... I suspect that's the only thing causing minor slowdowns for the `intersection_disj` case.
I doubt the comparison costs significantly more than the test, but benchmarking is the way to know for sure.
Okay, I found a better way to branch that avoids `shorter`. Almost every benchmark is faster or the same now! 🎉
@treeowl would you like to review? I think it's in good shape now.
It would be nice to wrap `p .|. (p - 1)` and `p .&. (p - 1)` into functions named `upper` and `lower` respectively, for clarity.
Also this should mean precise lookups can be expressed as

```haskell
trimatch p k =
  if k < p
    then if k >= lower p
           then goLeft
           else goFail
    else if k <= upper p
           then goRight
           else goFail
```

(I think this is faster than the algorithm listed in the paper: four operations in either direction, instead of four for failure plus one for picking a side.)
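A runnable sketch of this comparison-based lookup, with strings standing in for the `goLeft`/`goRight`/`goFail` branches and `upper`/`lower` defined as proposed above (the numbers reuse the 1800 example):

```haskell
import Data.Bits ((.|.), (.&.))

-- Bounds of the key range covered by combined prefix p.
upper, lower :: Int -> Int
upper p = p .|. (p - 1) -- largest key covered by p
lower p = p .&. (p - 1) -- smallest key covered by p

trimatch :: Int -> Int -> String
trimatch p k
  | k < p     = if k >= lower p then "left" else "fail"
  | otherwise = if k <= upper p then "right" else "fail"

main :: IO ()
main = mapM_ (print . trimatch 1800) [1795, 1807, 1791, 1808]
-- "left", "right", "fail", "fail"
```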
@BurningWitness

> It would be nice to wrap `p .|. (p - 1)` and `p .&. (p - 1)` into functions named `upper` and `lower` respectively, for clarity.
If that becomes the predominant way of viewing `Prefix`, then perhaps. Currently we also use `nomatch`, so that is not the case.
> Also this should mean precise lookups can be expressed as...
I like that, since it uses the same view as `mapMapBranch`, but it makes no difference on benchmarks for me. A small concern (maybe?) is that this duplicates the failure branch.
So we could switch to it, but I don't really see a good reason to do so. If it is demonstrated to be better in some way, then sure, I would favor it.
> A small concern (maybe?) is that this duplicates the failure branch.
Indeed that is the case; the failure branch needs to be let-bound with a `NOINLINE` to avoid it. It probably won't be visible in the benchmarks this way either, though.
I made a local benchmark on a `Word`-based tree clone, and indeed the `nomatch` solution is ahead performance-wise by a remarkably narrow margin, a 1-2% difference.
Hi @treeowl, please review when you get the time.
I'm ready to merge, but I want to make sure the introduction of the helpers in your last commit didn't hurt performance. Did you check the performance effect of that, or (alternatively) verify that the Core, unfolding, and unfolding guidance are the same?
Awesome 🎉
I peeked at the Core for `union` and it seemed alright. But I will run the benchmarks a final time. I also want to go over the PR once more to check for bugs, since it touches so many functions. I will update when I'm done; should be some time later today.
Ok I think this is good to go.
Updated the benchmarks. I found some odd increases in a few `intersection` tests that seemed to have originated with 82f4852, a completely unrelated change. Adding the `-fproc-alignment=64` flag (as suggested by `tasty-bench`) fixed that.
I also added the `"-with-rtsopts=-A32m"` flag from `tasty-bench` while at it. However, this greatly reduced the `Copied` stats, for both before and after. I think that column is less indicative of anything now, but I've kept it for completeness.
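For reference, these two tasty-bench-recommended flags would sit in the benchmark stanza of the cabal file roughly like this (a sketch only; the stanza name and surrounding fields are assumptions, not the actual containers cabal file):

```cabal
benchmark intmap-benchmarks
  ghc-options: -fproc-alignment=64 "-with-rtsopts=-A32m"
```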
Let me know if things look good and I'll squash it into a few commits (benchmark flag, tests, main IntMap changes).
Why would you set `-A32m` here? That seems to change what the benchmarks are measuring quite a lot. The alignment thing seems reasonable, although it's unfortunate we have to do that.
`tasty-bench` claims it makes the individual benchmarks less likely to affect one another. I thought that would be good to have as a general improvement (not to address any observed issue).
Let's hold off on that one for now. I'd like to hear opinions from others in a separate PR.
Sounds good. Removed the flag and re-ran the benchmarks.
Aaaaand we're merged! Fantastic work!
Thanks for reviewing! Next up... `IntSet` 😄
Replaces the separate `Prefix` and `Mask` fields in the `Bin` constructor with a single field that contains both. This reduces the memory required by a `Bin` from 5 to 4 words, at the cost of more computations (which are cheap bitwise ops) being necessary for certain operations.

Implements #991 for `IntMap` only. The same can be done for `IntSet` if this is accepted.

### Memory
Concretely, this reduces the memory required by an `IntMap` by ~12.5% (including the keys and values).

Calculations: for a map with `n` key-values, there are `n` `Tip`s costing 3 words each and `n-1` `Bin`s costing 5 words each before this change and 4 words after. So we save about 1 out of 8 words per key-value.
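The word counts above can be checked with a little arithmetic (assuming, as stated, 3-word `Tip`s and 5-vs-4-word `Bin`s):

```haskell
-- Words used by an IntMap of n key-values: n Tips plus (n-1) Bins.
wordsBefore, wordsAfter :: Int -> Int
wordsBefore n = 3 * n + 5 * (n - 1)
wordsAfter  n = 3 * n + 4 * (n - 1)

-- Fraction of words saved; approaches 1/8 = 12.5% as n grows.
savings :: Int -> Double
savings n =
  1 - fromIntegral (wordsAfter n) / fromIntegral (wordsBefore n)

main :: IO ()
main = print (savings 1000000)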
### Compatibility

This PR, in its current state, makes breaking changes to IntMap internals, which is reflected in the exports of the internal module `Data.IntMap.Internal`. There is no change in the exports of any other module.

### Benchmarks
Benchmarks done with GHC 9.6.3. Last updated: 7e2ca1c. The output is from a hacky script I put together that compares `tasty-bench`'s CSV files.

Benchmark command: `cabal run <target> -- --csv <csv> +RTS -T`
- `intmap-benchmarks`
- `set-operations-intmap`
Allocations have gone down in every benchmark, which is expected but quite pleasant to see. Runtime has gone down in most benchmarks and gone up in a few.
### Notable time regressions

The only significant regressions (>10%) I see are in the `union-disj` and `intersection-disj` set operations. However, I don't consider this a problem, because these operations on the specific disjoint maps used in the tests wrap up in $O(W)$, due to the left map being `[1..n]` and the right being `[n+1..n+x]`. This is measuring the cost of the new set of bitwise ops vs the old one, which is unsurprisingly a little slower.

Edit (58cbc6a): `union-disj` is better now.
Edit (4b708ba): `intersection-disj` is comparable now.

Let me know if anything else stands out as troublesome.