haskell / containers

Assorted concrete container types
https://hackage.haskell.org/package/containers
314 stars 177 forks source link

Optimize IntMap.Bin #995

Closed meooow25 closed 3 months ago

meooow25 commented 4 months ago

Replaces the separate Prefix and Mask fields in the Bin constructor with a single field that contains both.

This reduces the memory required by a Bin from 5 to 4 words, at the cost of more computations (which are cheap bitwise ops) being necessary for certain operations.

Implements #991 for IntMap only. The same can be done for IntSet if this is accepted.

Memory

Concretely, this reduces the memory required by an IntMap by ~12.5% (including the keys and values).

Calculations: For a map with n key-values, there are n Tips costing 3 words each and n-1 Bins costing 5 words each before this change and 4 words after. So we save about 1 out of 8 words per key-value.

Compatibility

This PR, in its current state, makes breaking changes to IntMap internals which is reflected in the exports of the internal module Data.IntMap.Internal. There is no change in the exports of any other module.

Benchmarks

Benchmarks done with GHC 9.6.3. Last updated: 7e2ca1c The output is from a hacky script I put together that compares tasty-bench's csv files.

Benchmark command: cabal run <target> -- --csv <csv> +RTS -T

intmap-benchmarks

Name                          Time - - - - - - - -    Allocated - - - - -     Copied - - - - - - -
                                   A       B     %         A       B     %         A       B     %
alter                         472 μs  421 μs  -10%    2.2 MB  1.8 MB  -18%     71 KB   50 KB  -30%
delete                        103 μs   94 μs   -9%    956 KB  763 KB  -20%    150 B   116 B   -22%
foldlWithKey                  434 μs  438 μs   +0%    2.1 MB  1.7 MB  -19%     64 KB   49 KB  -23%
foldlWithKey'                  12 μs   12 μs   -2%      0 B     0 B             0 B     0 B
foldrWithKey                   23 ns   24 ns   +3%    463 B   463 B    +0%      0 B     0 B
fromAscList                    49 μs   46 μs   -6%    256 KB  223 KB  -12%     14 KB  6.1 KB  -56%
fromDistinctAscList            46 μs   43 μs   -5%    256 KB  223 KB  -12%     14 KB  6.1 KB  -56%
fromList                      114 μs   97 μs  -15%    1.0 MB  861 KB  -18%     36 KB   26 KB  -28%
insert                        111 μs   99 μs  -10%    1.0 MB  861 KB  -18%     36 KB   25 KB  -28%
insertLookupWithKey empty     506 μs  527 μs   +4%    4.0 MB  3.7 MB   -8%    153 KB  112 KB  -26%
insertLookupWithKey update    1.7 ms  1.7 ms   +2%    9.6 MB  8.8 MB   -8%    1.1 MB  1.0 MB   -9%
insertWith empty              116 μs  106 μs   -8%    1.0 MB  863 KB  -18%     36 KB   26 KB  -28%
insertWith update             483 μs  444 μs   -8%    2.3 MB  1.9 MB  -17%    133 KB   93 KB  -29%
insertWith' empty             123 μs  113 μs   -8%    1.0 MB  861 KB  -18%     36 KB   25 KB  -29%
insertWith' update            487 μs  445 μs   -8%    2.2 MB  1.8 MB  -17%     90 KB   65 KB  -28%
insertWithKey empty           118 μs  106 μs  -10%    1.0 MB  863 KB  -17%     36 KB   26 KB  -27%
insertWithKey update          477 μs  446 μs   -6%    2.3 MB  1.9 MB  -17%    120 KB   96 KB  -20%
insertWithKey' empty          124 μs  110 μs  -11%    1.0 MB  861 KB  -18%     36 KB   26 KB  -27%
insertWithKey' update         490 μs  447 μs   -8%    2.2 MB  1.8 MB  -17%     90 KB   67 KB  -24%
lookup_half                   164 μs  110 μs  -33%      0 B     0 B             0 B     0 B
lookup_hits                   165 μs  161 μs   -2%      0 B     0 B             0 B     0 B
lookup_misses                 164 μs  162 μs   -1%      0 B     0 B             0 B     0 B
lookup_mixed                   17 μs   22 μs  +30%      0 B     0 B             0 B     0 B
lookup_most                   157 μs  149 μs   -5%      0 B     0 B             0 B     0 B
map                            31 μs   29 μs   -5%    351 KB  319 KB   -9%     15 KB   13 KB  -16%
mapMaybe                       55 μs   56 μs   +2%    275 KB  255 KB   -6%    5.9 KB  4.9 KB  -16%
mapMaybeWithKey                54 μs   56 μs   +3%    277 KB  255 KB   -7%    5.8 KB  5.0 KB  -15%
mapWithKey                     37 μs   35 μs   -4%    414 KB  383 KB   -7%     20 KB   18 KB  -12%
minView                        45 ns   32 ns  -30%    119 B   103 B   -13%      0 B     0 B
spanAntitone                  112 ns  125 ns  +11%    749 B   655 B   -12%      0 B     0 B
split                          47 ns   45 ns   -3%    486 B   399 B   -17%      0 B     0 B
splitLookup                    50 ns   48 ns   -3%    511 B   422 B   -17%      0 B     0 B
update                        441 μs  420 μs   -4%    2.1 MB  1.7 MB  -17%     68 KB   48 KB  -28%
updateLookupWithKey           901 μs  826 μs   -8%    4.4 MB  3.6 MB  -16%    167 KB  100 KB  -40%

set-operations-intmap

Name                           Time - - - - - - - -    Allocated - - - - -     Copied - - - - - - -
                                    A       B     %         A       B     %         A       B     %
difference-block_nn             25 μs   22 μs  -10%     88 KB   71 KB  -19%    1.0 KB  666 B   -34%
difference-block_nn_swap        25 μs   22 μs   -9%     88 KB   71 KB  -19%    977 B   660 B   -32%
difference-block_ns            2.2 μs  2.0 μs   -8%     13 KB   11 KB  -20%     28 B    16 B   -42%
difference-block_sn_swap       1.9 μs  1.7 μs   -7%    8.8 KB  7.1 KB  -19%     12 B     8 B   -33%
difference-common_nn           1.1 ms  903 μs  -14%    1.9 MB  1.5 MB  -19%    416 KB  297 KB  -28%
difference-common_nn_swap      465 μs  437 μs   -6%      0 B     0 B             0 B     0 B
difference-common_ns           361 μs  313 μs  -13%    1.2 MB  941 KB  -21%    173 KB  107 KB  -38%
difference-common_nt            33 μs   28 μs  -13%    102 KB   81 KB  -20%    1.3 KB  841 B   -36%
difference-common_sn_swap      153 μs  134 μs  -12%      0 B     0 B             0 B     0 B
difference-common_tn_swap       14 μs   12 μs  -19%     44 B    54 B   +22%      0 B     0 B
difference-disj_nn              57 ns   55 ns   -3%    294 B   247 B   -15%      0 B     0 B
difference-disj_nn_swap         64 ns   64 ns   +0%    471 B   383 B   -18%      0 B     0 B
difference-disj_ns              53 ns   51 ns   -4%    294 B   247 B   -15%      0 B     0 B
difference-disj_nt              49 ns   45 ns   -7%    294 B   247 B   -15%      0 B     0 B
difference-disj_sn_swap         58 ns   56 ns   -2%    390 B   319 B   -18%      0 B     0 B
difference-disj_tn_swap         47 ns   47 ns   +0%    271 B   223 B   -17%      0 B     0 B
difference-mix_nn              6.1 ms  4.0 ms  -35%    6.1 MB  5.2 MB  -14%     10 MB  5.3 MB  -47%
difference-mix_nn_swap         5.8 ms  4.6 ms  -19%    6.0 MB  5.3 MB  -11%    9.6 MB  6.6 MB  -31%
difference-mix_ns              322 μs  279 μs  -13%    1.3 MB  1.0 MB  -20%    205 KB  138 KB  -32%
difference-mix_nt               24 μs   21 μs  -14%    102 KB   81 KB  -20%    1.3 KB  883 B   -34%
difference-mix_sn_swap         191 μs  175 μs   -8%    623 KB  543 KB  -12%     47 KB   36 KB  -23%
difference-mix_tn_swap          12 μs  9.2 μs  -23%     19 KB   17 KB  -12%     51 B    41 B   -19%
intersection-block_nn           13 μs   11 μs  -13%      0 B     0 B             0 B     0 B
intersection-block_nn_swap      13 μs   11 μs  -10%      0 B     0 B             0 B     0 B
intersection-block_ns          1.3 μs  1.3 μs   -4%     25 B     0 B   -100%      0 B     0 B
intersection-block_sn_swap     1.2 μs  1.2 μs   -1%      0 B     0 B             0 B     0 B
intersection-common_nn         916 μs  761 μs  -16%    1.9 MB  1.5 MB  -20%    440 KB  299 KB  -31%
intersection-common_nn_swap    880 μs  759 μs  -13%    1.9 MB  1.5 MB  -20%    439 KB  299 KB  -31%
intersection-common_ns         183 μs  176 μs   -3%    354 KB  278 KB  -21%     15 KB  9.9 KB  -32%
intersection-common_nt          13 μs   14 μs  +13%     12 KB  9.7 KB  -18%     22 B    13 B   -40%
intersection-common_sn_swap    186 μs  171 μs   -8%    351 KB  278 KB  -20%     15 KB  9.9 KB  -31%
intersection-common_tn_swap     13 μs   14 μs   +3%     12 KB  8.8 KB  -28%     21 B    13 B   -38%
intersection-disj_nn            40 ns   36 ns   -9%     31 B    31 B    +0%      0 B     0 B
intersection-disj_nn_swap       38 ns   40 ns   +4%     31 B    31 B    +0%      0 B     0 B
intersection-disj_ns            35 ns   33 ns   -5%     31 B    31 B    +0%      0 B     0 B
intersection-disj_nt            32 ns   31 ns   -2%     31 B    31 B    +0%      0 B     0 B
intersection-disj_sn_swap       34 ns   37 ns   +7%     31 B    31 B    +0%      0 B     0 B
intersection-disj_tn_swap       31 ns   32 ns   +3%     31 B    31 B    +0%      0 B     0 B
intersection-mix_nn            857 μs  769 μs  -10%      0 B     0 B             0 B     0 B
intersection-mix_nn_swap       823 μs  794 μs   -3%      0 B     0 B             0 B     0 B
intersection-mix_ns            127 μs  117 μs   -7%      0 B     0 B             0 B     0 B
intersection-mix_nt            7.8 μs  6.3 μs  -19%      0 B    31 B             0 B     0 B
intersection-mix_sn_swap       121 μs  113 μs   -6%      0 B     0 B             0 B     0 B
intersection-mix_tn_swap       8.7 μs  7.7 μs  -11%      0 B     0 B             0 B     0 B
union-block_nn                  33 μs   29 μs  -12%    177 KB  142 KB  -19%    3.9 KB  2.5 KB  -35%
union-block_nn_swap             33 μs   28 μs  -13%    178 KB  142 KB  -20%    3.9 KB  2.5 KB  -35%
union-block_ns                 2.7 μs  2.4 μs  -10%     22 KB   18 KB  -20%     69 B    44 B   -36%
union-block_sn_swap            2.7 μs  2.5 μs  -10%     22 KB   18 KB  -20%     69 B    42 B   -39%
union-common_nn                1.6 ms  1.3 ms  -18%    3.8 MB  3.0 MB  -21%    1.8 MB  1.2 MB  -35%
union-common_nn_swap           1.6 ms  1.3 ms  -16%    3.8 MB  3.0 MB  -21%    1.7 MB  1.2 MB  -32%
union-common_ns                388 μs  353 μs   -8%    1.5 MB  1.2 MB  -19%    288 KB  189 KB  -34%
union-common_nt                 30 μs   27 μs  -11%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -36%
union-common_sn_swap           387 μs  338 μs  -12%    1.5 MB  1.2 MB  -20%    286 KB  188 KB  -34%
union-common_tn_swap            31 μs   25 μs  -17%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -36%
union-disj_nn                   80 ns   64 ns  -19%    750 B   607 B   -19%      0 B     0 B
union-disj_nn_swap              78 ns   73 ns   -6%    748 B   607 B   -18%      0 B     0 B
union-disj_ns                   72 ns   57 ns  -19%    671 B   543 B   -19%      0 B     0 B
union-disj_nt                   60 ns   48 ns  -19%    549 B   447 B   -18%      0 B     0 B
union-disj_sn_swap              76 ns   73 ns   -3%    671 B   541 B   -19%      0 B     0 B
union-disj_tn_swap              61 ns   56 ns   -8%    550 B   447 B   -18%      0 B     0 B
union-mix_nn                   9.2 ms  6.5 ms  -28%    7.6 MB  6.1 MB  -19%     17 MB  9.8 MB  -41%
union-mix_nn_swap              9.2 ms  6.3 ms  -31%    7.6 MB  6.0 MB  -20%     17 MB  9.3 MB  -44%
union-mix_ns                   431 μs  384 μs  -10%    1.7 MB  1.3 MB  -19%    352 KB  268 KB  -23%
union-mix_nt                    29 μs   24 μs  -18%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -37%
union-mix_sn_swap              423 μs  341 μs  -19%    1.7 MB  1.3 MB  -19%    352 KB  247 KB  -29%
union-mix_tn_swap               28 μs   23 μs  -16%    114 KB   91 KB  -20%    1.6 KB  1.0 KB  -35%

Allocations have gone down in every benchmark, which is expected but quite pleasant to see. Runtime has gone down in most benchmarks and gone up in a few.

Notable time regressions

The only significant regressions (>10%) I see are in the union-disj and intersection-disj set operations. However I don't consider this a problem because these operations on the specific disjoint maps used in the tests wrap up in $O(W)$, due to one the left map being [1..n] and the right being [n+1...n+x]. This is measuring the cost of one set of new bitwise ops vs old bitwise ops, which is unsurprisingly a little slower.
Edit (58cbc6a): union-disj is better now. Edit (4b708ba): intersection-disj is comparable now.

Let me know if anything else stands out as troublesome.

treeowl commented 4 months ago

This is most impressive. I will do my best to review it promptly. In the mean time, I have a big favor to ask: I was struggling the other day to merge the branch for version 0.7 into master. Do you think you could try to figure out how to do that right?

BurningWitness commented 4 months ago

(prefix + mask) is just a part of the key with an additional mask bit set, could this be abused for performance?

With a key and a compatible prefix, xor only yields a value at or below the level of the masking bit, whether or not the value is at the masking bit determines the direction (0 is right).

    1800 | 0000011100001000
xor 1807 | 0000011100001111
---------+-----------------
         | 0000000000000111

    1800 | 0000011100001000
xor 1795 | 0000011100001000
---------+-----------------
         | 0000000000001011

So the two comparison patterns could be reduced to the following:

-- Single-key operation with eager failure
trimatch p k =
  let o = xor p k
  in case () of
       () | o .&. p == 0                = goRight
          | unsafeShiftR o 1 .&. p == 0 = goLeft
          | otherwise                   = goFail

-- Merge
quadbranch p1 p2 =
  let o = xor p k `unsafeShiftR` 1
  in case () of
       () | p1 == p2      = goEqual
          | o .&. p1 == 0 = goBelowP1
          | o .&. p2 == 0 = goBelowP2
          | otherwise     = goDiffer
meooow25 commented 4 months ago

@treeowl I assume you mean #994. I don't know much about the CI setup but I'll take a look.

With a key and a compatible prefix, xor only yields a value at or below the level of the masking bit, whether or not the value is at the masking bit determines the direction (0 is right).

@BurningWitness it is unclear to me how that would work. For instance, xor p k .&. p == 0, which is the same as k .&. p == p, only tells us whether p's 1-bits agree with k. But we also need to know if it the 0-bits agree.
Probably worth thinking about it though, it would be great to find a more efficient set of bitwise ops to do the job.

BurningWitness commented 4 months ago

Oh, true, xor only has that guarantee if prefix matches the key, so I was wrong.

meooow25 commented 4 months ago

Oh, true, xor only has that guarantee if prefix matches the key, so I was wrong.

Ah well if the prefix already matches, the branch into left or right is done with one comparison so it's hard to imagine something faster than that.
What I'd like is to find a better way for the nomatch stuff.

meooow25 commented 4 months ago

I believe nomatch should be pretty much as fast as the old one now! (thanks gcc and godbolt)

Previous: i .&. (m `xor` (-m)) == p — 1 and, 1 xor, 1 negate, 1 compare New: (i `xor` p) .&. (p `xor` (-p)) — 1 and, 2 xor, 1 negate

Correct me if I'm wrong, I know little about things at this low a level.


Now, if only shorter could be faster or avoided in some manner... I suspect that's the only thing causing minor slowdowns for the intersection_disj case.

treeowl commented 4 months ago

I doubt the comparison costs significantly over the test, but benchmarking is the way to know for sure.

meooow25 commented 4 months ago

Okay I found a better way to branch that avoids shorter. Almost every benchmark is faster or the same now! 🎉

@treeowl would you like to review? I think it's in a good shape now.

BurningWitness commented 4 months ago

It would be nice to wrap p .|. (p - 1) and p .&. (p - 1) into functions named upper and lower respectively for clarity.


Also this should mean precise lookups can be expressed as

trimatch p k =
  if k < p
    then if k >= lower p
           then goLeft
           else goFail

    else if k <= upper p
           then goRight
           else goFail

(I think this is faster than the algorithm listed in the paper, four operations in any direction instead of four for failure plus one for picking a side)

meooow25 commented 4 months ago

@BurningWitness

It would be nice to wrap p .|. (p - 1) and p .&. (p - 1) into functions named upper and lower respectively for clarity.

If that becomes the predominant way of viewing Prefix, then perhaps. Currently we also use nomatch so that is not true.

Also this should mean precise lookups can be expressed as...

I like that, since it uses the same view as mapMapBranch, but it makes no difference on benchmarks for me. A small concern (maybe?) is that this duplicates the failure branch.
So we could switch to it, but I don't really see a good reason to do so. If it is demonstrated to be better in some way, then sure I would favor it.

BurningWitness commented 4 months ago

A small concern (maybe?) is that this duplicates the failure branch.

Indeed that is the case, the failure branch needs to be let-bound with a NOINLINE to avoid it. Probably won't be visible on the benchmarks this way either though.

BurningWitness commented 4 months ago

I made a local benchmark on a Word-based tree clone and indeed the nomatch solution is ahead performance-wise by a remarkably narrow margin, 1-2% difference.

meooow25 commented 3 months ago

Hi @treeowl, please review when you get the time.

treeowl commented 3 months ago

I'm ready to merge, but I want to make sure the introduction of the helpers in your last commit didn't hurt performance. Did you check the performance effect of that, or (alternatively) verify that the Core, unfolding, and unfolding guidance are the same?

meooow25 commented 3 months ago

Awesome 🎉 I peeked at the Core for union and it seemed alright. But I will run the benchmarks a final time. I also want to go over the PR once to check for bugs, since it touches so many functions. I will update when I'm done, should be some time later today.

meooow25 commented 3 months ago

Ok I think this is good to go.

Updated the benchmarks. I found some odd increases in a few intersection tests that seemed to have originated with 82f4852, a completely unrelated change. Adding a -fproc-alignment=64 flag (as suggested by tasty-bench) fixed that.

I also added the "-with-rtsopts=-A32m" flag from tasty-bench while at it. This greatly reduced the Copied stats however, for both before and after. I think that column is less indicative of anything now, but I've kept it for the sake of it.

Let me know if things look good and I'll squash it into a few commits (benchmark flag, tests, main IntMap changes).

treeowl commented 3 months ago

Why would you set -A32m here? That seems to change what the benchmarks are measuring quite a lot. The alignment thing seems reasonable, although it's unfortunate we have to do that.

meooow25 commented 3 months ago

tasty-bench claims it makes the individual benchmarks less likely to affect one another. I thought that would be good to have as a general improvement (and not to address by any observed issue).

treeowl commented 3 months ago

Let's wait with that one for now. I'd like to hear opinions from others on a separate PR.

meooow25 commented 3 months ago

Sounds good. Removed the flag and re-ran the benchmarks.

treeowl commented 3 months ago

Aaaaand we're merged! Fantastic work!

meooow25 commented 3 months ago

Thanks for reviewing! Next up... IntSet 😄