rust-lang / hashbrown

Rust port of Google's SwissTable hash map
https://rust-lang.github.io/hashbrown
Apache License 2.0
2.42k stars 283 forks source link

Was swap-remove behavior ever considered when removing entries? #503

Open matthieu-m opened 8 months ago

matthieu-m commented 8 months ago

Motivation

One of the scenarios were Swiss Tables tend NOT to fare too well in are scenarios involving deletions.

A Swiss Table in which entries are only inserted -- a fairly common usecase, for example dumping all the HTTP parameters or headers of a an incoming request -- will have contiguous entries in each group of 16. This leads to good cache line density, and helps in reducing cache misses.

On the other hand, once an entry is deleted, a hole is left in the run of entries of the group it was deleted from -- unless it was last, of course -- and in the presence of a good hash, further insertions into the table will likely end up in different groups for a while.

This leads to tables in which a significant number of deletions have been performed looking like... Swiss Cheese. Appropriate, I suppose, but not cache-friendly.

Backward Shifting Deletion

Prior to hashbrown, the standard hashmap in Rust used Robin Hood with Backward Shifting Deletion: upon deletion, any entry immediately following the deleted entry (recursively) which had been displaced due to the now deleted entry, was shifted back. This helped avoiding tombstones in the probe sequence, and led to good cache density.

The cost of shifting all those entries back is however none too good. Deletion is not really O(1) any longer when an unspecified number of elements may need to be shifted back.

So I would not recommend backward shifting deletion...

Swap Remove

But what about swap_remove?

The idea would be that if the entry deleted from a group is not the last entry of the group, then that last entry is swapped in the now available spot. Just like Vec::swap_remove this is O(1), while preserving the contiguity of the entries in the group.

This is a slight additional cost for deletion, in exchange for better cache density for the entire period of time following the deletion up until a new element is inserted back that would have plugged the hole.

Is it worth it?

I don't know! I've never heard of it being used.

Before I try to implement the behavior, however, I thought I'd ask if:

  1. I misread the code, and it's actually already implemented.
  2. This idea was already considered, and discarded for some reason.
  3. Whether it seems worthy of consideration, or I'm heading into a hopeless direction...

I am quite keen on hearing folks thoughts on the matter.

matthieu-m commented 8 months ago

Comment from joaquintides, general hash map guru, and more recently one of the authors of boost::unordered_flat_map:

I haven't observed that situation you describe. Take into account however that "holes" will be reused upon new insertions, so it's not like they keep growing indefinitely. Also, in Abseil the natural distribution of occupied slots and holes is quite random to begin with, see slide 29 of

https://raw.githubusercontent.com/joaquintides/usingstdcpp2023/main/More%20than%20a%20rehash.pdf

boost::unordered_flat_map, otoh, tends to have clustered elements (which is good for cache locality), so relocation upon erasure could be beneficial, but that would impose an extra overhead on erase and invalidate iterators to the relocated elements. Some hash tables do this, for instance ankerl::unordered_dense::map.

So, as expected, my idea isn't original, though it could potentially be beneficial.

Amanieu commented 8 months ago

I'm hesitant to add this because it adds an extra cost when removing an element, and the benefit is not obvious. I would be more inclined to accept this if you have a benchmark that shows a clear performance benefit.

matthieu-m commented 8 months ago

I submitted https://github.com/rust-lang/hashbrown/pull/507 as an attempt to measure the performance loss from leaving holes behind.

I can't be certain that the benchmark is doing the right thing. In particular, I attempted to construct benchmarks which precisely place elements right where I'd want them for maximum effect by exploiting the particulars of the implementation, and it's a bit complicated to judge whether I was successful.

The benchmark results, on my machine (11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz) are as follow:

test swiss_cheese_empty_cluster_l2_quarter          ... bench:     107,253 ns/iter (+/- 2,729)
test swiss_cheese_empty_random_l2_quarter           ... bench:     109,898 ns/iter (+/- 6,124)
test swiss_cheese_lone_cluster_l2_quarter           ... bench:   4,141,660 ns/iter (+/- 245,388)
test swiss_cheese_lone_random_l2_quarter            ... bench:   4,106,439 ns/iter (+/- 63,492)
test swiss_cheese_eigth_cluster_l2_quarter          ... bench:   8,329,706 ns/iter (+/- 603,492)
test swiss_cheese_eigth_random_l2_quarter           ... bench:   8,366,143 ns/iter (+/- 841,366)
test swiss_cheese_quarter_cluster_l2_quarter        ... bench:  17,091,345 ns/iter (+/- 851,073)
test swiss_cheese_quarter_random_l2_quarter         ... bench:  16,941,942 ns/iter (+/- 734,029)
test swiss_cheese_half_cluster_l2_quarter           ... bench:  33,936,353 ns/iter (+/- 2,089,683)
test swiss_cheese_half_random_l2_quarter            ... bench:  33,779,245 ns/iter (+/- 901,914)
test swiss_cheese_full_cluster_l2_quarter           ... bench:  66,712,969 ns/iter (+/- 2,283,823)
test swiss_cheese_full_random_l2_quarter            ... bench:  67,051,106 ns/iter (+/- 1,346,535)

test swiss_cheese_empty_cluster_l2_three_quarters   ... bench:      35,830 ns/iter (+/- 1,213)
test swiss_cheese_empty_random_l2_three_quarters    ... bench:      35,789 ns/iter (+/- 1,370)
test swiss_cheese_lone_cluster_l2_three_quarters    ... bench:     889,015 ns/iter (+/- 32,123)
test swiss_cheese_lone_random_l2_three_quarters     ... bench:     889,737 ns/iter (+/- 70,205)
test swiss_cheese_eigth_cluster_l2_three_quarters   ... bench:   1,811,351 ns/iter (+/- 176,593)
test swiss_cheese_eigth_random_l2_three_quarters    ... bench:   1,813,937 ns/iter (+/- 124,895)
test swiss_cheese_quarter_cluster_l2_three_quarters ... bench:   3,679,451 ns/iter (+/- 225,465)
test swiss_cheese_quarter_random_l2_three_quarters  ... bench:   3,664,752 ns/iter (+/- 118,518)
test swiss_cheese_half_cluster_l2_three_quarters    ... bench:   7,550,932 ns/iter (+/- 707,679)
test swiss_cheese_half_random_l2_three_quarters     ... bench:   7,335,543 ns/iter (+/- 453,063)
test swiss_cheese_full_cluster_l2_three_quarters    ... bench:  14,246,891 ns/iter (+/- 464,697)
test swiss_cheese_full_random_l2_three_quarters     ... bench:  14,968,833 ns/iter (+/- 724,493)

test swiss_cheese_empty_cluster_l2_single           ... bench:      26,941 ns/iter (+/- 1,121)
test swiss_cheese_empty_random_l2_single            ... bench:      26,936 ns/iter (+/- 1,546)
test swiss_cheese_lone_cluster_l2_single            ... bench:     575,818 ns/iter (+/- 24,934)
test swiss_cheese_lone_random_l2_single             ... bench:     576,248 ns/iter (+/- 29,015)
test swiss_cheese_eigth_cluster_l2_single           ... bench:   1,201,152 ns/iter (+/- 98,435)
test swiss_cheese_eigth_random_l2_single            ... bench:   1,222,738 ns/iter (+/- 145,428)
test swiss_cheese_quarter_cluster_l2_single         ... bench:   2,464,260 ns/iter (+/- 191,353)
test swiss_cheese_quarter_random_l2_single          ... bench:   2,402,873 ns/iter (+/- 135,769)
test swiss_cheese_half_cluster_l2_single            ... bench:   4,849,259 ns/iter (+/- 534,081)
test swiss_cheese_half_random_l2_single             ... bench:   4,831,783 ns/iter (+/- 449,949)
test swiss_cheese_full_cluster_l2_single            ... bench:   9,451,680 ns/iter (+/- 457,181)
test swiss_cheese_full_random_l2_single             ... bench:   9,500,475 ns/iter (+/- 372,736)

test swiss_cheese_empty_cluster_l2_double           ... bench:      13,509 ns/iter (+/- 362)
test swiss_cheese_empty_random_l2_double            ... bench:      13,595 ns/iter (+/- 832)
test swiss_cheese_lone_cluster_l2_double            ... bench:     216,011 ns/iter (+/- 10,948)
test swiss_cheese_lone_random_l2_double             ... bench:     216,045 ns/iter (+/- 16,825)
test swiss_cheese_eigth_cluster_l2_double           ... bench:     495,331 ns/iter (+/- 58,196)
test swiss_cheese_eigth_random_l2_double            ... bench:     448,442 ns/iter (+/- 25,042)
test swiss_cheese_quarter_cluster_l2_double         ... bench:     959,318 ns/iter (+/- 68,571)
test swiss_cheese_quarter_random_l2_double          ... bench:     941,993 ns/iter (+/- 57,970)
test swiss_cheese_half_cluster_l2_double            ... bench:   1,870,896 ns/iter (+/- 114,250)
test swiss_cheese_half_random_l2_double             ... bench:   1,883,339 ns/iter (+/- 192,821)
test swiss_cheese_full_cluster_l2_double            ... bench:   3,766,748 ns/iter (+/- 106,637)
test swiss_cheese_full_random_l2_double             ... bench:   3,756,684 ns/iter (+/- 228,687)

Where:

Due to the structure of the benchmark, there's simply more elements as the element size is lower, in order to get to 16 times the L2 cache size, so benchmarks cannot be compared between element sizes.

Within an element size, however, the trend is fairly clear:

The fact that cluster yields no significant advantage over random indicates, to me, that there doesn't seem to be much of an impact from cache misses here.

I would appreciate a review of the strategy and implementation, if you have time.

Unless I made a mistake -- which is not all that unlikely -- there doesn't seem to be any reason to try to improve locality of elements within a group any further, however, and thus this issue is moot.

Amanieu commented 7 months ago

I don't expect cache locality to have much of an impact in practice because the access pattern of hash table lookup is effectively random due to the hash function. Additionally, due to the way we scan control bytes with SIMD instructions, we will immediately "seek" to the first entry with a matching H2 hash (matching top 7 bits of the hash value) without looking at other elements. This means that unless we get a false positive on the H2 value (<1% chance per lookup) then the layout of elements within a group is irrelevant because we are only ever accessing a single element in that group.

The only case where this could conceivably make a difference might be iterating over all the elements of a hash table, but this is a relatively rare use case that we shouldn't be optimizing for: removing an entry is a much more common operation than iteration.

Amanieu commented 7 months ago

https://raw.githubusercontent.com/joaquintides/usingstdcpp2023/main/More%20than%20a%20rehash.pdf

It may actually be worth looking into the design of boost's new hash table to see if it's worth adopting it instead of SwissTable for hashbrown. They claim that it's faster in benchmarks.

matthieu-m commented 7 months ago

The only case where this could conceivably make a difference might be iterating over all the elements of a hash table, but this is a relatively rare use case that we shouldn't be optimizing for: removing an entry is a much more common operation than iteration.

I don't think iteration should be prioritized either.

I don't expect cache locality to have much of an impact in practice because the access pattern of hash table lookup is effectively random due to the hash function.

I disagree with the assessment.

In essence, we can think of hash-table access as randomly accessing elements of an array. Let's imagine two access patterns:

Both access patterns access the same number of elements, but the first accesses twice as less cache-lines as the second.

If the access are infrequent, then it shouldn't matter indeed: whatever cache-line was cached is most likely long gone. If accesses are very frequent, I would expect the cache lines to be cached, unless there's too many of them for the cache, and thus I'd expect an access pattern which minimizes the number of accessed cache lines to also minimize the number of cache misses.

Benchmarks seem to disprove the effect -- so either I'm bad at intuiting, or I'm bad at designing benchmarks.

This means that unless we get a false positive on the H2 value (<1% chance per lookup) then the layout of elements within a group is irrelevant because we are only ever accessing a single element in that group.

Despite being random, the accesses are clustered in time, which should lead to cache hits for a certain proportion of elements.

Additionally, due to the way we scan control bytes with SIMD instructions, we will immediately "seek" to the first entry with a matching H2 hash (matching top 7 bits of the hash value) without looking at other elements.

The effect of H2 is quite different depending on scenarios.

In particular, the scenario benchmarked in the above PR was for known-present elements, with distinct H2 within a group, in which case the only benefit of H2 is in helping locating exact offset of the element. No benefit for "skipping" elements -- as found in looking up absent elements, or "colliding" present elements -- should be seen in such a case.

matthieu-m commented 7 months ago

https://raw.githubusercontent.com/joaquintides/usingstdcpp2023/main/More%20than%20a%20rehash.pdf

It may actually be worth looking into the design of boost's new hash table to see if it's worth adopting it instead of SwissTable for hashbrown. They claim that it's faster in benchmarks.

There are two key differences, from what I can tell, which I would guess are independent from one another. See slide 24 (Mechanics):

  1. Instead of using a residual (h2) in the [0, 127] range, they use one in the [2, 255] range, nearly doubling the residual value space, and thus reducing the collision rate by a factor of about x2.
  2. They use groups of 15 elements, instead of 16, and use the leading metadata byte as a bit-indexed overflow marker. The advantage of the overflow marker is that it may allow to "abort" the quadratic probing sequence even if all other slots of the groups are non-empty, and the advantage of the bit-index overflow marker is that it maintains 8 markers in 1 byte, so that a lone element overflowing only leads to pursuing the quadratic probing sequence in 1/8th of the look-ups.

I have no idea which of the two differences could have the bigger impact.

With that said, the first is easy to try, the second, not so much.


The logic for the first can be found at https://github.com/boostorg/unordered/blob/develop/include/boost/unordered/detail/foa/core.hpp#L335-L390 for SSE2, and a bit further down for Neon.

In short:

The 8 seems fairly arbitrary, though I do note that it means that for the bit-marker of overflow choosing a multiple of 8 maintains the "group" the residual would fall in, so that all 256 values are equally split across the 8 bits of the overflow marker.

ktprime commented 6 months ago

boost's flat hash map is quite faster than swiss table from more than 10 different third-party benchmarks. c++ hash benchmark.

matthieu-m commented 6 months ago

@ktprime Yes, unfortunately that's not that helpful.

Firstly, hashbrown is inspired by Swiss Table, but not an exact port as far as I can tell. This means that while at a high-level it roughly follows Swiss Table, in a benchmark its performance is not necessarily quite the same, so the benchmarks provided against the C++ implementation of Swiss Table are not an exact match.

Still, I've been investigating the most obvious differences (254 values residuals and overflow tracking) in an attempt to check whether they could be improving performance. Overall, though, the patches are not that great performance wise. It's not that unexpected, as both only offer a performance boost is relatively rare situations, while having a potential cost for every element.

Unfortunately, my rough attempts did not show any great improvement. Performance is roughly on par, with often a more regressions than improvements. This suggests that while in theory they could improve performance, in practice they may require significant micro-optimization efforts, or more large-scale integration changes, for those improvements to kick in. But it's hard to guess which path would be more fruitful, and costly (time-wise) to investigate them all.

Still, the PRs are out there for anyone who wishes to investigate them further:

ktprime commented 6 months ago

I agree with you. I have also ported a Swiss table implementation, which has slightly slower performance compared to the Boost version.

Most benchmarks consist of simple loops that focus on testing find, hit/miss, and insertion operations. However, in real-world code, the scenario is different, especially when metadata is frequently accessed and needs to be kept hot in the cache. In such cases, I prefer to use a flat hash map where the key-value pair and metadata are located in the same slot, even if its performance may not be as good as the benchmark results suggest.

Based on my benchmark results, if the data is fully located in the CPU cache or if the metadata cannot fully reside in the cache (e.g., when the size of the hash map is less than 10,000 or greater than 500,000,000), a hash map based on SIMD does not have any priority over other hash map designs.