Open matthieu-m opened 6 months ago
@Amanieu Would it be possible to benchmark on SSE2?
I'd like to see if it's worth it before trying to make the generic and neon targets pass (especially the generic one; all that bit fiddling is a bit involved).
Here are my benchmark results:
name old.txt ns/iter new.txt ns/iter diff ns/iter diff % speedup
clone_from_large 5,064 5,170 106 2.09% x 0.98
clone_from_small 46 46 0 0.00% x 1.00
clone_large 5,141 5,220 79 1.54% x 0.98
clone_small 61 62 1 1.64% x 0.98
grow_insert_ahash_highbits 19,525 20,297 772 3.95% x 0.96
grow_insert_ahash_random 19,889 20,877 988 4.97% x 0.95
grow_insert_ahash_serial 19,530 20,794 1,264 6.47% x 0.94
grow_insert_std_highbits 36,006 36,842 836 2.32% x 0.98
grow_insert_std_random 35,998 36,841 843 2.34% x 0.98
grow_insert_std_serial 35,786 36,865 1,079 3.02% x 0.97
insert 13,911 8,763 -5,148 -37.01% x 1.59
insert_ahash_highbits 17,971 18,048 77 0.43% x 1.00
insert_ahash_random 18,071 17,950 -121 -0.67% x 1.01
insert_ahash_serial 17,980 17,867 -113 -0.63% x 1.01
insert_erase_ahash_highbits 18,830 19,945 1,115 5.92% x 0.94
insert_erase_ahash_random 19,060 19,897 837 4.39% x 0.96
insert_erase_ahash_serial 18,426 19,221 795 4.31% x 0.96
insert_erase_std_highbits 32,468 33,627 1,159 3.57% x 0.97
insert_erase_std_random 33,186 34,270 1,084 3.27% x 0.97
insert_erase_std_serial 32,487 33,589 1,102 3.39% x 0.97
insert_std_highbits 22,173 22,509 336 1.52% x 0.99
insert_std_random 22,249 22,673 424 1.91% x 0.98
insert_std_serial 22,203 22,440 237 1.07% x 0.99
insert_unique_unchecked 5,300 5,614 314 5.92% x 0.94
iter_ahash_highbits 928 927 -1 -0.11% x 1.00
iter_ahash_random 932 926 -6 -0.64% x 1.01
iter_ahash_serial 932 916 -16 -1.72% x 1.02
iter_std_highbits 930 938 8 0.86% x 0.99
iter_std_random 931 912 -19 -2.04% x 1.02
iter_std_serial 946 920 -26 -2.75% x 1.03
lookup_ahash_highbits 3,938 3,989 51 1.30% x 0.99
lookup_ahash_random 4,039 4,208 169 4.18% x 0.96
lookup_ahash_serial 3,829 3,929 100 2.61% x 0.97
lookup_fail_ahash_highbits 3,267 3,389 122 3.73% x 0.96
lookup_fail_ahash_random 3,316 3,502 186 5.61% x 0.95
lookup_fail_ahash_serial 3,367 3,332 -35 -1.04% x 1.01
lookup_fail_std_highbits 9,224 9,561 337 3.65% x 0.96
lookup_fail_std_random 9,337 9,620 283 3.03% x 0.97
lookup_fail_std_serial 9,246 9,318 72 0.78% x 0.99
lookup_std_highbits 9,996 10,278 282 2.82% x 0.97
lookup_std_random 10,077 10,681 604 5.99% x 0.94
lookup_std_serial 10,002 10,569 567 5.67% x 0.95
rehash_in_place 180,307 187,097 6,790 3.77% x 0.96
Overall, this seems like a loss. My guess is that the extra logic needed when handling h2 values is slowing things down.
One thing I wonder is how many collisions there are in the first place.
Perfect hashes should spread values uniformly across the 0-127 range today. Using the usual birthday-paradox approximation (a union bound), the probability of at least one collision across 16 elements is at most 16 × 15 / (2 × 128) = 0.9375.
Thus even with only 128 values to choose from, the chance of two given elements having the same residual h2 is pretty low in the first place (1/128). Even in a full group -- which won't be the case most of the time -- the probability of at least one collision is at most 93.75% (the true figure is lower)... and if there's a single collision, it means 14 elements (out of 16) have no collision.
This means paying for the extra complexity for every element, but rarely ever needing it.
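For reference, a small standalone snippet (not part of the PR) comparing the union-bound estimate above with the exact birthday-paradox probability for 16 uniformly distributed residuals over 128 values:

fn main() {
    let m = 128.0_f64; // number of distinct h2 residuals today
    let n = 16.0_f64;  // elements in a full group

    // Union-bound estimate used above: n(n-1) / (2m) = 0.9375.
    let bound = n * (n - 1.0) / (2.0 * m);

    // Exact probability that at least two of the 16 residuals collide,
    // assuming the hasher spreads them uniformly.
    let mut p_no_collision = 1.0_f64;
    for i in 1..16u32 {
        p_no_collision *= 1.0 - f64::from(i) / m;
    }

    println!("upper bound:            {bound:.4}");                  // 0.9375
    println!("exact P(>=1 collision): {:.4}", 1.0 - p_no_collision); // ~0.62
}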
I tried my best to keep the cost low, but computing h2 is slightly harder, and the "magic" remapping on rehash is definitely not optimal. Perhaps someone with better insight could pick better values, and better instructions.
@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression
let top8 = hash >> (MIN_HASH_LEN * 8 - 7)
is a truncation, so the implementation is slightly incorrect, and that is why we see bad performance.
In addition, I suggest also considering the following possible implementation (godbolt):
const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254

pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    // Shift right by 1 only when the byte is 254 or 255, so the result
    // can never collide with EMPTY or DELETED.
    bit >> ((bit > 253) as u8)
}
@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression
let top8 = hash >> (MIN_HASH_LEN * 8 - 7)
is a truncation, so the implementation is slightly incorrect, and that is why we see bad performance.
I realize that this implementation is also faulty: I was aiming for the top 8 bits, but it seems we only get the top 7 here.
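(For clarity -- my own illustration, assuming MIN_HASH_LEN is 8 on x64 as in the current code -- a shift by 8 * 8 - 7 = 57 leaves only the top 7 bits, while a full top byte would need a shift of 56:)

fn main() {
    let hash = u64::MAX;
    let top7 = hash >> (8 * 8 - 7); // 127: only 7 bits survive the shift
    let top8 = hash >> (8 * 8 - 8); // 255: the full top byte
    assert_eq!(top7, 0b0111_1111);
    assert_eq!(top8, 0b1111_1111);
}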
In addition, I suggest also considering the following possible implementation (godbolt):
const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254

#[no_mangle]
pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    bit >> ((bit > 253) as u8)
}
This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?
The trick to shift by 1 to avoid colliding with special values is interesting, and may save cycles compared to an array lookup.
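(A minimal standalone check of that trick, not part of either patch: shifting by 1 only when the byte exceeds 253 means the result can never be 254 or 255, so it cannot collide with the proposed EMPTY/DELETED values.)

const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254

fn h2(hash: u64) -> u8 {
    let byte = hash as u8;
    // 254 and 255 are shifted down to 127; every other byte is unchanged.
    byte >> ((byte > 253) as u8)
}

fn main() {
    // Exhaustive over the only byte of the hash that matters.
    for b in 0u64..=255 {
        let ctrl = h2(b);
        assert_ne!(ctrl, EMPTY);
        assert_ne!(ctrl, DELETED);
    }
    println!("h2 never produces a special control byte");
}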
This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?
To be honest, I don’t know why the top bits should be used. We just need some bits to skip obvious hash (value) mismatches. It seems (as we use a good hasher) that we can take any bits for these purposes, or maybe I don’t understand something.
The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).
@matthieu-m I carefully looked at the implementation and you are right: it is not possible to use values other than 127_i8 and 126_i8. In addition to your improvements, I also suggest changing the erase function:
const EMPTY: u8 = 0b0111_1111; // 127
const DELETED: u8 = 0b0111_1110; // 126

impl RawTableInner {
    #[inline]
    unsafe fn erase(&mut self, index: usize) {
        debug_assert!(self.is_bucket_full(index));
        let index_before = index.wrapping_sub(Group::WIDTH) & self.bucket_mask;
        let empty_before = Group::load(self.ctrl(index_before)).match_empty();
        let empty_after = Group::load(self.ctrl(index)).match_empty();

        // Branchless version of the original `if`: empty_group is 1 when the
        // erased slot is not inside a run of Group::WIDTH non-empty slots
        // (so it can safely become EMPTY), and 0 when it must become DELETED.
        let empty_group =
            (empty_before.leading_zeros() + empty_after.trailing_zeros() < Group::WIDTH) as u8;
        // EMPTY == DELETED + 1, so this picks between the two without a branch.
        let ctrl = DELETED + empty_group;
        self.growth_left += empty_group as usize;
        self.set_ctrl(index, ctrl);
        self.items -= 1;
    }
}
It looks like it will improve removal performance: https://godbolt.org/z/5P9oTYe5z
@JustForFun88 You seem to have many (good!) suggestions, and unfortunately I don't have much bandwidth right now.
I can probably get to them eventually (I may have some time in 2-3 weeks), but that's a bit of an eternity momentum-wise.
So, if I may suggest, why don't you check out the branch, apply your suggested changes, and run the benchmarks? You'll see quickly whether it works out, and you'll be able to try further ideas as well.
This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

To be honest, I don’t know why the top bits should be used. We just need some bits to skip obvious hash (value) mismatches. It seems (as we use a good hasher) that we can take any bits for these purposes, or maybe I don’t understand something.

The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).
You're correct. h1 is used for indexing via % table length, which takes the bottom bits, and Swiss Table (Abseil's implementation) uses the top bits for h2 (the residual).
Now, imagine a table of 256 slots: the index is hash % 256, i.e. exactly the bottom 8 bits -- the very same bits a bottom-byte h2 would use.
And therein lies the rub. The goal of h2 is to provide a quick filter across the elements of a group of 16 (contiguous) elements: if you use h2 to decide which group the element goes in, then the h2s of the elements within the group are not uniformly distributed -- at all -- and thus it becomes a very poor filter, and performance will likely suffer.
Thus it's important to try and source h1 and h2 from different, non-overlapping bits, as much as possible. And taking top and bottom is the easiest way to do so. Which of the two takes top and which takes bottom shouldn't matter -- with a 64-bit hash -- so if it's faster for h2 to take the bottom bits, then defining h1 as hash >> 8 would work.
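To make that concrete, here is a minimal sketch (not the PR's code) of sourcing the two values from opposite ends of the hash when h2 takes the bottom byte; it leaves aside the separate question of avoiding the EMPTY/DELETED encodings discussed above:

// Sketch only: h2 takes the bottom 8 bits, so h1 skips them, keeping the
// residual filter independent of the bits used for indexing.
fn h1(hash: u64) -> usize {
    (hash >> 8) as usize
}

fn h2(hash: u64) -> u8 {
    hash as u8
}

fn bucket_index(hash: u64, bucket_mask: usize) -> usize {
    // The table length is a power of two, so only the low bits of h1
    // survive the mask anyway.
    h1(hash) & bucket_mask
}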
The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).
Yes, because only roughly 48 bits are actually addressable on x86_64 presently. And I'm not sure of any systems which even approach that size... so when the h1 gets truncated to the bucket mask, those upper bits aren't looked at. It's not worth doing any work to shrink down to the lower 57 bits because those will get filtered by the mask anyway.
:umbrella: The latest upstream changes (presumably #558) made this pull request unmergeable. Please resolve the merge conflicts.
Changes:
Group functions are correctly implemented.

Motivation:
Using 256 values instead of 130 could theoretically lower the number of false-positive residual matches by close to 50%.
On the other hand, it does make h2 slightly more complicated to compute, and possibly to operate on.
Limitations:
Only the SSE2 implementation is ported at first, not the generic or neon ones, to gauge whether the performance looks worth it.
Design:
The values for EMPTY and DELETED are chosen so as to play well with SSE2, which does not have unsigned vectors. By using the top of the signed range, operations to distinguish between special and non-special are reduced to a single comparison, whereas using the middle of the range would require 2.
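For illustration, a rough standalone sketch (using core::arch intrinsics directly rather than the crate's Group wrapper) of that single comparison, assuming EMPTY = 127 and DELETED = 126: in signed i8 terms only those two bytes are greater than 125, while every full control byte -- including 128..=255, which are negative as i8 -- is not.

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Returns a 16-bit mask with one bit set for each special (EMPTY or DELETED)
// control byte in the group; full bytes, whatever their h2 value, stay 0.
#[cfg(target_arch = "x86_64")]
unsafe fn match_special(group: __m128i) -> u16 {
    // EMPTY (127) and DELETED (126) sit at the top of the signed i8 range,
    // so one signed "greater than 125" comparison flags exactly them.
    let special = _mm_cmpgt_epi8(group, _mm_set1_epi8(125));
    _mm_movemask_epi8(special) as u16
}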
On the other hand, the convert_special_to_empty_and_full_to_deleted method is more complicated.

The results, on my machine, are not encouraging, but my machine is noisy, so conducting proper benchmarking is tough.