Open matthieu-m opened 6 months ago
@Amanieu Would it be possible to benchmark on SSE2?
I'd like to see if it's worth it before trying to make the generic and neon targets pass (especially the generic one; all that bit fiddling is a bit involved).
Here are my benchmark results:
name old.txt ns/iter new.txt ns/iter diff ns/iter diff % speedup
clone_from_large 5,064 5,170 106 2.09% x 0.98
clone_from_small 46 46 0 0.00% x 1.00
clone_large 5,141 5,220 79 1.54% x 0.98
clone_small 61 62 1 1.64% x 0.98
grow_insert_ahash_highbits 19,525 20,297 772 3.95% x 0.96
grow_insert_ahash_random 19,889 20,877 988 4.97% x 0.95
grow_insert_ahash_serial 19,530 20,794 1,264 6.47% x 0.94
grow_insert_std_highbits 36,006 36,842 836 2.32% x 0.98
grow_insert_std_random 35,998 36,841 843 2.34% x 0.98
grow_insert_std_serial 35,786 36,865 1,079 3.02% x 0.97
insert 13,911 8,763 -5,148 -37.01% x 1.59
insert_ahash_highbits 17,971 18,048 77 0.43% x 1.00
insert_ahash_random 18,071 17,950 -121 -0.67% x 1.01
insert_ahash_serial 17,980 17,867 -113 -0.63% x 1.01
insert_erase_ahash_highbits 18,830 19,945 1,115 5.92% x 0.94
insert_erase_ahash_random 19,060 19,897 837 4.39% x 0.96
insert_erase_ahash_serial 18,426 19,221 795 4.31% x 0.96
insert_erase_std_highbits 32,468 33,627 1,159 3.57% x 0.97
insert_erase_std_random 33,186 34,270 1,084 3.27% x 0.97
insert_erase_std_serial 32,487 33,589 1,102 3.39% x 0.97
insert_std_highbits 22,173 22,509 336 1.52% x 0.99
insert_std_random 22,249 22,673 424 1.91% x 0.98
insert_std_serial 22,203 22,440 237 1.07% x 0.99
insert_unique_unchecked 5,300 5,614 314 5.92% x 0.94
iter_ahash_highbits 928 927 -1 -0.11% x 1.00
iter_ahash_random 932 926 -6 -0.64% x 1.01
iter_ahash_serial 932 916 -16 -1.72% x 1.02
iter_std_highbits 930 938 8 0.86% x 0.99
iter_std_random 931 912 -19 -2.04% x 1.02
iter_std_serial 946 920 -26 -2.75% x 1.03
lookup_ahash_highbits 3,938 3,989 51 1.30% x 0.99
lookup_ahash_random 4,039 4,208 169 4.18% x 0.96
lookup_ahash_serial 3,829 3,929 100 2.61% x 0.97
lookup_fail_ahash_highbits 3,267 3,389 122 3.73% x 0.96
lookup_fail_ahash_random 3,316 3,502 186 5.61% x 0.95
lookup_fail_ahash_serial 3,367 3,332 -35 -1.04% x 1.01
lookup_fail_std_highbits 9,224 9,561 337 3.65% x 0.96
lookup_fail_std_random 9,337 9,620 283 3.03% x 0.97
lookup_fail_std_serial 9,246 9,318 72 0.78% x 0.99
lookup_std_highbits 9,996 10,278 282 2.82% x 0.97
lookup_std_random 10,077 10,681 604 5.99% x 0.94
lookup_std_serial 10,002 10,569 567 5.67% x 0.95
rehash_in_place 180,307 187,097 6,790 3.77% x 0.96
Overall, this seems like a loss. My guess is that the extra logic needed when handling h2 values is slowing things down.
One thing I wonder is how many collisions there are in the first place.
Perfect hashes should spread values uniformly across the 0-127 range today. Using the usual birthday-paradox approximation (a union bound), the probability of at least one collision across 16 elements is at most 16 × 15 / (2 × 128) = 0.9375.
Thus even with only 128 values to choose from, the chance of two given elements having the same residual h2 is pretty low in the first place (1/128). Even in a full group -- which won't be the case most of the time -- the probability of at least one collision is at most 93.75% (the true figure is lower)... and if there's a single collision, it means 14 elements (out of 16) have no collision.
This means paying for the extra complexity for every element, but rarely ever needing it.
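For reference, a small standalone snippet (not part of the PR) comparing the union-bound estimate above with the exact birthday-paradox probability for 16 uniformly distributed residuals over 128 values:

fn main() {
    let m = 128.0_f64; // number of distinct h2 residuals today
    let n = 16.0_f64;  // elements in a full group

    // Union-bound estimate used above: n(n-1) / (2m) = 0.9375.
    let bound = n * (n - 1.0) / (2.0 * m);

    // Exact probability that at least two of the 16 residuals collide,
    // assuming the hasher spreads them uniformly.
    let mut p_no_collision = 1.0_f64;
    for i in 1..16u32 {
        p_no_collision *= 1.0 - f64::from(i) / m;
    }

    println!("upper bound:            {bound:.4}");                  // 0.9375
    println!("exact P(>=1 collision): {:.4}", 1.0 - p_no_collision); // ~0.62
}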
I tried my best to keep the cost low, but computing h2 is slightly harder, and the "magic" remapping on rehash is definitely not optimal. Perhaps someone with better insight could pick better values, and better instructions.
@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression
let top8 = hash >> (MIN_HASH_LEN * 8 - 7)
is a truncation, so the implementation is slightly incorrect, and that is why we see bad performance.
In addition, I suggest also considering the following possible implementation (godbolt):
const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254

pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    // Shift right by 1 only when the byte is 254 or 255, so the result
    // can never collide with EMPTY or DELETED.
    bit >> ((bit > 253) as u8)
}
@matthieu-m It seems to me that you did not take into account that on the x64 platform the expression
let top8 = hash >> (MIN_HASH_LEN * 8 - 7)
is a truncation, so the implementation is slightly incorrect, and that is why we see bad performance.
I realize that this implementation is also faulty: I was aiming for the top 8 bits, but it seems we only get the top 7 here.
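(For clarity -- my own illustration, assuming MIN_HASH_LEN is 8 on x64 as in the current code -- a shift by 8 * 8 - 7 = 57 leaves only the top 7 bits, while a full top byte would need a shift of 56:)

fn main() {
    let hash = u64::MAX;
    let top7 = hash >> (8 * 8 - 7); // 127: only 7 bits survive the shift
    let top8 = hash >> (8 * 8 - 8); // 255: the full top byte
    assert_eq!(top7, 0b0111_1111);
    assert_eq!(top8, 0b1111_1111);
}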
In addition, I suggest also considering the following possible implementation (godbolt):
const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254

#[no_mangle]
pub fn h2(hash: u64) -> u8 {
    let bit = hash as u8;
    bit >> ((bit > 253) as u8)
}
This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?
The trick to shift by 1 to avoid colliding with special values is interesting, and may save cycles compared to an array lookup.
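(A minimal standalone check of that trick, not part of either patch: shifting by 1 only when the byte exceeds 253 means the result can never be 254 or 255, so it cannot collide with the proposed EMPTY/DELETED values.)

const EMPTY: u8 = 0b1111_1111; // 255
const DELETED: u8 = 0b1111_1110; // 254

fn h2(hash: u64) -> u8 {
    let byte = hash as u8;
    // 254 and 255 are shifted down to 127; every other byte is unchanged.
    byte >> ((byte > 253) as u8)
}

fn main() {
    // Exhaustive over the only byte of the hash that matters.
    for b in 0u64..=255 {
        let ctrl = h2(b);
        assert_ne!(ctrl, EMPTY);
        assert_ne!(ctrl, DELETED);
    }
    println!("h2 never produces a special control byte");
}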
This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?
To be honest, I don’t know why the top bits should be used. We just need some bits to skip obvious hash (value) mismatches. It seems (as we use a good hasher) that we can take any bits for these purposes, or maybe I don’t understand something.
The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).
@matthieu-m I carefully looked at the implementation and you are right: it is not possible to use values other than 127_i8 and 126_i8. In addition to your improvements, I also suggest changing the erase function:
const EMPTY: u8 = 0b0111_1111; // 127
const DELETED: u8 = 0b0111_1110; // 126

impl RawTableInner {
    #[inline]
    unsafe fn erase(&mut self, index: usize) {
        debug_assert!(self.is_bucket_full(index));
        let index_before = index.wrapping_sub(Group::WIDTH) & self.bucket_mask;
        let empty_before = Group::load(self.ctrl(index_before)).match_empty();
        let empty_after = Group::load(self.ctrl(index)).match_empty();

        // Branchless version of the original `if`: empty_group is 1 when the
        // erased slot is not inside a run of Group::WIDTH non-empty slots
        // (so it can safely become EMPTY), and 0 when it must become DELETED.
        let empty_group =
            (empty_before.leading_zeros() + empty_after.trailing_zeros() < Group::WIDTH) as u8;
        // EMPTY == DELETED + 1, so this picks between the two without a branch.
        let ctrl = DELETED + empty_group;
        self.growth_left += empty_group as usize;
        self.set_ctrl(index, ctrl);
        self.items -= 1;
    }
}
It looks like it will improve removal performance: https://godbolt.org/z/5P9oTYe5z
@JustForFun88 You seem to have many (good!) suggestions, and unfortunately I don't have much bandwidth right now.
I can probably get to them eventually (I may have some time in 2-3 weeks), but that's a bit of an eternity momentum-wise.
So, if I may suggest, why don't you check out the branch, apply your suggested changes, and run the benchmarks? You'll see quickly whether it works out, and you'll be able to try further ideas as well.
This switches from the top 8 to the bottom 8 bits, and will overlap with h1. Is there a reason to prefer the bottom 8 (and adjust h1) over the top 8?

To be honest, I don’t know why the top bits should be used. We just need some bits to skip obvious hash (value) mismatches. It seems (as we use a good hasher) that we can take any bits for these purposes, or maybe I don’t understand something.

The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).
You're correct. h1 is used for indexing via % table length, which takes the bottom bits, and Swiss Table (Abseil's implementation) uses the top bits for h2 (the residual).
Now, imagine a table of 256 slots: the index is hash % 256, i.e. exactly the bottom 8 bits -- the very same bits a bottom-byte h2 would use.
And therein lies the rub. The goal of h2 is to provide a quick filter across the elements of a group of 16 (contiguous) elements: if you use h2 to decide which group the element goes in, then the h2s of the elements within the group are not uniformly distributed -- at all -- and thus it becomes a very poor filter, and performance will likely suffer.
Thus it's important to try and source h1 and h2 from different, non-overlapping bits, as much as possible. And taking top and bottom is the easiest way to do so. Which of the two takes top and which takes bottom shouldn't matter -- with a 64-bit hash -- so if it's faster for h2 to take the bottom bits, then defining h1 as hash >> 8 would work.
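To make that concrete, here is a minimal sketch (not the PR's code) of sourcing the two values from opposite ends of the hash when h2 takes the bottom byte; it leaves aside the separate question of avoiding the EMPTY/DELETED encodings discussed above:

// Sketch only: h2 takes the bottom 8 bits, so h1 skips them, keeping the
// residual filter independent of the bits used for indexing.
fn h1(hash: u64) -> usize {
    (hash >> 8) as usize
}

fn h2(hash: u64) -> u8 {
    hash as u8
}

fn bucket_index(hash: u64, bucket_mask: usize) -> usize {
    // The table length is a power of two, so only the low bits of h1
    // survive the mask anyway.
    h1(hash) & bucket_mask
}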
The h1 function is used for indexing and I think it doesn’t matter that it overlaps with h2. On x64, h1 simply returns the provided value (u64).
Yes, because only roughly 48 bits are actually addressable on x86_64 presently. And I'm not sure of any systems which even approach that size... so when the h1 gets truncated to the bucket mask, those upper bits aren't looked at. It's not worth doing any work to shrink down to the lower 57 bits because those will get filtered by the mask anyway.
:umbrella: The latest upstream changes (presumably #558) made this pull request unmergeable. Please resolve the merge conflicts.
Changes:
Group functions are correctly implemented.

Motivation:
Using 256 values instead of 130 could theoretically lower the number of false-positive residual matches by close to 50%.
On the other hand, it does make h2 slightly more complicated to compute, and possibly to operate on.
Limitations:
Only the SSE2 implementation is ported at first, not the generic or neon ones, to gauge whether the performance looks worth it.
Design:
The values for EMPTY and DELETED are chosen so as to play well with SSE2, which does not have unsigned vectors. By using the top of the signed range, operations to distinguish between special and non-special are reduced to a single comparison, whereas using the middle of the range would require 2.
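For illustration, a rough standalone sketch (using core::arch intrinsics directly rather than the crate's Group wrapper) of that single comparison, assuming EMPTY = 127 and DELETED = 126: in signed i8 terms only those two bytes are greater than 125, while every full control byte -- including 128..=255, which are negative as i8 -- is not.

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

// Returns a 16-bit mask with one bit set for each special (EMPTY or DELETED)
// control byte in the group; full bytes, whatever their h2 value, stay 0.
#[cfg(target_arch = "x86_64")]
unsafe fn match_special(group: __m128i) -> u16 {
    // EMPTY (127) and DELETED (126) sit at the top of the signed i8 range,
    // so one signed "greater than 125" comparison flags exactly them.
    let special = _mm_cmpgt_epi8(group, _mm_set1_epi8(125));
    _mm_movemask_epi8(special) as u16
}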
On the other hand, the convert_special_to_empty_and_full_to_deleted method is more complicated.

The results, on my machine, are not encouraging, but my machine is noisy, so conducting proper benchmarking is tough.