Open turalcar opened 1 month ago
What kind of numbers do you get from ahash or fxhash?
cc @orlp
For such small sets and such a small input domain you are testing very specific relations between input bits and output bits (only the top 7 output bits for the tag and the bottom log2(n / 16)
output bits for the bucket, if I'm not mistaken). So only 13 of the output bits are being focused on in this test (even less since there's no pre-reserve call), with only 10 input bits.
If an unlucky seed is chosen, you can hit an unlucky avalanche probability, as is documented in the readme for foldhash-f.
In the initialization procedure of foldhash I already force some bits on in the global seed, perhaps I can extend this a bit more to force a few more bits on/off such that the small-input case always at least avalanches into the top 7 bits.
1000 was chosen arbitrarily. I reran this with 0x3800 items for maximum density and averaged over 32768 runs. fxhash is an interesting one - it seems to always use the same seed: 0.226 collisions per insertion. ahash: 0.0637 collisions per insertion std: 0.0637 collisions per insertion foldhash::quality: 0.0637 collisions per insertion. foldhash::fast: I ran this 10 times, avg: 0.0699, min: 0.0422, max: 0.213 The result is similar for 0x3801 (minimum density): all hashes provide 0.0273 collision rate, except fxhash which has exactly 0, and foldhash::fast which gave avg: 0.0292, min: 0.0201, max: 0.0458.
The motivating issue for me was actually small strings. Using to_string() before hashing doesn't change collision rates for most hash functions. For foldhash::fast it becomes consistent but still a bit worse (0.0715 at max density, 0.0349 at min density). For fxhash it becomes 0.133 at max density, 0.0503 at min density.
Upon further investigation it appears that the issue isn't (just) simply that the 7-bit tag has a bad distribution, but also that for small number inputs it can be correlated with the bottom bits for some seeds. I think I can also resolve this problem by choosing seeds more carefully though.
@orlp, I was just wondering if you have further thoughts on this edge case? There are various "low u32" keyed hashtables in https://github.com/arthurprs/canopydb that could be affected by this, and I'm not sure if I need to switch to the "quality" hash to prevent such degenerate cases.
@arthurprs I'm working on analyzing it, at this point I don't have a concrete answer yet. I've collected 1GB worth of seeds along with a performance score on the average number of equality checks performed when inserting the numbers 0..1<<16
into an hashbrown
set, and I'm trying to establish a concrete pattern for what determines whether a seed is good or not, and why. Then I can hopefully update the seed selection to only generate good seeds.
Foldhash-q is rock-solid regardless of seed, so if you want to be 100% certain performance is predictable, it's a good choice, especially in the meantime.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=8d9004344a0ddd7799b90e24f5866e75
This counts collisions. The number fluctuates wildly from 1.5% to 7% and mostly depends on global seed (if one measurement is bad then all of them are). Both
foldhash::quality::RandomState
andahash::RandomState
remain at around 2.9% and would be acceptable as default.