`foldhash::fast::RandomState` is very bad

rust-lang / hashbrown

Rust port of Google's SwissTable hash map

https://rust-lang.github.io/hashbrown

Apache License 2.0

2.46k stars 288 forks source link

`foldhash::fast::RandomState` is very bad #577

Open turalcar opened 1 month ago

turalcar commented 1 month ago

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=8d9004344a0ddd7799b90e24f5866e75

This counts collisions. The number fluctuates wildly from 1.5% to 7% and mostly depends on global seed (if one measurement is bad then all of them are). Both foldhash::quality::RandomState and ahash::RandomState remain at around 2.9% and would be acceptable as default.

Amanieu commented 1 month ago

What kind of numbers do you get from ahash or fxhash?

cc @orlp

orlp commented 1 month ago

For such small sets and such a small input domain you are testing very specific relations between input bits and output bits (only the top 7 output bits for the tag and the bottom log2(n / 16) output bits for the bucket, if I'm not mistaken). So only 13 of the output bits are being focused on in this test (even less since there's no pre-reserve call), with only 10 input bits.

If an unlucky seed is chosen, you can hit an unlucky avalanche probability, as is documented in the readme for foldhash-f.

In the initialization procedure of foldhash I already force some bits on in the global seed, perhaps I can extend this a bit more to force a few more bits on/off such that the small-input case always at least avalanches into the top 7 bits.

turalcar commented 1 month ago

1000 was chosen arbitrarily. I reran this with 0x3800 items for maximum density and averaged over 32768 runs. fxhash is an interesting one - it seems to always use the same seed: 0.226 collisions per insertion. ahash: 0.0637 collisions per insertion std: 0.0637 collisions per insertion foldhash::quality: 0.0637 collisions per insertion. foldhash::fast: I ran this 10 times, avg: 0.0699, min: 0.0422, max: 0.213 The result is similar for 0x3801 (minimum density): all hashes provide 0.0273 collision rate, except fxhash which has exactly 0, and foldhash::fast which gave avg: 0.0292, min: 0.0201, max: 0.0458.

turalcar commented 1 month ago

The motivating issue for me was actually small strings. Using to_string() before hashing doesn't change collision rates for most hash functions. For foldhash::fast it becomes consistent but still a bit worse (0.0715 at max density, 0.0349 at min density). For fxhash it becomes 0.133 at max density, 0.0503 at min density.

orlp commented 1 month ago

Upon further investigation it appears that the issue isn't (just) simply that the 7-bit tag has a bad distribution, but also that for small number inputs it can be correlated with the bottom bits for some seeds. I think I can also resolve this problem by choosing seeds more carefully though.

arthurprs commented 4 weeks ago

@orlp, I was just wondering if you have further thoughts on this edge case? There are various "low u32" keyed hashtables in https://github.com/arthurprs/canopydb that could be affected by this, and I'm not sure if I need to switch to the "quality" hash to prevent such degenerate cases.

orlp commented 4 weeks ago

@arthurprs I'm working on analyzing it, at this point I don't have a concrete answer yet. I've collected 1GB worth of seeds along with a performance score on the average number of equality checks performed when inserting the numbers 0..1<<16 into an hashbrown set, and I'm trying to establish a concrete pattern for what determines whether a seed is good or not, and why. Then I can hopefully update the seed selection to only generate good seeds.

Foldhash-q is rock-solid regardless of seed, so if you want to be 100% certain performance is predictable, it's a good choice, especially in the meantime.