Closed orlp closed 4 months ago
Note that should this get merged for optimal performance the nightly
feature should be added to all places that include rustc-hash
in the Rust repository. It avoids a bit of overhead in the string / slice hash by avoiding the prefix-free encoding that's not necessary.
My recollection is that rustc doesn't hash a lot of strings so issue 1 didn't seem very impactful in practice, did your investigations show the opposite maybe?
FWIW, a while back I also tried this same hash function for the integers, but with a different constant from the same paper and various finalization ops -- as such a sequence is simple enough that my rust lookalike of hash-prospector finds it very quickly. It wasn't an obvious improvement over the current hash at the time, and had less promising results than #18.
(Edit: for comparison, here are #18's results, cycles here https://perf.rust-lang.org/compare.html?start=d53f1e8fbf891cf84fcb11eb078a27e528df795a&end=71812917089730aa5abc2f6c5978c6718c9b0f5f&stat=cycles%3Au)
The performance run for this PR: https://perf.rust-lang.org/compare.html?start=0160bff4b1bffa241299aba8c8c63e7a3cd871fe&end=c358310bb0ab238e25e60f1e77f96e0a2b270d0b
@lqd The integer hash is the more important one for sure, but there are also some string hashes in the top 99% of usage:
calls % cum%
38646076 (24.3749%) (24.3749%): int(8)
38216888 (24.1042%) (48.4791%): int(4)
24624533 (15.5312%) (64.0103%): int(8), int(8)
16260169 (10.2556%) (74.2660%): int(8), int(8), int(8)
10060303 ( 6.3452%) (80.6112%): int(8), int(8), int(8), int(8)
6120759 ( 3.8605%) (84.4717%): int(4), int(8)
4004318 ( 2.5256%) (86.9973%): int(8), int(8), int(8), int(8), int(8), int(8)
2629811 ( 1.6587%) (88.6560%): int(4), int(4)
2531435 ( 1.5966%) (90.2526%): int(8), int(8), int(8), int(8), int(8)
2394405 ( 1.5102%) (91.7628%): int(2), int(2), int(4)
2037351 ( 1.2850%) (93.0478%): int(4), int(8), int(8), int(8), int(8)
1721253 ( 1.0856%) (94.1335%): int(4), int(4), int(8)
1593315 ( 1.0049%) (95.1384%): int(4), int(8), int(8)
803773 ( 0.5070%) (95.6454%): int(4), int(4), int(4), int(8)
652177 ( 0.4113%) (96.0567%): int(1), int(8)
585849 ( 0.3695%) (96.4262%): int(4), int(4), int(8), int(8)
542942 ( 0.3424%) (96.7687%): str(32), int(8), int(8), int(8), int(8)
404925 ( 0.2554%) (97.0241%): str(4), int(1)
350200 ( 0.2209%) (97.2449%): str(12), int(1)
303208 ( 0.1912%) (97.4362%): str(24), int(1)
243336 ( 0.1535%) (97.5896%): str(6), int(1)
206939 ( 0.1305%) (97.7202%): str(3), int(1)
202990 ( 0.1280%) (97.8482%): str(16), int(1)
186403 ( 0.1176%) (97.9658%): int(4), int(4), int(8), int(8), int(8)
178159 ( 0.1124%) (98.0781%): int(1), int(8), int(8), int(8), int(8), int(8)
177280 ( 0.1118%) (98.1899%): str(5), int(1)
170946 ( 0.1078%) (98.2978%): int(4), int(8), int(8), int(8), int(8), int(8)
166919 ( 0.1053%) (98.4030%): int(4), int(8), int(8), int(8)
155574 ( 0.0981%) (98.5012%): str(2), int(1)
149823 ( 0.0945%) (98.5957%): str(20), int(1)
138569 ( 0.0874%) (98.6831%): str(80), int(1)
125819 ( 0.0794%) (98.7624%): str(7), int(1)
124243 ( 0.0784%) (98.8408%): int(8), int(8), int(8), int(8), int(8), int(8), int(8)
123919 ( 0.0782%) (98.9189%): str(8), int(1)
103888 ( 0.0655%) (98.9845%): int(16)
94770 ( 0.0598%) (99.0442%): int(1), int(8), int(8), int(8), int(16)
It would be good to squash all the commits together before merging this, because they're not logically separate changes.
With the additional clarifications via comments and squashing I am happy with this. I don't normally give r+ to PRs in this repo so I will leave that to @Nilstrieb(?)
@orlp And thanks for answering my questions and helping me get past my initial skepticism :) There have been multiple unsuccessful attempts by multiple people to improve on FxHash, some mentioned in this blog post, so it's nice to finally see success!
Do you think it might be worth renaming Fx*
to something else? We are not really bound by semver here and could just make a breaking change.
git grep
tells me there are 2766 occurences of FxHash
/FxIndex
in the codebase. I don't think it's worth changing the name.
It's just a whole lot of pain for little gain. I think my comment in the readme is fine (and also explains why its called FxHash
):
The original hash algorithm provided by this crate was one taken from Firefox, hence the hasher it provides is called FxHasher. This name is kept for backwards compatibility, but the underlying hash has since been replaced. The current design for the hasher is a polynomial hash finished with a single bit rotation, together with a wyhash-inspired compression function for strings/slices, both designed by Orson Peters.
The current FxHasher design has a few issues:
i
do not affect bits less significant thani
. Since most hash table implementations (including hashbrown) compute their bucket indices from the least significant bits, this means everything can collide for a hash table of size 2^n if all differences are in bits n or above.With this PR I aim to fix 1 by adding a proper string hash, and mitigate 2 by adding a bit rotation as a finalizer. This bit rotation moves the high-entropy high bits into the low bits expected by hash table implementations. I also simplified the hasher to a polynomial hash, which further improved compile times and makes the hasher less of an arbitrary mish-mash of operations and more a combination of justifiable and understandable components.
I will make a dummy PR on rust-lang to do a perf run for this PR.