rust-lang / rustc-hash

Custom hash algorithm used by rustc (plus hashmap/set aliases): fast, deterministic, not secure
Apache License 2.0
377 stars 47 forks source link

Replace hash with faster and better finalized hash #37

Closed orlp closed 4 months ago

orlp commented 4 months ago

The current FxHasher design has a few issues:

  1. It is relatively slow for longer strings/slices.
  2. It has no finalization at all, which means that for the last integer (or in the case of a single integer, the only integer) any differences in bit i do not affect bits less significant than i. Since most hash table implementations (including hashbrown) compute their bucket indices from the least significant bits, this means everything can collide for a hash table of size 2^n if all differences are in bits n or above.

With this PR I aim to fix 1 by adding a proper string hash, and mitigate 2 by adding a bit rotation as a finalizer. This bit rotation moves the high-entropy high bits into the low bits expected by hash table implementations. I also simplified the hasher to a polynomial hash, which further improved compile times and makes the hasher less of an arbitrary mish-mash of operations and more a combination of justifiable and understandable components.

I will make a dummy PR on rust-lang to do a perf run for this PR.

orlp commented 4 months ago

Note that should this get merged for optimal performance the nightly feature should be added to all places that include rustc-hash in the Rust repository. It avoids a bit of overhead in the string / slice hash by avoiding the prefix-free encoding that's not necessary.

lqd commented 4 months ago

My recollection is that rustc doesn't hash a lot of strings so issue 1 didn't seem very impactful in practice, did your investigations show the opposite maybe?

FWIW, a while back I also tried this same hash function for the integers, but with a different constant from the same paper and various finalization ops -- as such a sequence is simple enough that my rust lookalike of hash-prospector finds it very quickly. It wasn't an obvious improvement over the current hash at the time, and had less promising results than #18.

(Edit: for comparison, here are #18's results, cycles here https://perf.rust-lang.org/compare.html?start=d53f1e8fbf891cf84fcb11eb078a27e528df795a&end=71812917089730aa5abc2f6c5978c6718c9b0f5f&stat=cycles%3Au)

orlp commented 4 months ago

The performance run for this PR: https://perf.rust-lang.org/compare.html?start=0160bff4b1bffa241299aba8c8c63e7a3cd871fe&end=c358310bb0ab238e25e60f1e77f96e0a2b270d0b

orlp commented 4 months ago

@lqd The integer hash is the more important one for sure, but there are also some string hashes in the top 99% of usage:

     calls         %       cum%
  38646076 (24.3749%) (24.3749%): int(8)
  38216888 (24.1042%) (48.4791%): int(4)
  24624533 (15.5312%) (64.0103%): int(8), int(8)
  16260169 (10.2556%) (74.2660%): int(8), int(8), int(8)
  10060303 ( 6.3452%) (80.6112%): int(8), int(8), int(8), int(8)
   6120759 ( 3.8605%) (84.4717%): int(4), int(8)
   4004318 ( 2.5256%) (86.9973%): int(8), int(8), int(8), int(8), int(8), int(8)
   2629811 ( 1.6587%) (88.6560%): int(4), int(4)
   2531435 ( 1.5966%) (90.2526%): int(8), int(8), int(8), int(8), int(8)
   2394405 ( 1.5102%) (91.7628%): int(2), int(2), int(4)
   2037351 ( 1.2850%) (93.0478%): int(4), int(8), int(8), int(8), int(8)
   1721253 ( 1.0856%) (94.1335%): int(4), int(4), int(8)
   1593315 ( 1.0049%) (95.1384%): int(4), int(8), int(8)
    803773 ( 0.5070%) (95.6454%): int(4), int(4), int(4), int(8)
    652177 ( 0.4113%) (96.0567%): int(1), int(8)
    585849 ( 0.3695%) (96.4262%): int(4), int(4), int(8), int(8)
    542942 ( 0.3424%) (96.7687%): str(32), int(8), int(8), int(8), int(8)
    404925 ( 0.2554%) (97.0241%): str(4), int(1)
    350200 ( 0.2209%) (97.2449%): str(12), int(1)
    303208 ( 0.1912%) (97.4362%): str(24), int(1)
    243336 ( 0.1535%) (97.5896%): str(6), int(1)
    206939 ( 0.1305%) (97.7202%): str(3), int(1)
    202990 ( 0.1280%) (97.8482%): str(16), int(1)
    186403 ( 0.1176%) (97.9658%): int(4), int(4), int(8), int(8), int(8)
    178159 ( 0.1124%) (98.0781%): int(1), int(8), int(8), int(8), int(8), int(8)
    177280 ( 0.1118%) (98.1899%): str(5), int(1)
    170946 ( 0.1078%) (98.2978%): int(4), int(8), int(8), int(8), int(8), int(8)
    166919 ( 0.1053%) (98.4030%): int(4), int(8), int(8), int(8)
    155574 ( 0.0981%) (98.5012%): str(2), int(1)
    149823 ( 0.0945%) (98.5957%): str(20), int(1)
    138569 ( 0.0874%) (98.6831%): str(80), int(1)
    125819 ( 0.0794%) (98.7624%): str(7), int(1)
    124243 ( 0.0784%) (98.8408%): int(8), int(8), int(8), int(8), int(8), int(8), int(8)
    123919 ( 0.0782%) (98.9189%): str(8), int(1)
    103888 ( 0.0655%) (98.9845%): int(16)
     94770 ( 0.0598%) (99.0442%): int(1), int(8), int(8), int(8), int(16)
nnethercote commented 4 months ago

It would be good to squash all the commits together before merging this, because they're not logically separate changes.

nnethercote commented 4 months ago

With the additional clarifications via comments and squashing I am happy with this. I don't normally give r+ to PRs in this repo so I will leave that to @Nilstrieb(?)

nnethercote commented 4 months ago

@orlp And thanks for answering my questions and helping me get past my initial skepticism :) There have been multiple unsuccessful attempts by multiple people to improve on FxHash, some mentioned in this blog post, so it's nice to finally see success!

WaffleLapkin commented 4 months ago

Do you think it might be worth renaming Fx* to something else? We are not really bound by semver here and could just make a breaking change.

nnethercote commented 4 months ago

git grep tells me there are 2766 occurences of FxHash/FxIndex in the codebase. I don't think it's worth changing the name.

orlp commented 4 months ago

It's just a whole lot of pain for little gain. I think my comment in the readme is fine (and also explains why its called FxHash):

The original hash algorithm provided by this crate was one taken from Firefox, hence the hasher it provides is called FxHasher. This name is kept for backwards compatibility, but the underlying hash has since been replaced. The current design for the hasher is a polynomial hash finished with a single bit rotation, together with a wyhash-inspired compression function for strings/slices, both designed by Orson Peters.