rust-lang / rustc-hash

Custom hash algorithm used by rustc (plus hashmap/set aliases): fast, deterministic, not secure
Apache License 2.0
377 stars 47 forks source link

Slowdown when inserting elements in the iteration order of another FxHashMap #45

Open FeldrinH opened 3 months ago

FeldrinH commented 3 months ago

It seems that in 2.0.0 there has been a regression in the fix to https://github.com/rust-lang/rustc-hash/issues/14.

In 1.2.0 it was possible to avoid the issue by using different seeds for different FxHashMaps (by manually setting the seed or using a randomized seed). This no longer works in 2.0.0. Even if you use two FxHashMaps/FxHashSets with different seeds you still get extremely bad performance when values are inserted in the iteration order of another FxHashMap/FxHashSet.

For example, try this:

fn main() {
    let mut rnd = rand::thread_rng();

    let t0 = Instant::now();
    let mut h1 = HashSet::with_hasher(FxSeededState::with_seed(9943));
    for _ in 0..10_000_000 {
        h1.insert(rnd.gen::<u64>());
    }
    let t1 = Instant::now();
    println!("build: {}", (t1 - t0).as_secs_f64());

    let t0 = Instant::now();
    let mut h2 = HashSet::with_hasher(FxSeededState::with_seed(1829319203));
    for &n in h1.iter() {
        h2.insert(n);
    }
    let t1 = Instant::now();
    println!("rebuild: {}", (t1 - t0).as_secs_f64());

    println!("{}", h2.len());
}

On my computer this takes 0.3 seconds for the inital build, but 2.6 seconds for the rebuild with 2.0.0. With 1.2.0 this takes 0.3 seconds for both build and rebuild.

Note that FxHashSetRand and FxHashMapRand have the same problem, since no matter what seeds they generate you get more or less the same 8-9x slowdown.

orlp commented 3 months ago

The problem is that the new hash is completely linear with the seed, which still runs into the accidentally-quadratic probing issue.

FeldrinH commented 1 month ago

Is there any hope that this issue will be fixed in the near term? If not then I think a warning should be added to the readme and/or documentation. This issue makes the current hashing algorithm a potential source of extreme slowdown in any use case where hahsmaps are serialized and deserialized.