1.65x performance regression from 1.35.0 to 1.36.0

utterstep commented 5 years ago

Seeing ~2x (46ms vs 28ms) performance regression after updating 1.35.0 to 1.36.0 (probably due to new HashMap implementation?).

Steps to reproduce

Clone this repo: https://github.com/utter-step/advent-2018
rustup override set 1.35.0 && cargo build -p day-6 --release
measure APP_PART=one APP_INPUT_FILE=day-6/full.txt target/release/day-6, APP_PART=two APP_INPUT_FILE=day-6/full.txt APP_MAX_DISTANCE=10000 target/release/day-6. I used ipython for this, for example:
1. import os
2. %timeit os.system('APP_PART=one APP_INPUT_FILE=day-6/full.txt target/release/day-6 > /dev/null')
3. %timeit os.system('APP_PART=two APP_INPUT_FILE=day-6/full.txt APP_MAX_DISTANCE=10000 target/release/day-6 > /dev/null')
rustup override set 1.36.0 && cargo build -p day-6 --release
repeat step 3 measurements

My system specs

Ubuntu 18.04 x86-64
Intel Core i7-8750H (2.2GHz / 4.1GHz Turbo), cpu governor set to performance for time estimation purposes

Results 1.35.0:

part one: 15.8 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
part two: 11.7 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1.36.0:

part one: 23 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
part two: 22.6 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Main logic is in this file: https://github.com/utter-step/advent-2018/blob/master/day-6/src/world/mod.rs, it's a solution to Advent of Code 2018 day 6 problem.

Part one is using HashMap::iter_mut a lot: https://github.com/utter-step/advent-2018/blob/master/day-6/src/world/mod.rs#L46 Part two is using HashMap::keys a lot: https://github.com/utter-step/advent-2018/blob/master/day-6/src/world/mod.rs#L83

Meta 1.35.0:

rustc --version
rustc 1.35.0 (3c235d560 2019-05-20)

1.36.0:

rustc --version
rustc 1.36.0 (a53f9df32 2019-07-03)

Hope that info I provided would be useful!

Vlad

utterstep commented 5 years ago

Extracting keys to vec before iteration solves the problem for part two (and reduces effect for the first part), so I'd assume that iteration over HashMap is significantly slower in 1.36 than it was in 1.35

snippet 1:

self.boundaries
    .into_iter()
    .filter(|(x, y)| {
        self.points
            .keys()
            .map(|point| point.distance_to(*x, *y))
            .sum::<i32>()
            < max_distance
    })
    .count()

1.35.0: 11.7 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 1.36.0: 22.6 ms ± 156 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

1.93x time ratio

snippet 2:

// Point is struct { x: i32, y: i32 } with derived Hash
// self.points is HashMap<Point, i32>
let points = self.points.keys().collect::<Vec<_>>();

self.boundaries
    .into_iter()
    .filter(|(x, y)| {
        points
            .iter()
            .map(|point| point.distance_to(*x, *y))
            .sum::<i32>()
            < max_distance
    })
    .count()

1.35.0: 11 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 1.36.0: 9.74 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

0.88x time ratio

snippet 2 is actually faster in 1.36, so it may be the case, that other HashMap operations is faster in 1.36 (as expected), but iteration over values is significantly slower

Same goes for part one (iter_mut extraction solves the regression):

snippet 3:

for (x, y) in &self.boundaries {
    let mut min_distance = i32::max_value();
    let mut closest = None;

    for (point, area) in self.points.iter_mut() {
        ...
    }
    ...

1.35.0: 15.8 ms ± 50.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 1.36.0 23 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

1.45x time ratio

snippet 4:

// Point is struct { x: i32, y: i32 } with derived Hash
// self.points is HashMap<Point, i32>
let mut points = self.points.iter_mut().collect::<Vec<_>>();

for (x, y) in &self.boundaries {
    let mut min_distance = i32::max_value();
    let mut closest = None;

    for (point, area) in points.iter_mut() {
        ...
    }
    ...

1.35.0: 12.9 ms ± 50.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 1.36.0: 14.8 ms ± 24.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1.14x time ratio

RalfJung commented 5 years ago

Cc @Amanieu for hashbrown

utterstep commented 5 years ago

Here is reduced reproduction of the case:

use std::collections::HashSet;

const N_ITEMS: usize = 50;
const N_REPEATS: usize = 10000001;

fn main() {
    let mut set = HashSet::new();

    for i in 0..N_ITEMS {
        set.insert(i ^ 13);
    }

    let mut res = 0;

    for _ in 0..N_REPEATS {
        for item in set.iter() {
            res ^= item;
        }
    }

    println!("sum: {}", res);
}

1.35.0:

utterstep@utterstep-nix:~/my/playground/hashset-perf-test$ /usr/bin/time -v ./hashset-perf-test-1.35 > /dev/null
        Command being timed: "./hashset-perf-test-1.35"
        User time (seconds): 0.41
        System time (seconds): 0.00
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.41
        ...

1.36.0:

utterstep@utterstep-nix:~/my/playground/hashset-perf-test$ /usr/bin/time -v ./hashset-perf-test-1.36 > /dev/null
        Command being timed: "./hashset-perf-test-1.36"
        User time (seconds): 0.76
        System time (seconds): 0.00
        Percent of CPU this job got: 100%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.76
        ...

Amanieu commented 5 years ago

HashMap iteration is slower for small sizes, but faster for larger sizes (e.g. 10k elements).

Amanieu commented 3 years ago

Closing in favor of https://github.com/rust-lang/hashbrown/issues/278.

rust-lang / rust

1.65x performance regression from 1.35.0 to 1.36.0 #62439