Using izip is slower than zipping 3 times

RocketRide9 commented 2 months ago

Code:

use stopwatch::Stopwatch;
use itertools::izip;

const SIZE: usize = 1e8 as usize;

fn main() {
    let mut arr1:Vec<f64> = Vec::new();
    arr1.resize(SIZE, 0.);

    let mut arr2:Vec<f64> = Vec::new();
    arr2.resize(SIZE, 0.);

    let mut arr3:Vec<f64> = Vec::new();
    arr3.resize(SIZE, 0.);

    for (idx, (a1, a2)) in arr1.iter_mut().zip(arr2.iter_mut()).enumerate() {
        *a1 = (idx%10 + 3) as f64;
        *a2 = (idx%10 + 7) as f64;
    }

    let sw = Stopwatch::start_new();
    for i in 0..SIZE {
        arr3[i] = arr1[i] * arr2[i];
    }
    let elapsed = sw.elapsed_ms();

    println!("Thing took {} ms arr3 = {}", elapsed, arr3[20]);

    let sw = Stopwatch::start_new();
    arr3.iter_mut().zip(arr1.iter()).zip(arr2.iter()).for_each(|((a, &a1), &a2)| {*a = a1 * a2;});
    let elapsed = sw.elapsed_ms();

    println!("Thing took {} ms arr3 = {}", elapsed, arr3[20]);

    let sw = Stopwatch::start_new();
    izip!(arr3.iter_mut(), arr1.iter(), arr2.iter()).for_each(|(a, &a1, &a2)| {*a = a1 * a2;});
    let elapsed = sw.elapsed_ms();

    println!("Thing took {} ms arr3 = {}", elapsed, arr3[20]);
}

Test runs:

sh-5.2$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/sketches`
Thing took 1896 ms arr3 = 21
Thing took 2128 ms arr3 = 21 // manual
Thing took 2888 ms arr3 = 21 // izip
sh-5.2$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/sketches`
Thing took 1769 ms arr3 = 21
Thing took 2013 ms arr3 = 21 // manual
Thing took 2881 ms arr3 = 21 // izip
sh-5.2$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/sketches`
Thing took 1804 ms arr3 = 21
Thing took 1981 ms arr3 = 21 // manual
Thing took 2898 ms arr3 = 21 // izip

phimuemue commented 2 months ago

It seems that you are running unoptimized builds.

Can you re-run with --release and post the results?

RocketRide9 commented 2 months ago

@phimuemue

sh-5.2$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/sketches`
Thing took 134 ms arr3 = 21
Thing took 144 ms arr3 = 21
Thing took 168 ms arr3 = 21
sh-5.2$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/sketches`
Thing took 130 ms arr3 = 21
Thing took 142 ms arr3 = 21
Thing took 143 ms arr3 = 21
sh-5.2$ cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/sketches`
Thing took 134 ms arr3 = 21
Thing took 143 ms arr3 = 21
Thing took 147 ms arr3 = 21

scottmcm commented 2 months ago

TBF, I don't trust millisecond-level differences from a "run once with stopwatch" perf.

Please demonstrate with something like criterion that there's a statistically-significant difference here, or show that it optimizes differently.
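A minimal std-only sketch of the kind of repeated measurement asked for here. It is not a substitute for criterion (no outlier handling, no significance test); it only shows taking many samples instead of one. The helpers `time_runs` and `median`, the reduced `SIZE` of 1,000,000 and the 20 runs are assumptions made for illustration, not something from the issue.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical helper: calls `f` `runs` times and returns the per-run
// wall-clock durations, sorted ascending.
fn time_runs<F: FnMut()>(runs: usize, mut f: F) -> Vec<Duration> {
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples
}

// Hypothetical helper: middle element of an already sorted, non-empty slice.
fn median(sorted: &[Duration]) -> Duration {
    sorted[sorted.len() / 2]
}

fn main() {
    // Assumed sizes, smaller than the 1e8 in the issue so 20 runs stay short.
    const SIZE: usize = 1_000_000;
    const RUNS: usize = 20;

    let arr1: Vec<f64> = (0..SIZE).map(|i| (i % 10 + 3) as f64).collect();
    let arr2: Vec<f64> = (0..SIZE).map(|i| (i % 10 + 7) as f64).collect();
    let mut arr3 = vec![0.0_f64; SIZE];

    // Nested tuples, as produced by chaining `zip` by hand.
    let nested = time_runs(RUNS, || {
        arr3.iter_mut()
            .zip(arr1.iter())
            .zip(arr2.iter())
            .for_each(|((a, &a1), &a2)| *a = a1 * a2);
        // Keep the optimizer from discarding the writes.
        let _ = black_box(&arr3);
    });

    // Same chain with the extra flattening `map` in front of `for_each`.
    let flattened = time_runs(RUNS, || {
        arr3.iter_mut()
            .zip(arr1.iter())
            .zip(arr2.iter())
            .map(|((a, a1), a2)| (a, a1, a2))
            .for_each(|(a, &a1, &a2)| *a = a1 * a2);
        let _ = black_box(&arr3);
    });

    println!("nested:    min {:?}, median {:?}", nested[0], median(&nested));
    println!("flattened: min {:?}, median {:?}", flattened[0], median(&flattened));
}
```

Comparing the minimum and median over many runs is less sensitive to one-off noise than a single stopwatch reading, but a proper benchmark harness is still the better tool for a real claim.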

RocketRide9 commented 2 months ago

The optimized version looks good, no? The difference in time is comparable to measurement error. The issue is that in the debug version izip runs around 0.7 seconds slower than the manual zip. Speed in the debug version matters too, doesn't it?

Philippe-Cholet commented 2 months ago

izip!(a, b, c) expands to a.into_iter().zip(b).zip(c).map(|((x, y), z)| (x, y, z)).

So the difference, as I see it, is .map(|((x, y), z)| (x, y, z)).for_each(|(x, y, z)| ...) vs .for_each(|((x, y), z)| ...). If there is a difference, it is probably subtle (maybe less subtle in debug mode) and, more importantly, we can't do much about it because we merely rely on libcore.
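To make the two shapes concrete, here is a small std-only sketch that writes both out by hand and checks that they compute the same thing. The function names `multiply_nested` and `multiply_flattened` and the length of 32 are made up for illustration; the flattened version spells out the extra `map` manually instead of calling `izip!`, so it runs without the itertools dependency.

```rust
// Nested form: tuples come out of the chained `zip` as ((o, x), y).
fn multiply_nested(out: &mut [f64], a: &[f64], b: &[f64]) {
    out.iter_mut()
        .zip(a.iter())
        .zip(b.iter())
        .for_each(|((o, &x), &y)| *o = x * y);
}

// Flattened form: the same chain plus one `map` that turns ((o, x), y)
// into (o, x, y) before `for_each` sees it.
fn multiply_flattened(out: &mut [f64], a: &[f64], b: &[f64]) {
    out.iter_mut()
        .zip(a.iter())
        .zip(b.iter())
        .map(|((o, x), y)| (o, x, y))
        .for_each(|(o, &x, &y)| *o = x * y);
}

fn main() {
    // Same fill pattern as the benchmark in the issue, on a tiny input.
    let a: Vec<f64> = (0..32).map(|i| (i % 10 + 3) as f64).collect();
    let b: Vec<f64> = (0..32).map(|i| (i % 10 + 7) as f64).collect();

    let mut nested = vec![0.0_f64; 32];
    let mut flattened = vec![0.0_f64; 32];
    multiply_nested(&mut nested, &a, &b);
    multiply_flattened(&mut flattened, &a, &b);

    // Both shapes produce identical results; only the tuple layout differs.
    assert_eq!(nested, flattened);
    println!("arr3[20] = {}", nested[20]); // prints "arr3[20] = 21"
}
```

The only extra work in the flattened version is one closure call per element, which an optimizing build can typically inline away and an unoptimized build will not.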

scottmcm commented 2 months ago

> Speed in the debug version matters too, doesn't it?

To be frank, no it doesn't. The default debug config doesn't even attempt to produce reasonable machine code.

If you want your debug config to have non-terrible performance, I recommend setting opt-level=1 for it. That's not much slower to compile, and it's way faster at runtime. The difference between "don't even try" and "just do the easy stuff" is massive.
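For reference, a sketch of where that setting would normally go, assuming a standard Cargo project: the `[profile.dev]` table in the crate's Cargo.toml, which is the profile a plain `cargo run` uses.

```toml
# Cargo.toml (fragment)
# Apply basic optimizations to the dev profile used by `cargo run` / `cargo build`.
[profile.dev]
opt-level = 1
```

With this in place, Cargo reports the dev profile as `[optimized + debuginfo]` instead of `[unoptimized + debuginfo]`.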

RocketRide9 commented 2 months ago

> If you want your debug config to have non-terrible performance, I recommend setting opt-level=1 for it.

That's way better:

sh-5.2$ cargo run
    Finished dev [optimized + debuginfo] target(s) in 0.01s
     Running `target/debug/sketches`
Thing took 157 ms arr3 = 21
Thing took 173 ms arr3 = 21
Thing took 173 ms arr3 = 21
sh-5.2$ cargo run
    Finished dev [optimized + debuginfo] target(s) in 0.01s
     Running `target/debug/sketches`
Thing took 153 ms arr3 = 21
Thing took 177 ms arr3 = 21
Thing took 177 ms arr3 = 21
sh-5.2$ cargo run
    Finished dev [optimized + debuginfo] target(s) in 0.01s
     Running `target/debug/sketches`
Thing took 154 ms arr3 = 21
Thing took 175 ms arr3 = 21
Thing took 175 ms arr3 = 21

It's interesting that the first multiplication is always slightly faster than the others, even after changing the order. I copied the first loop and pasted it after the third one (which uses izip), and there it runs at the same speed as the 2nd and 3rd.

Anyway, if the default debug config in Rust is expected to be this slow, I think this issue can be closed?

rust-itertools / itertools

Using izip is slower than zipping 3 times #926