rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust

Why are the sub and mul operations of f32x4 slower than the scalar version? #426

Open · Jad2wizard opened this issue 4 months ago

Jad2wizard commented 4 months ago
// Costs about 13 ms when data.len() is 40 million (4000w)
fn sum_scalar(data: &[f32]) -> f32 {
    let mut sum = 0.0;
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        sum += (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
    sum
}

use std::simd::f32x4;
use std::simd::num::SimdFloat; // for reduce_sum; needs nightly with #![feature(portable_simd)]

// Costs about 16 ms when data.len() is 40 million (4000w)
fn sum_simd(data: &[f32]) -> f32 {
    let pos = f32x4::load_or_default(&vec![2000.0, 2000.0, 2000.0]);
    let dir = f32x4::load_or_default(&vec![0.8, 0.6, 0.0]);

    let mut sum = 0.0;
    for i in 0..data.len() / 4 {
        let values = f32x4::from_array([
            data[i * 4 + 0],
            data[i * 4 + 1],
            data[i * 4 + 2],
            data[i * 4 + 3],
        ]);
        sum += ((values - pos) * dir).reduce_sum();
    }

    sum
}

CPU: Intel(R) Core(TM) i5-14400F @ 2.50 GHz
rustc: 1.81.0-nightly (c1b336cb6 2024-06-21)

jhorstmann commented 4 months ago

SIMD instruction sets are optimized for operating on independent lanes, which makes the horizontal reduce_sum operation relatively expensive. You can get better performance by using an f32x4 accumulator for the sum and doing only a single horizontal reduction at the end.

Note that the compiler is not able to do this optimization for you, since float operations are not associative.
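
A minimal sketch of that approach (assuming a nightly toolchain with #![feature(portable_simd)]; the function name is only illustrative):

use std::simd::f32x4;
use std::simd::num::SimdFloat; // for reduce_sum

fn sum_simd_accum(data: &[f32]) -> f32 {
    let pos = f32x4::from_array([2000.0, 2000.0, 2000.0, 0.0]);
    let dir = f32x4::from_array([0.8, 0.6, 0.0, 0.0]);

    // Keep the partial sums in a vector; every lane stays independent
    // inside the loop.
    let mut acc = f32x4::splat(0.0);
    for chunk in data.chunks_exact(4) {
        let values = f32x4::from_slice(chunk);
        acc += (values - pos) * dir;
    }
    // A single horizontal reduction, after the loop.
    acc.reduce_sum()
}

Reducing once at the end computes the same value as adding up the per-iteration reduce_sum results; only the floating-point rounding order changes, which is exactly the reordering the compiler is not allowed to do on its own.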

On a nightly Rust version you could also look into the fadd_algebraic / fsub_algebraic / fmul_algebraic intrinsics, which allow the compiler to reorder and auto-vectorize scalar floating-point operations.
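
A minimal sketch of the scalar sum written with those intrinsics (nightly only, behind #![feature(core_intrinsics)]; an illustration, not code from this thread):

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic, fsub_algebraic};

fn sum_scalar_algebraic(data: &[f32]) -> f32 {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    let mut sum = 0.0f32;
    for chunk in data.chunks_exact(4) {
        let dot = fadd_algebraic(
            fadd_algebraic(
                fmul_algebraic(fsub_algebraic(chunk[0], pos[0]), dir[0]),
                fmul_algebraic(fsub_algebraic(chunk[1], pos[1]), dir[1]),
            ),
            fmul_algebraic(fsub_algebraic(chunk[2], pos[2]), dir[2]),
        );
        // Accumulating through fadd_algebraic is what allows the compiler
        // to reorder the additions and auto-vectorize the reduction.
        sum = fadd_algebraic(sum, dot);
    }
    sum
}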

Jad2wizard commented 4 months ago

@jhorstmann Thank you so much. After replacing the - and * in sum_scalar with the fsub and fmul intrinsics, the runtime of sum_scalar decreased by about 20%. But things started to get weird when I changed the code to:

fn sum_scalar(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        res[i] = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}

// Needs nightly with #![feature(core_intrinsics)]
use std::intrinsics::{fmul_fast, fsub_fast};

unsafe fn sum_fast(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        *res.get_unchecked_mut(i) = fmul_fast(fsub_fast(x, pos[0]), dir[0])
                + fmul_fast(fsub_fast(y, pos[1]), dir[1])
                + fmul_fast(fsub_fast(z, pos[2]), dir[2])
    }
}

the runtimes of sum_scalar and sum_fast become almost equal. I am a beginner in Rust, so these results look weird to me.

jhorstmann commented 4 months ago

@Jad2wizard very good question! I see two possible reasons for these having the same performance:

- Bounds checks: sum_fast still indexes data with data[i * 4 + 0] and friends, so both versions pay for bounds checks inside the loop, and those checks can stop the compiler from auto-vectorizing it.
- The data layout: every iteration loads four consecutive floats but only uses three of them, so the memory access pattern limits what either version can gain from vectorization.

Without the data layout changes, this is how I would write the function to avoid the bounds checks:

// Needs nightly with #![feature(core_intrinsics)]
use std::intrinsics::{fmul_algebraic, fsub_algebraic};

pub fn sum_fast_no_bound_checks(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for (r, chunk) in res.iter_mut().zip(data.chunks_exact(4)) {
        let x = chunk[0];
        let y = chunk[1];
        let z = chunk[2];
        *r = fmul_algebraic(fsub_algebraic(x, pos[0]), dir[0])
                + fmul_algebraic(fsub_algebraic(y, pos[1]), dir[1])
                + fmul_algebraic(fsub_algebraic(z, pos[2]), dir[2])
    }
}

With chunks_exact, the compiler knows that the indices into each chunk are in bounds, and zip ensures that the element of res corresponding to each chunk is also in bounds.
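
For reference, a sketch of what the data layout change could look like, assuming it means splitting the interleaved x/y/z/w values into separate slices (structure-of-arrays) so every lane does useful work; the names and signature here are illustrative, not from this thread:

use std::simd::f32x4;

// Hypothetical planar layout: x and y each in their own slice.
// dir[2] is 0.0 in the example, so z never contributes and its slice
// does not even need to be read.
fn sum_planar(xs: &[f32], ys: &[f32], res: &mut [f32]) {
    let pos = f32x4::splat(2000.0);
    for ((r, x), y) in res
        .chunks_exact_mut(4)
        .zip(xs.chunks_exact(4))
        .zip(ys.chunks_exact(4))
    {
        let x = f32x4::from_slice(x);
        let y = f32x4::from_slice(y);
        // Four results per iteration, with no shuffling and no wasted lane.
        let v = (x - pos) * f32x4::splat(0.8) + (y - pos) * f32x4::splat(0.6);
        v.copy_to_slice(r);
    }
}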

Jad2wizard commented 4 months ago

Thank you @jhorstmann, I will try changing the data layout as you mentioned.