Jad2wizard opened this issue 4 months ago
SIMD instruction sets are optimized for operating on independent lanes. This makes the `reduce_sum` operation relatively expensive. You can get better performance by using an `f32x4` for the sum and then only doing a single horizontal reduction at the end. Note that the compiler is not able to do this optimization for you, since float operations are not associative.
On a nightly Rust version you could also look into the `fadd_algebraic` / `fsub_algebraic` / `fmul_algebraic` intrinsics, which allow the compiler to reorder and auto-vectorize scalar floating point operations.
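For illustration, a minimal sketch of that idea using nightly's portable SIMD (`std::simd`); the function name and the `chunks_exact` remainder handling are illustrative, not taken from this thread:

```rust
// Sketch only: keep four independent partial sums and reduce once at the end.
#![feature(portable_simd)]
use std::simd::prelude::*;

pub fn sum_lanes(data: &[f32]) -> f32 {
    let mut acc = f32x4::splat(0.0);
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        // each lane accumulates independently, no horizontal work per iteration
        acc += f32x4::from_slice(chunk);
    }
    // single horizontal reduction, plus the scalar tail
    acc.reduce_sum() + chunks.remainder().iter().sum::<f32>()
}
```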
@jhorstmann Thank you so much. After replacing the `-` and `*` in `sum_scalar` with `fsub_fast` and `fmul_fast`, the runtime of `sum_scalar` decreased by about 20%. But things start to get weird when I change the code to
```rust
// nightly only: fsub_fast / fmul_fast come from core_intrinsics
#![feature(core_intrinsics)]
use std::intrinsics::{fmul_fast, fsub_fast};

fn sum_scalar(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        res[i] = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}

unsafe fn sum_fast(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        *res.get_unchecked_mut(i) = fmul_fast(fsub_fast(x, pos[0]), dir[0])
            + fmul_fast(fsub_fast(y, pos[1]), dir[1])
            + fmul_fast(fsub_fast(z, pos[2]), dir[2]);
    }
}
```
the runtime of `sum_scalar` and `sum_fast` becomes almost equal. I am a beginner in Rust, so the results are weird to me.
@Jad2wizard very good question! I see two possible reasons for these having the same performance:

The compiler is not able to remove the bounds checks when accessing `data`. This means it still processes a single `res` value per loop iteration, and the overhead of the bounds checks masks any performance difference between the intrinsics and normal arithmetic.

The other issue is related to the data layout. For each `res` value, the code needs to read `data` at 3 different indices. If the compiler wants to vectorize this to produce 4 results at a time, for example, each of the 4 lanes of `x`, `y` and `z` would need to be loaded from a different, non-consecutive address. Modern CPUs have instructions for that (gather loads), but they are much slower than loading data that is consecutive in memory. Fixing this would require changing your data structures to have separate slices for `x`, `y` and `z`, as in the sketch below.
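If the layout change is an option, a minimal sketch of the split-slice (struct-of-arrays) version might look like this; the function name and signature are illustrative, not from the thread, and plain `-`/`*` arithmetic is enough here because each output element is independent:

```rust
// Sketch of a struct-of-arrays layout: x, y and z each live in their own slice,
// so vectorized code can use consecutive loads instead of gathers.
pub fn sum_soa(xs: &[f32], ys: &[f32], zs: &[f32], res: &mut [f32]) {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    for (((r, &x), &y), &z) in res.iter_mut().zip(xs).zip(ys).zip(zs) {
        *r = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}
```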
Without the data layout changes, this is how I would write the function to avoid the bounds checks:
```rust
// nightly only: the *_algebraic intrinsics require #![feature(core_intrinsics)]
use std::intrinsics::{fmul_algebraic, fsub_algebraic};

pub fn sum_fast_no_bound_checks(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for (r, chunk) in res.iter_mut().zip(data.chunks_exact(4)) {
        let x = chunk[0];
        let y = chunk[1];
        let z = chunk[2];
        *r = fmul_algebraic(fsub_algebraic(x, pos[0]), dir[0])
            + fmul_algebraic(fsub_algebraic(y, pos[1]), dir[1])
            + fmul_algebraic(fsub_algebraic(z, pos[2]), dir[2]);
    }
}
```
With `chunks_exact` the compiler knows that the indices into each chunk are in bounds, and `zip` ensures that the `res` element corresponding to each chunk is also in bounds.
Thank you @jhorstmann, I will try changing the data layout as you mentioned.
CPU: Intel(R) Core(TM) i5-14400F @ 2.50 GHz, rustc 1.81.0-nightly (c1b336cb6 2024-06-21)