Jad2wizard opened this issue 4 months ago
SIMD instruction sets are optimized for operating on independent lanes. This makes the `reduce_sum` operation relatively expensive. You can get better performance by using an `f32x4` for the sum and then only doing a single horizontal reduction at the end. Note that the compiler is not able to do this optimization for you, since float operations are not associative.
On a nightly Rust version you could also look into the `fadd_algebraic` / `fsub_algebraic` / `fmul_algebraic` intrinsics, which allow the compiler to reorder and auto-vectorize scalar floating point operations.
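For illustration, a minimal sketch of that idea using nightly's portable SIMD (`std::simd`); the function name and the `chunks_exact` remainder handling are illustrative, not taken from this thread:

```rust
// Sketch only: keep four independent partial sums and reduce once at the end.
#![feature(portable_simd)]
use std::simd::prelude::*;

pub fn sum_lanes(data: &[f32]) -> f32 {
    let mut acc = f32x4::splat(0.0);
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        // each lane accumulates independently, no horizontal work per iteration
        acc += f32x4::from_slice(chunk);
    }
    // single horizontal reduction, plus the scalar tail
    acc.reduce_sum() + chunks.remainder().iter().sum::<f32>()
}
```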
@jhorstmann Thank you so much. After replacing the `-` and `*` in `sum_scalar` with `fsub_fast` and `fmul_fast`, the runtime of `sum_scalar` decreased by about 20%. But things start to get weird when I change the code to
```rust
// nightly only: fsub_fast / fmul_fast come from core_intrinsics
#![feature(core_intrinsics)]
use std::intrinsics::{fmul_fast, fsub_fast};

fn sum_scalar(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        res[i] = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}

unsafe fn sum_fast(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for i in 0..data.len() / 4 {
        let x = data[i * 4 + 0];
        let y = data[i * 4 + 1];
        let z = data[i * 4 + 2];
        *res.get_unchecked_mut(i) = fmul_fast(fsub_fast(x, pos[0]), dir[0])
            + fmul_fast(fsub_fast(y, pos[1]), dir[1])
            + fmul_fast(fsub_fast(z, pos[2]), dir[2]);
    }
}
```
the runtime of `sum_scalar` and `sum_fast` becomes almost equal. I am a beginner in Rust, so the results are weird to me.
@Jad2wizard very good question! I see two possible reasons for these having the same performance:

The compiler is not able to remove the bounds checks when accessing `data`. This means it still processes a single `res` value per loop iteration, and the overhead of the bounds checks masks any performance difference between the intrinsics and normal arithmetic.

The other issue is related to the data layout. For each `res` value, the code needs to read `data` at 3 different indices. If the compiler wants to vectorize this to produce 4 results at a time, for example, each of the 4 lanes of `x`, `y` and `z` would need to be loaded from a different, non-consecutive address. Modern CPUs have instructions for that (gather loads), but they are much slower than loading data that is consecutive in memory. Fixing this would require changing your data structures to have separate slices for `x`, `y` and `z`, as in the sketch below.
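If the layout change is an option, a minimal sketch of the split-slice (struct-of-arrays) version might look like this; the function name and signature are illustrative, not from the thread, and plain `-`/`*` arithmetic is enough here because each output element is independent:

```rust
// Sketch of a struct-of-arrays layout: x, y and z each live in their own slice,
// so vectorized code can use consecutive loads instead of gathers.
pub fn sum_soa(xs: &[f32], ys: &[f32], zs: &[f32], res: &mut [f32]) {
    let pos = [2000.0f32, 2000.0, 2000.0];
    let dir = [0.8f32, 0.6, 0.0];
    for (((r, &x), &y), &z) in res.iter_mut().zip(xs).zip(ys).zip(zs) {
        *r = (x - pos[0]) * dir[0] + (y - pos[1]) * dir[1] + (z - pos[2]) * dir[2];
    }
}
```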
Without the data layout changes, this is how I would write the function to avoid the bounds checks:
```rust
// nightly only: the *_algebraic intrinsics require #![feature(core_intrinsics)]
use std::intrinsics::{fmul_algebraic, fsub_algebraic};

pub fn sum_fast_no_bound_checks(data: &[f32], res: &mut [f32]) {
    let pos = vec![2000.0, 2000.0, 2000.0];
    let dir = vec![0.8, 0.6, 0.0];
    for (r, chunk) in res.iter_mut().zip(data.chunks_exact(4)) {
        let x = chunk[0];
        let y = chunk[1];
        let z = chunk[2];
        *r = fmul_algebraic(fsub_algebraic(x, pos[0]), dir[0])
            + fmul_algebraic(fsub_algebraic(y, pos[1]), dir[1])
            + fmul_algebraic(fsub_algebraic(z, pos[2]), dir[2]);
    }
}
```
With `chunks_exact` the compiler knows that the indices into each chunk are in bounds, and `zip` ensures that the `res` element corresponding to each chunk is also in bounds.
Thank you @jhorstmann, I will try changing the data layout as you mentioned.
CPU: Intel(R) Core(TM) i5-14400F @ 2.50 GHz, rustc 1.81.0-nightly (c1b336cb6 2024-06-21)