Parallizing nested loops using indices

Hi! Not sure if this is the right place to ask for help but I am trying to port a C HPC code (molecular dynamics) to Rust and I would like to use rayon to translate the OpenMP parallelization I have implemented in the C version.

For reference, you can find the (simplified) C implementation hereafter:

```c typedef struct { double x; double y; double z; } vec3; #pragma omp parallel for for (size_t i = 0; i < nb_particles; ++i) { for (size_t j = i + 1; j < nb_particles; ++j) { double distance = pow(R_STAR, 2) * vec3_distance_square(self.pos[i], self.pos[j]); vec3 du_ij = vec3_splat(-48.0 * EPSILON_STAR * pow(distance, 7) - pow(distance, 4)); // Compute force exerted on particle `i` vec3 force = vec3_mul(du_ij, vec3_sub(self.pos[i], self.pos[j]); vec3_add_assign(self.forces[i], force); // Force exerted on particle `j` is the opposite of the force on `i` vec3_sub_assign(self.forces[j], force); // Update energy of the particle system self.energy += pow(distance, 6) -2.0 * pow(distance, 3); } } ```

The overall approach is similar to an N-body problem, where we use two nested loops to iterate over all the particles and compute their effects on each other. I find that, in Rust, the algorithm is thus more easily expressed using indices rather than elements, very similar to the C version. The following is my first shot at the Rust port (without parallelization):

for i in 0..nb_particles {
    for j in (i + 1)..nb_particles {
        let distance = R_STAR.powi(2) / self.pos[i].distance_square(self.pos[j]);
        let du_ij = Vec3::splat(-48.0 * EPSILON_STAR * (distance.powi(7) - distance.powi(4)));

        // Compute force exerted on particle `i`
        let force = du_ij * (self.pos[i] - self.pos[j]);
        self.forces[i] += force;
        // Force exerted on particle `j` is the opposite of the force on `i`
        self.forces[j] -= force;

        // Update energy of the particle system
        self.energy += dist.powi(6) - 2.0 * dist.powi(3);
    }
}

Using rayon, I tried changing the outer loop to:

(0..nb_particles).into_par_iter().for_each(|i| {
    // -- snip --
});

However, this leads - with no surprise - to the compiler complaining about the variables self.forces and self.energy being captured by the rayon closure, which prevents them to be mutably borrowed.

error[E0596]: cannot borrow `*self.forces` as mutable, as it is a captured variable in a `Fn` closure
  --> src/particle_system.rs:61:17
   |
61 |                 self.forces[i] += force;
   |                 ^^^^^^^^^^^ cannot borrow as mutable

error[E0596]: cannot borrow `*self.forces` as mutable, as it is a captured variable in a `Fn` closure
  --> src/particle_system.rs:62:17
   |
62 |                 self.forces[j] += -force;
   |                 ^^^^^^^^^^^ cannot borrow as mutable

error[E0594]: cannot assign to `*self.energy`, as it is a captured variable in a `Fn` closure
  --> src/particle_system.rs:63:17
   |
63 |                 self.energy += dist.powi(6) - 2.0 * dist.powi(3);
   |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot assign

My questions are: how could this be correctly parallelized using rayon? Is there a more idiomatic way to express these nested loops in Rust (that might also fix my problem)?

I have tried using iterators to process the elements directly but this makes the borrow checker complain a lot about it (and I don't think that fighting with it is the right way to approach this). I also don't want to rely on a Mutex to be able to update the self.forces vector as it is not needed in the original C implementation and would probably hurt performance (which is critical).
Regarding updating self.energy, OpenMP should use thread local variables to accumulate the energy and reduce the threads' local result in the global variable to avoid relying on a mutex or atomic operations. Is there a way to express this in Rust with rayon?

Thanks in advance! I hope that Rust will have a bright future in the field of HPC :slightly_smiling_face:

In the C version, isn't there a data race in the writes to forces[i] and forces[j]? If not, how is that synchronized?

I have tried using iterators to process the elements directly but this makes the borrow checker complain a lot about it (and I don't think that fighting with it is the right way to approach this). I also don't want to rely on a Mutex to be able to update the self.forces vector as it is not needed in the original C implementation and would probably hurt performance (which is critical).

The easiest way to do mutability in rayon is to make that the focus of splitting, like self.forces.par_iter_mut(). You can enumerate that to find the matching i, and then read-only access to self.pos should be fine however you like.

Regarding updating self.energy, OpenMP should use thread local variables to accumulate the energy and reduce the threads' local result in the global variable to avoid relying on a mutex or atomic operations. Is there a way to express this in Rust with rayon?

Rayon is at a bit of a disadvantage here because it's not integrated with the compiler like OpenMP. We have no idea what is in your closures, and the compiler has no idea what rayon wants to do with it, but it's all held together by the Fn + Send + Sync constraints.

But you can still make it work -- I would compute that energy in a map and then use a parallel sum.

Something like:

self.energy += self.forces.par_iter_mut().enumerate().map(|(i, force_i)| {
    let mut energy = 0.0;
    for j in 0..nb_particles {
        if i == j { continue; }

        let distance = R_STAR.powi(2) / self.pos[i].distance_square(self.pos[j]);
        let du_ij = Vec3::splat(-48.0 * EPSILON_STAR * (distance.powi(7) - distance.powi(4)));

        // Compute force exerted on particle `i`
        *force_i += du_ij * (self.pos[i] - self.pos[j]);

        // Update energy of the particle system
        // (but only for i < j so we don't double-count)
        if i < j {
            energy += dist.powi(6) - 2.0 * dist.powi(3);
        }
    }
    energy
}).sum();

The downside is that this can't just negate the opposite force, so that's computed separately in each direction.

If that's expensive, then you could also trade memory for this -- compute the entire triangle of new_force[i][j] (for i < j) in one parallel step, and then do a parallel update of self.force where you add or subtract the new forces as needed. But O(n²) memory might well be worse than the double computation. :)

Another possibility is to synchronize the forces updates with atomics, but it would require some type punning. If these are f64, then you can cast them as u64 with bytemuck::cast_slice_mut. Then AtomicU64::from_mut_slice converts that to an atomic slice -- that method is still unstable, but it's just a pointer cast as long as the alignment matches. Then you can use fetch_update where you cast f64::from_bits, add or subtract the new force, and cast back to_bits(). This will be a lot of memory synchronization, but not as much as a Mutex at least.

I hope these ideas help!

Thanks for your help!

In the C version, isn't there a data race in the writes to forces[i] and forces[j]? If not, how is that synchronized?

Indeed, there is one and I realized it while writing the Rust code! I must have been lucky because I never got different results between the sequential and parallel versions.

I was able to refactor my Rust code with the example you gave and it worked great. It even achieved a 1.5x speedup over the C version, without any data race! :)

The downside is that this can't just negate the opposite force, so that's computed separately in each direction. If that's expensive, then you could also trade memory for this -- compute the entire triangle of new_force[i][j] (for i < j) in one parallel step, and then do a parallel update of self.force where you add or subtract the new forces as needed. But O(n²) memory might well be worse than the double computation.

I want to try this approach to see if I can get a bit more performance out of it. Should I use a ndarray for this? Also, how can I correctly update the self.forces vector using Rayon? I am not sure how I am supposed to do it. Here is the code I came up with for now:

let mut new_forces = vec![Vec3::zero(); nb_particles * nb_particles];

self.energy += new_forces
    .par_iter_mut()
    .enumerate()
    .map(|(i, new_force_i)| {
        let mut energy = 0.0;
        for j in (i + 1)..nb_particles {
            let dist = R_STAR.powi(2) / self.pos[i].distance_square(self.pos[j]);
            let du_ij = Vec3::splat(-48.0 * EPSILON_STAR * (dist.powi(7) - dist.powi(4)));

            *new_force_i += du_ij * (self.pos[i] - self.pos[j]);

            // Update energy of the particle system
            // (but only for i < j so we don't double-count)
            if i < j {
                energy += dist.powi(6) - 2.0 * dist.powi(3);
            }
        }
        energy
    })
    .sum::<f64>();

I was able to refactor my Rust code with the example you gave and it worked great. It even achieved a 1.5x speedup over the C version, without any data race! :)

Hooray for fearless concurrency!

The 2D indexing might look nicer with ndarray, but you can make do without, something like this:

let mut new_forces = vec![Vec3::zero(); nb_particles * nb_particles];

self.energy += new_forces
    .par_chunks_exact_mut(nb_particles) // rows
    .enumerate()
    .map(|(i, new_force_i)| {
        let mut energy = 0.0;
        for j in (i + 1)..nb_particles {
            let dist = R_STAR.powi(2) / self.pos[i].distance_square(self.pos[j]);
            let du_ij = Vec3::splat(-48.0 * EPSILON_STAR * (dist.powi(7) - dist.powi(4)));

            new_force_i[j] += du_ij * (self.pos[i] - self.pos[j]);

            // Update energy of the particle system
            // (no double-counting worry since we already know i < j)
            energy += dist.powi(6) - 2.0 * dist.powi(3);
        }
        energy
    })
    .sum::<f64>();

self.forces
    .par_iter_mut()
    .enumerate()
    .for_each(|(i, force_i)| {
        // Add new_forces[(i, j)] for i < j
        for j in (i + 1)..nb_particles {
            *force_i += new_forces[i * nb_particles + j];
        }

        // Subtract new_forces[(j, i)] for i > j
        for j in 0..i {
            *force_i += new_forces[j * nb_particles + i];
        }
    });

I'm definitely not sure whether this will be faster, but it's worth a shot. It will probably depend on memory effects, how the extra computation of the earlier approach compares to the extra memory traffic for new_forces, especially if that's too big to fit in CPU cache.

Hello! Sorry for the delayed response, I had to work on other uni projects in the mean time.

I'm definitely not sure whether this will be faster, but it's worth a shot. It will probably depend on memory effects, how the extra computation of the earlier approach compares to the extra memory traffic for new_forces, especially if that's too big to fit in CPU cache.

Thanks a lot for your code snippet! Unfortunately, this approach didn't yield any performance improvements over the previous method (at least on my CPU laptop).

As you've thoroughly answered my questions and gave me great code examples on which to base my future HPC Rust endeavors, I am closing this issue. Thanks again for your help! :)

rayon-rs / rayon

Parallizing nested loops using indices #1017