rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0

How to use dyn Trait with rayon without performance loss? #1038

Open kloki opened 1 year ago

kloki commented 1 year ago

I noticed a massive performance loss when looping over a Vec<Box<dyn Trait>> compared to looping over a Vec<struct> using into_par_iter. I'm not sure what is causing this or whether I'm doing something wrong.

Below I have a snippet that reproduces the behavior.

use rayon::prelude::*;
use std::f64::consts::PI;
use std::time::Instant;

pub trait Body: Sync + Send {
    fn size(&self) -> f64 {
        0.
    }
}

#[derive(Clone)]
pub struct Sphere {
    radius: f64,
}

impl Sphere {
    pub fn new() -> Self {
        Sphere { radius: 4.0 }
    }
}

impl Body for Sphere {
    fn size(&self) -> f64 {
        4.0 / 3.0 * PI * self.radius * self.radius * self.radius
    }
}
pub struct World {
    bodies: Vec<Box<dyn Body>>,
    spheres: Vec<Sphere>,
}

impl World {
    fn new() -> Self {
        let mut world = World {
            bodies: vec![],
            spheres: vec![],
        };
        for _ in 0..100000000 {
            world.spheres.push(Sphere::new());
            world.bodies.push(Box::new(Sphere::new()));
        }
        world
    }
}

pub fn main() {
    println!("Creating structs");
    let world = World::new();
    println!("Normal,Spheres");
    let timer = Instant::now();
    world.spheres.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Normal,Bodies");
    let timer = Instant::now();
    world.bodies.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Spheres");
    let timer = Instant::now();
    world.spheres.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Bodies");
    let timer = Instant::now();
    world.bodies.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());
}

When running with the release build, this produces the output:


Creating structs
Normal,Spheres
elapsed time:97ns
Normal,Bodies
elapsed time:323.960404ms
Parallel,Spheres
elapsed time:88.960257ms
Parallel,Bodies
elapsed time:6.015788183s
adamreichold commented 1 year ago

97ns is suspiciously short and I suspect that the optimizer is computing the sum at compile time.

The slowdown is due to consuming the values in the parallel computations, whereas the sequential ones just iterate over references. This has a large effect because the Box<dyn Body> case needs dynamic dispatch both to compute size and to drop the values themselves to free the individual heap allocations.

If I just replace .into_par_iter() by .par_iter(), I get

Creating structs
Normal,Spheres
elapsed time:80ns
Normal,Bodies
elapsed time:183.164156ms
Parallel,Spheres
elapsed time:25.692394ms
Parallel,Bodies
elapsed time:153.980776ms

which seems reasonable except for the 80ns case, which is probably not doing any computation at runtime at all.

EDIT: It really is just the parallel freeing and glibc's malloc implementation, not the dynamic dispatch itself. Remove either one from the picture and the pathological slowdown disappears.

adamreichold commented 1 year ago

(As to why the parallel version suffers so much from freeing the heap allocations: note that the heap is a shared resource, and even the best heap allocators need some level of synchronization, which will most likely experience significant contention when all hardware threads are basically freeing allocations from a single global heap.)

adamreichold commented 1 year ago

((For example, using glibc's default allocator I get 15s for your original "Parallel,Bodies" example whereas using a more specialized allocator which is well-suited to such a scenario like snmalloc-rs brings this down to 175ms.))
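Swapping in such an allocator is a small change. A sketch using the snmalloc-rs crate mentioned above (the dependency version is an assumption; check crates.io before pinning it):

```rust
// Cargo.toml: snmalloc-rs = "0.3"  (version is an assumption)
use snmalloc_rs::SnMalloc;

// Every allocation and free in the program, including dropping the
// Box<dyn Body> values from rayon's worker threads, now goes through
// snmalloc, which keeps per-thread state and degrades far less under
// parallel frees than glibc's default malloc.
#[global_allocator]
static GLOBAL: SnMalloc = SnMalloc;

fn main() {
    let v: Vec<Box<u64>> = (0..1_000).map(Box::new).collect();
    drop(v); // freed through snmalloc
}
```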

kloki commented 1 year ago

Thanks for the response. After replacing into_par_iter with par_iter, I get similar results.

I expected some slowdown but was unsure how much. I wanted to make sure I wasn't making an obvious mistake.

cuviper commented 1 year ago

I suspect that the optimizer is computing the sum at compile time.

Not even that - it's completely throwing away the unused computation. Try printing it:

    let sum = /*...*/.sum::<f64>();
    println!("elapsed time:{:?}, sum:{sum}", timer.elapsed());

That slows down the serial versions, while the parallel versions are about the same.

In general, you could also use black_box for benchmarking, but it's sometimes tricky to make sure that you don't inhibit too much that way.
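For the serial case above, std::hint::black_box (stable since Rust 1.66) can be applied to both the input and the output, which is one way to keep the measurement honest without pulling in a benchmark harness:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();

    let timer = Instant::now();
    // black_box on the input stops the optimizer from precomputing the sum;
    // black_box on the output stops it from discarding the unused result.
    let sum = black_box(black_box(&data).iter().sum::<f64>());
    println!("elapsed time:{:?}, sum:{sum}", timer.elapsed());
}
```

As noted, black_box is a blunt instrument: wrapping too much can also suppress optimizations you do want measured, so apply it only at the boundary of the computation under test.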