rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0

How to use dyn Trait with rayon without performance loss? #1038

Open kloki opened 1 year ago

kloki commented 1 year ago

I noticed a massive performance loss when looping over a Vec<Box<dyn Trait>> compared to looping over a Vec<struct> using into_par_iter. I'm not sure what is causing this or whether I'm doing something wrong.

Below I have a snippet that reproduces the behavior.

use rayon::prelude::*;
use std::f64::consts::PI;
use std::time::Instant;

pub trait Body: Sync + Send {
    fn size(&self) -> f64 {
        0.
    }
}

#[derive(Clone)]
pub struct Sphere {
    radius: f64,
}

impl Sphere {
    pub fn new() -> Self {
        Sphere { radius: 4.0 }
    }
}

impl Body for Sphere {
    fn size(&self) -> f64 {
        4.0 / 3.0 * PI * self.radius * self.radius * self.radius
    }
}
pub struct World {
    bodies: Vec<Box<dyn Body>>,
    spheres: Vec<Sphere>,
}

impl World {
    fn new() -> Self {
        let mut world = World {
            bodies: vec![],
            spheres: vec![],
        };
        for _ in 0..100000000 {
            world.spheres.push(Sphere::new());
            world.bodies.push(Box::new(Sphere::new()));
        }
        world
    }
}

pub fn main() {
    println!("Creating structs");
    let world = World::new();
    println!("Normal,Spheres");
    let timer = Instant::now();
    world.spheres.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Normal,Bodies");
    let timer = Instant::now();
    world.bodies.iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Spheres");
    let timer = Instant::now();
    world.spheres.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());

    println!("Parallel,Bodies");
    let timer = Instant::now();
    world.bodies.into_par_iter().map(|s| s.size()).sum::<f64>();
    println!("elapsed time:{:?}", timer.elapsed());
}

When running with the release build, this produces the output:


Creating structs
Normal,Spheres
elapsed time:97ns
Normal,Bodies
elapsed time:323.960404ms
Parallel,Spheres
elapsed time:88.960257ms
Parallel,Bodies
elapsed time:6.015788183s
adamreichold commented 1 year ago

97ns is suspiciously short and I suspect that the optimizer is computing the sum at compile time.

The slowdown is due to consuming the values in the parallel computations, whereas the sequential ones just iterate over references. This has a large effect because the Box<dyn Body> case needs dynamic dispatch both to compute size and to drop the values themselves to free the individual heap allocations.

If I just replace .into_par_iter() by .par_iter(), I get

Creating structs
Normal,Spheres
elapsed time:80ns
Normal,Bodies
elapsed time:183.164156ms
Parallel,Spheres
elapsed time:25.692394ms
Parallel,Bodies
elapsed time:153.980776ms

which seems reasonable except for the 80ns case, which is probably not doing any computation at runtime at all.

EDIT: It really is just the parallel freeing and glibc's malloc implementation, not the dynamic dispatch itself. Remove either one from the picture and the pathological slowdown disappears.

adamreichold commented 1 year ago

(As to why the parallel version suffers so much from freeing the heap allocations: note that the heap is a shared resource, and even the best heap allocators need some level of synchronization, which will most likely experience significant contention when all hardware threads are basically freeing allocations from a single global heap.)

adamreichold commented 1 year ago

((For example, using glibc's default allocator I get 15s for your original "Parallel,Bodies" example whereas using a more specialized allocator which is well-suited to such a scenario like snmalloc-rs brings this down to 175ms.))
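Swapping in such an allocator is a small change. A sketch using the snmalloc-rs crate mentioned above (the dependency version is an assumption; check crates.io before pinning it):

```rust
// Cargo.toml: snmalloc-rs = "0.3"  (version is an assumption)
use snmalloc_rs::SnMalloc;

// Every allocation and free in the program, including dropping the
// Box<dyn Body> values from rayon's worker threads, now goes through
// snmalloc, which keeps per-thread state and degrades far less under
// parallel frees than glibc's default malloc.
#[global_allocator]
static GLOBAL: SnMalloc = SnMalloc;

fn main() {
    let v: Vec<Box<u64>> = (0..1_000).map(Box::new).collect();
    drop(v); // freed through snmalloc
}
```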

kloki commented 1 year ago

Thanks for the response. After replacing into_par_iter with par_iter, I get similar results.

I expected some slowdown but was unsure how much. I wanted to make sure I wasn't making an obvious mistake.

cuviper commented 1 year ago

I suspect that the optimizer is computing the sum at compile time.

Not even that - it's completely throwing away the unused computation. Try printing it:

    let sum = /*...*/.sum::<f64>();
    println!("elapsed time:{:?}, sum:{sum}", timer.elapsed());

That slows down the serial versions, while the parallel versions are about the same.

In general, you could also use black_box for benchmarking, but it's sometimes tricky to make sure that you don't inhibit too much that way.
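For the serial case above, std::hint::black_box (stable since Rust 1.66) can be applied to both the input and the output, which is one way to keep the measurement honest without pulling in a benchmark harness:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    let data: Vec<f64> = (0..1_000_000).map(|i| i as f64).collect();

    let timer = Instant::now();
    // black_box on the input stops the optimizer from precomputing the sum;
    // black_box on the output stops it from discarding the unused result.
    let sum = black_box(black_box(&data).iter().sum::<f64>());
    println!("elapsed time:{:?}, sum:{sum}", timer.elapsed());
}
```

As noted, black_box is a blunt instrument: wrapping too much can also suppress optimizations you do want measured, so apply it only at the boundary of the computation under test.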