rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0

Extending some number of Vecs #721

Open dragostis opened 4 years ago

dragostis commented 4 years ago

In order to gain a performance boost from auto-vectorization, I'm storing data in multiple Vecs. Think struct of arrays instead of array of structs. In the array of structs case, it's very easy to simply extend the vector with whatever the parallel iterator provides. However, in the other case I don't have this luxury.

I came up with two solutions:

vec1.resize(..., Default::default());
...
vecN.resize(..., Default::default());

// Initializing the Vecs with Default data works, but it is visibly
// expensive in my benchmarks.

vec1.par_iter_mut().zip(...).for_each(|(val, ...)| *val = ...);

Or use uninitialized memory:

let vec1: Vec<MaybeUninit<...>> = ...;
...
let vecN: Vec<MaybeUninit<...>> = ...;

vec1.reserve(...);
...
vecN.reserve(...);

vec1.par_iter_mut().zip(...).map(|(val, ...)| unsafe { ... })

The problem I have with the second solution is that it's very painful to use and it litters all my code with a lot of unsafe. However, I don't see any solution to remove it since initializing the vector with default values is visibly costly. Not zipping the iterators together also doesn't work because it will have to iterate through the data multiple times.

Any ideas on how to improve this?

dineshadepu commented 4 years ago

Recent versions of Rayon provide parallel iteration over tuples:

(a, b, c).into_par_iter()
    .for_each(|(a_i, b_i, c_i)| {
        // write your code here
    });

I think this makes the code look better, even though it doesn't answer the current question.

cuviper commented 4 years ago

Rayon also has two implementations of ParallelExtend that work like unzip and partition_map:

impl<A, B, FromA, FromB> ParallelExtend<(A, B)> for (FromA, FromB)
where
    A: Send,
    B: Send,
    FromA: Send + ParallelExtend<A>,
    FromB: Send + ParallelExtend<B>,
{...}

impl<L, R, A, B> ParallelExtend<Either<L, R>> for (A, B)
where
    L: Send,
    R: Send,
    A: Send + ParallelExtend<L>,
    B: Send + ParallelExtend<R>,
{...}

These were added in #604, and as noted there, you can also nest this in pairs.

let (vec1, (vec2, (vec3, vec4))) = par_iter.map(|(a, b, c, d)| (a, (b, (c, d)))).unzip();

That should also work for calling par_extend directly. Perhaps we could add flattened expansions of this similar to the MultiZip that @dineshadepu mentioned. I'd probably only want that for the unzip-like fashion, as trying to mix Either in there for partition_map will get really complex.

I kind of wish we had impl<C: ParallelExtend> ParallelExtend for &mut C too, as it would make this automatically able to extend a tuple of mutable references, instead of just by value. However, adding such a blanket impl now would be a breaking change. Even testing just for (&mut A, &mut B) now complains:

error[E0119]: conflicting implementations of trait `iter::ParallelExtend<(_, _)>` for type `(&mut _, &mut _)`:
   --> src/iter/unzip.rs:428:1
    |
413 | / impl<A, B, FromA, FromB> ParallelExtend<(A, B)> for (FromA, FromB)
414 | | where
415 | |     A: Send,
416 | |     B: Send,
...   |
425 | |     }
426 | | }
    | |_- first implementation here
427 |
428 | / impl<A, B, FromA, FromB> ParallelExtend<(A, B)> for (&'_ mut FromA, &'_ mut FromB)
429 | | where
430 | |     A: Send,
431 | |     B: Send,
...   |
440 | |     }
441 | | }
    | |_^ conflicting implementation for `(&mut _, &mut _)`
    |
    = note: downstream crates may implement trait `iter::ParallelExtend<_>` for type `&mut _`
    = note: downstream crates may implement trait `iter::ParallelExtend<_>` for type `&mut _`
dragostis commented 4 years ago

> That should also work for calling par_extend directly.

This doesn't seem to work because of the lack of impl<C: ParallelExtend> ParallelExtend for &mut C.

error[E0599]: no method named `par_extend` found for type `(&mut std::vec::Vec<_>, &mut std::vec::Vec<_>)` in the current scope
  --> src/main.rs:13:28
   |
13 |     (&mut vec0, &mut vec1).par_extend((1u32..10).into_par_iter().map(|n| (n, n)));
   |                            ^^^^^^^^^^ method not found in `(&mut std::vec::Vec<_>, &mut std::vec::Vec<_>)`

error: aborting due to previous error

I'm using this as a temporary fix:

use rayon::prelude::*; // brings ParallelExtend and IntoParallelIterator into scope

struct ExtendVec<'a, T> {
    vec: &'a mut Vec<T>,
}

impl<'a, T> ExtendVec<'a, T> {
    pub fn new(vec: &'a mut Vec<T>) -> Self {
        Self { vec }
    }
}

impl<T: Send> ParallelExtend<T> for ExtendVec<'_, T> {
    fn par_extend<I>(&mut self, par_iter: I)
        where I: IntoParallelIterator<Item = T>
    {
        self.vec.par_extend(par_iter);
    }
}

This seems to cause rustc to compile very slowly.

cuviper commented 4 years ago

> I'm using this as a temporary fix:

Ah, a newtype wrapper does the trick -- we could even make something generic like that.

> This seems to cause rustc to compile very slowly.

I'm not surprised. While the generic unzip.rs is not a lot of code, it's a fairly complex composition, and it will compound quickly when those are nested together. It might fare a little better if you nest that like a binary tree, (((vec0, vec1), (vec2, vec3)), ((vec4, vec5), (vec6, vec7))), but it will still be a challenge.

In a presentation I gave last year, I demonstrated a giant symbol -- 132967 characters! -- from just a 3-way collect. See also #671, which was alleviated some by #673 and rust-lang/rust#62429, but the overall structure still remains the same.

If we had something more directly designed for just unzipping vectors, it wouldn't need to be nearly so complicated.

dragostis commented 4 years ago

I guess it's probably because of the giant symbols and the monomorphization that this is so slow. The tree arrangement doesn't help much, unfortunately.

I was wondering whether there might be a way to implement par_extend separately for each tuple type I'm using, or to have an 8-way unzip implementation.

Even if the solution is dirty, it would still be much better than my current unsafe-riddled approach. Do you think any of these ideas could help reduce the costs?

dragostis commented 4 years ago

I was taking a look at the unzip implementation and it seems like even a manual 8-way tuple would probably produce the same result.

@cuviper, it seems like finding a way to force the Consumer to be the same in all generic types would be a way to solve this, but this would mean that any UnzipN type would not be able to implement the ParallelIterator trait. Do you think there's any way around this? Maybe the consumers could be made into trait objects and stored in a Vec, but I'm not completely sure they are convertible to trait objects, or whether this would impact performance in any way.

cuviper commented 4 years ago

I'm at a loss for designing a better generic unzip. It's really difficult that we can only get each ParallelExtend consumer through a callback into drive_unindexed.

Maybe we need a new trait with a method that directly creates a consumer -- although that also means each collection's consumer type would become part of the public API.

dragostis commented 4 years ago

So, my current approach basically extends CollectConsumer into a CollectTupleConsumer. I've got a working MVP, but I don't know if this is the best approach. It does compile fast enough, though.

dragostis commented 3 years ago

@cuviper, would it be desirable to have the ParallelExtend trait defined for larger tuples, e.g. up to 16? While my workaround works well, I'd rather not have to use unsafe for this.