Open dragostis opened 4 years ago
Recent versions of Rayon provide parallel iteration over a tuple:

```rust
use rayon::prelude::*;

(a, b, c).into_par_iter()
    .for_each(|(a_i, b_i, c_i)| {
        // write your code here
    });
```

I think this makes the code look better, even though it doesn't help answer the current question.
Rayon also has two implementations of `ParallelExtend` that work like `unzip` and `partition_map`:
```rust
impl<A, B, FromA, FromB> ParallelExtend<(A, B)> for (FromA, FromB)
where
    A: Send,
    B: Send,
    FromA: Send + ParallelExtend<A>,
    FromB: Send + ParallelExtend<B>,
{...}

impl<L, R, A, B> ParallelExtend<Either<L, R>> for (A, B)
where
    L: Send,
    R: Send,
    A: Send + ParallelExtend<L>,
    B: Send + ParallelExtend<R>,
{...}
```
These were added in #604, and as noted there, you can also nest this in pairs.
```rust
let (vec1, (vec2, (vec3, vec4))) = par_iter.map(|(a, b, c, d)| (a, (b, (c, d)))).unzip();
```
That should also work for calling `par_extend` directly. Perhaps we could add flattened expansions of this, similar to the `MultiZip` that @dineshadepu mentioned. I'd probably only want that for the `unzip`-like fashion, as trying to mix `Either` in there for `partition_map` will get really complex.
I kind of wish we had `impl<C: ParallelExtend> ParallelExtend for &mut C` too, as it would automatically make it possible to extend a tuple of mutable references instead of just by value. However, adding such a blanket impl now would be a breaking change. Even testing just for `(&mut A, &mut B)` now complains:
```
error[E0119]: conflicting implementations of trait `iter::ParallelExtend<(_, _)>` for type `(&mut _, &mut _)`:
   --> src/iter/unzip.rs:428:1
    |
413 | / impl<A, B, FromA, FromB> ParallelExtend<(A, B)> for (FromA, FromB)
414 | | where
415 | |     A: Send,
416 | |     B: Send,
... |
425 | |     }
426 | | }
    | |_- first implementation here
427 |
428 | / impl<A, B, FromA, FromB> ParallelExtend<(A, B)> for (&'_ mut FromA, &'_ mut FromB)
429 | | where
430 | |     A: Send,
431 | |     B: Send,
... |
440 | |     }
441 | | }
    | |_^ conflicting implementation for `(&mut _, &mut _)`
    |
    = note: downstream crates may implement trait `iter::ParallelExtend<_>` for type `&mut _`
    = note: downstream crates may implement trait `iter::ParallelExtend<_>` for type `&mut _`
```
> That should also work for calling `par_extend` directly.
This doesn't seem to work, because of the lack of `impl<C: ParallelExtend> ParallelExtend for &mut C`.
```
error[E0599]: no method named `par_extend` found for type `(&mut std::vec::Vec<_>, &mut std::vec::Vec<_>)` in the current scope
  --> src/main.rs:13:28
   |
13 |     (&mut vec0, &mut vec1).par_extend((1u32..10).into_par_iter().map(|n| (n, n)));
   |                            ^^^^^^^^^^ method not found in `(&mut std::vec::Vec<_>, &mut std::vec::Vec<_>)`

error: aborting due to previous error
```
I'm using this as a temporary fix:
```rust
use rayon::iter::{IntoParallelIterator, ParallelExtend};

/// Newtype wrapper forwarding `ParallelExtend` to a `&mut Vec<T>`.
struct ExtendVec<'a, T> {
    vec: &'a mut Vec<T>,
}

impl<'a, T> ExtendVec<'a, T> {
    pub fn new(vec: &'a mut Vec<T>) -> Self {
        Self { vec }
    }
}

impl<T: Send> ParallelExtend<T> for ExtendVec<'_, T> {
    fn par_extend<I>(&mut self, par_iter: I)
    where
        I: IntoParallelIterator<Item = T>,
    {
        self.vec.par_extend(par_iter);
    }
}
```
This seems to cause rustc to compile very slowly.
> I'm using this as a temporary fix:
Ah, a newtype wrapper does the trick -- we could even make something generic like that.
> This seems to cause rustc to compile very slowly.
I'm not surprised. While the generic `unzip.rs` is not a lot of code, it's a fairly complex composition, and it will compound quickly when those are nested together. It might fare a little better if you nest that like a binary tree, `(((vec0, vec1), (vec2, vec3)), ((vec4, vec5), (vec6, vec7)))`, but it will still be a challenge.
In a presentation I gave last year, I demonstrated a giant symbol -- 132967 characters! -- from just a 3-way collect. See also #671, which was alleviated some by #673 and rust-lang/rust#62429, but the overall structure still remains the same.
If we had something more directly designed for just unzipping vectors, it wouldn't need to be nearly so complicated.
I guess it's probably because of the giant symbols and the monomorphization that this is so slow. The tree arrangement doesn't help much, unfortunately.
I was thinking about whether there might be a way to implement `par_extend` separately for each tuple type that I'm using, or maybe to have an 8-way unzip implementation. Even if the solution is dirty, it would still be much better than my current unsafe-riddled approach. Do you think any of these ideas could help reduce the costs?
I was taking a look at the unzip implementation and it seems like even a manual 8-way tuple would probably produce the same result.
@cuviper, it seems like finding a way to force the `Consumer` to be the same in all generic types would be a way to solve this, but this would mean that any `UnzipN` type would not be able to implement the `ParallelIterator` trait. Do you think there's any way around this? Maybe the consumers can be made into trait objects and stored in a `Vec`, but I'm not completely sure they are convertible to trait objects, or whether this would impact performance in any way.
I'm at a loss for designing a better generic unzip. It's really difficult that we can only get each `ParallelExtend` consumer through a callback into `drive_unindexed`.
Maybe we need a new trait with a method that directly creates a consumer -- although that also means each collection's consumer type would become part of the public API.
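A rough sketch of what such a trait might look like (hypothetical names, not rayon API; in rayon the associated type would be bounded by `iter::plumbing::Consumer<T>`, left unbounded here to keep the sketch self-contained):

```rust
// Hypothetical sketch, not part of rayon's API: a trait that hands out
// a collection's consumer directly, instead of only through the
// callback passed to `drive_unindexed`. The consumer type then becomes
// part of the public API, which is the downside noted above.
trait ParallelCollect<T: Send> {
    type Consumer;

    fn make_consumer(&mut self, len: usize) -> Self::Consumer;
}
```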
So, my current approach is basically extending `CollectConsumer` to `CollectTupleConsumer`. I've got a working MVP, but I don't know if this is the best approach. However, it does compile fast enough.
@cuviper, would it be desirable to have the `ParallelExtend` trait defined for larger tuples, e.g. up to 16? While my workaround works well, I'd rather not have to use unsafe for this.
In order to gain a performance boost from auto-vectorization, I'm storing data in multiple `Vec`s. Think struct of arrays instead of array of structs. In the array-of-structs case, it's very easy to simply extend the vector with whatever the parallel iterator provides. However, in the other case I don't have this luxury. I came up with two solutions:
Or use uninitialized memory:
The problem I have with the second solution is that it's very painful to use, and it litters all my code with a lot of unsafe. However, I don't see any way to remove it, since initializing the vectors with default values is visibly costly. Not zipping the iterators together doesn't work either, because it would have to iterate through the data multiple times.
Any ideas on how to improve this?