rayon-rs / rayon

Rayon: A data parallelism library for Rust
Apache License 2.0
11.07k stars 501 forks source link

Feature request: values.par_iter().fold_groups(group_size: usize, |left: T, right: T| -> T) #773

Open bergkvist opened 4 years ago

bergkvist commented 4 years ago

Problem:

Potential solution:

Performing 7000 * 25 IO append operations becomes my bottleneck. What if some of these writes could be grouped together in memory first?

Basically, given a group size, an associative fold operation, we could combine individual par_iter-items into groups. Assuming a group size of 25, the number of IO write operations would be reduced from 7000*25 to 7000.

Illustration with a group size of 3. Notice that the last group only combines 2 inputs.

t0---\
t1---->---g0---->
t2---/
t3---\
t4---->---g1---->
t5---/
t6---\
t7---->---g2---->
cuviper commented 4 years ago

The existing chunks might work for you, but maybe you don't want the Vec allocation for each chunk of items.

I guess I could see this as a new method on IndexedParallelIterator, folding in a custom way rather than just into a Vec. I would give that a signature more like our fold:

fn fold_chunks<T, ID, F>(self, chunk_size: usize, identity: ID, fold_op: F) -> Fold<Self, ID, F>
where
    F: Fn(T, Self::Item) -> T + Sync + Send,
    ID: Fn() -> T + Sync + Send,
    T: Send,

A corresponding fold_with -> fold_chunks_with would make sense too.

The signature you gave, with |left: T, right: T| -> T, looks more like what we would call a reduce. Maybe we could also support a chunk/group version of reduce, producing another parallel iterator instead of the final item.

willcrozi commented 2 years ago

I stumbled across this FR after implementing something similar to this a while back, I've brought it up to date here: https://github.com/willcrozi/rayon/tree/fold_chunks

My use case was grouping strings for output, avoiding the extra allocations associated with chunks.

I think it might be slightly different to the above: my fold_chunks is a parallel iterator that yields the sequential fold of each chunk and it's implemented it as a Producer (I used chunks for reference). I guess a chained call to fold would make it equivalent?

If there's interest I'd be happy to modify as needed, then open a PR once I know I'm on the right track (also adding fold_chunks_with).

Also, would it make sense for a try_fold_chunks for fallible ops since the given use case is IO based?