
operating by blocks #805


wagnerf42 commented 4 years ago

hello dear all,

i'm trying to merge some of the work i did on (my own) parallel iterators into rayon.

this is a bit of a lengthy post since i need to explain a few things, and i also have some questions.

you can find my code here: https://github.com/wagnerf42/by_blocks

it basically adds two new adaptors to IndexedParallelIterator.

the first one, "by_blocks", takes a sequential iterator of block sizes and ensures that, when the final reduction is launched, instead of one parallel execution over the whole input you get a sequence of parallel executions over input blocks.
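for illustration, a call could look like this (a sketch against the adaptor in my repo; the exact signature there may differ, and this is not part of rayon's current API):

```rust
use rayon::prelude::*;

fn main() {
    // hypothetical usage sketch of the proposed `by_blocks` adaptor: it
    // takes any sequential Iterator<Item = usize> of block sizes.
    let sum: u64 = (0..1_000_000u64)
        .into_par_iter()
        .by_blocks(std::iter::repeat(100_000)) // fixed blocks of 100_000 elements
        .sum();
    println!("{}", sum);
}
```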

this proves particularly useful for interruptible computations. in particular, performance for find_first or find_last (and i guess nth as well) can be greatly improved: if the first target is somewhere on the left of the parallel iterator, rayon's current algorithm still splits the producer in two and starts working on both the left and the right, and all of that right-hand-side computation is in fact useless.

if instead we proceed by blocks of doubling sizes (1, 2, 4, 8, ...), we get the following properties: the search stops at the end of the first block containing a match, so nothing far to its right is ever processed; and because the sizes double, the total work stays within a small constant factor of the work needed to reach the first match, while each block is still searched in parallel.
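the mechanics can be sketched by hand with plain rayon, running one full parallel find_first per block (the adaptor does this under the hood):

```rust
use rayon::prelude::*;

// hand-rolled sketch of blockwise find_first with doubling block sizes,
// using only current rayon: each block is searched fully in parallel, and
// blocks to the right of a match are never touched.
fn find_first_by_blocks<T: Sync>(
    data: &[T],
    pred: impl Fn(&T) -> bool + Sync,
) -> Option<&T> {
    let mut start = 0;
    let mut size = 1;
    while start < data.len() {
        let end = (start + size).min(data.len());
        if let Some(found) = data[start..end].par_iter().find_first(|x| pred(x)) {
            return Some(found); // everything right of `end` was never processed
        }
        start = end;
        size *= 2; // doubling keeps total work within a constant factor
                   // of the position of the first match
    }
    None
}
```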

all experiments i did show that blocks have a strong impact on speedup for this operation.

this will also impact other interruptible operations like try_reduce, all, any, ..., but to a lesser extent (it normalizes both the worst-case and best-case scenarios).

you can see it in action by running "RAYON_NUM_THREADS=2 cargo run --release" in my repo.

you can view "find.svg" in firefox as an example trace file.

the second adaptor is "by_blocks_iter".

this time the idea is different. it is meant for cases where all of the input needs to be processed but locality effects come into play. blocking is well known for enhancing locality.

in our case we target computations where data is reused between the folding stage and later stages. this is for example the case in "filter + collect" operations, where the folding stage creates several vectors which are then concatenated together to form the final vector. if you wait too long between the initial vector production and its reuse in the concatenation, you lose the benefits of the caches.

in rayon, collecting from a filtered iterator is done by folding elements into vectors, mapping each vector into a one-element list, reducing by concatenating the lists, and then looping sequentially over the final list to build the result.
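spelled out with rayon's public adaptors (a sketch of the strategy, not the actual internal code path), this looks like:

```rust
use rayon::prelude::*;
use std::collections::LinkedList;

// sketch of the filtered-collect strategy, using public adaptors:
fn filter_collect(v: &[u32]) -> Vec<u32> {
    let lists: LinkedList<Vec<u32>> = v
        .par_iter()
        .filter(|&&x| x % 2 == 0)
        .fold(Vec::new, |mut acc, &x| { acc.push(x); acc }) // fold into vectors
        .map(|vec| std::iter::once(vec).collect())          // one list per vector
        .reduce(LinkedList::new, |mut l, mut r| { l.append(&mut r); l }); // concat
    lists.into_iter().flatten().collect() // final sequential loop
}
```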

"by_blocks_iter" is doing exactly that except that the list of folded results is built piecewise.

In this case it is best to select block sizes such that the produced data fits in the cache, for example.

all reductions between the folded results are executed sequentially, so you are sure every leftmost reduction has completed before folding starts further to the right.
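the intended effect can be sketched by hand with plain rayon; the block size here is an assumed, machine-dependent tuning knob (ideally chosen so each block's output fits in cache):

```rust
use rayon::prelude::*;

// hand-rolled sketch of blockwise "filter + collect": each block is
// filtered in parallel and its output is appended to the result right
// away, while it is still warm in cache.
fn filter_collect_by_blocks(data: &[u32], block_size: usize) -> Vec<u32> {
    let mut out = Vec::new();
    for block in data.chunks(block_size) {
        // parallel work on one block...
        let mut part: Vec<u32> =
            block.par_iter().copied().filter(|&x| x % 2 == 0).collect();
        // ...and its sequential reuse, before moving further right
        out.append(&mut part);
    }
    out
}
```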

In the provided example file there are two different examples:

- with hand-tuned block sizes for my machine.

the gains are not as spectacular as for "find_first", but they are there even for very, very fast computations (see timings.txt) and might actually be larger on older machines (with no prefetcher) or with larger data.

note that this adaptor could also prove useful for out-of-core computations.

a final note on usage: both of these adaptors open the possibility of infinite parallel iterators like "repeat", "repeat_with" or open-ended ranges.

ok, now here are some of my questions. they are intended to ensure that my implementation is correct. (feel free to take a look at them, by the way)

0) what is your opinion on both adaptors?

1) why does UnindexedConsumer extend Consumer and not the opposite? is there any Consumer in rayon which is not unindexed? if so, what does it mean that you can split at i but not split_off_left?

2) why do we need to split consumers? is cloning not enough? which consumer uses the index in split_at?

3) is it ok to discard the reducer obtained from split_at? is there an example in rayon where the reducer behaves differently based on the index?

4) i actually have two implementations of "by_blocks": one using Consumer and the other using UnindexedConsumer. https://github.com/wagnerf42/by_blocks/commit/f4d9e54ed6935b521d27f832cc1c952df1d2f0f7#diff-46b4f952ee6ad3c2fa43328b3e89d475

are both ok? should i prefer one over the other?

nikomatsakis commented 4 years ago

@wagnerf42 this is very cool! to answer a few of your questions:

The role of Consumer. The "indexed" consumer is meant to be a consumer that requires precise indices. It is used when collecting into a vector -- if we know the exact length, and the exact points where the splits occur, we can allocate the vector up front and just divide up the slices. This works great for stuff like vec.par_iter().map(f).collect(). This also answers why we "split" consumers -- the point of the split is the point where we split the target slice.
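Here's a rough illustration of why precise indices pay off there (illustrative only, not rayon's actual internals):

```rust
use rayon::join;

// illustrative sketch, not rayon's internals: with exact lengths known,
// collect can allocate the output once and split it at the same index as
// the producer -- the consumer's split_at mirrors split_at_mut here.
fn map_into<T: Sync, U: Send>(input: &[T], out: &mut [U], f: &(impl Fn(&T) -> U + Sync)) {
    debug_assert_eq!(input.len(), out.len());
    if input.len() <= 1024 {
        // small enough: fill this output slice sequentially
        for (o, i) in out.iter_mut().zip(input) {
            *o = f(i);
        }
    } else {
        // split input and output at the same precise index
        let mid = input.len() / 2;
        let (in_l, in_r) = input.split_at(mid);
        let (out_l, out_r) = out.split_at_mut(mid);
        join(|| map_into(in_l, out_l, f), || map_into(in_r, out_r, f));
    }
}
```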

The reason that UnindexedConsumer: Consumer is that an unindexed consumer can "accept more" -- i.e., it extends the base consumer type (which gets index information) with new methods that do not require precise index information. Of course, if you have precise indices, that's not a problem; they can just be discarded.
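For reference, the relationship looks like this in rayon's iter::plumbing module (abridged):

```rust
// abridged from rayon::iter::plumbing
pub trait Consumer<Item>: Send + Sized {
    type Folder: Folder<Item, Result = Self::Result>;
    type Reducer: Reducer<Self::Result>;
    type Result: Send;
    fn split_at(self, index: usize) -> (Self, Self, Self::Reducer);
    fn into_folder(self) -> Self::Folder;
    fn full(&self) -> bool;
}

pub trait UnindexedConsumer<I>: Consumer<I> {
    fn split_off_left(&self) -> Self; // split anywhere; no index needed
    fn to_reducer(&self) -> Self::Reducer;
}
```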

Regarding the reducer -- I don't remember if that is ok! We certainly prepare for it, in the sense that discarding it is not unsafe, but I think there may be some consumers or other things that misbehave. We'd have to check that out.

> i actually have two implementations of "by_blocks". one using Consumer and the other using UnindexedConsumer.

I feel like we might conceivably want both? (i.e., it'd be more efficient, maybe, if indices are available?) I'll have to read the code more closely to answer better.