
operating by blocks #805


wagnerf42 commented 4 years ago

hello dear all,

i'm trying to merge some of the work i did on (my own) parallel iterators into rayon.

this is a bit of a lengthy post since i need to explain a few things, and i also have some questions.

you can find my code here: https://github.com/wagnerf42/by_blocks

it basically adds two new adaptors to IndexedParallelIterator.

the first one, "by_blocks", takes a sequential iterator of block sizes and ensures that, when the final reduction is launched, instead of one parallel execution over the whole input you get a sequence of parallel executions over input blocks.
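for illustration, a call could look like this (a sketch against the adaptor in my repo; the exact signature there may differ, and this is not part of rayon's current API):

```rust
use rayon::prelude::*;

fn main() {
    // hypothetical usage sketch of the proposed `by_blocks` adaptor: it
    // takes any sequential Iterator<Item = usize> of block sizes.
    let sum: u64 = (0..1_000_000u64)
        .into_par_iter()
        .by_blocks(std::iter::repeat(100_000)) // fixed blocks of 100_000 elements
        .sum();
    println!("{}", sum);
}
```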

this proves particularly useful for interruptible computations. in particular, performance for find_first or find_last (and i guess nth as well) can be greatly improved: if the first target is somewhere on the left of the parallel iterator, rayon's current algorithm still splits the producer in two and starts working on both the left and the right, and all of that right-hand-side computation is in fact useless.

if instead we proceed by blocks of doubling sizes (1, 2, 4, 8, ...), we get the following properties: the search stops at the end of the first block containing a match, so nothing far to its right is ever processed; and because the sizes double, the total work stays within a small constant factor of the work needed to reach the first match, while each block is still searched in parallel.
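the mechanics can be sketched by hand with plain rayon, running one full parallel find_first per block (the adaptor does this under the hood):

```rust
use rayon::prelude::*;

// hand-rolled sketch of blockwise find_first with doubling block sizes,
// using only current rayon: each block is searched fully in parallel, and
// blocks to the right of a match are never touched.
fn find_first_by_blocks<T: Sync>(
    data: &[T],
    pred: impl Fn(&T) -> bool + Sync,
) -> Option<&T> {
    let mut start = 0;
    let mut size = 1;
    while start < data.len() {
        let end = (start + size).min(data.len());
        if let Some(found) = data[start..end].par_iter().find_first(|x| pred(x)) {
            return Some(found); // everything right of `end` was never processed
        }
        start = end;
        size *= 2; // doubling keeps total work within a constant factor
                   // of the position of the first match
    }
    None
}
```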

all experiments i did show that blocks have a strong impact on speedup for this operation.

this will also impact other interruptible operations like try_reduce, all, any, ..., but to a lesser extent (it normalizes both the worst-case and best-case scenarios).

you can see it in action by running "RAYON_NUM_THREADS=2 cargo run --release" in my repo.

you can view "find.svg" in firefox as an example trace file.

the second adaptor is "by_blocks_iter".

this time the idea is different. it is meant for cases where all of the input needs to be processed but locality effects come into play. blocking is well known for enhancing locality.

in our case we target computations where data is reused between the folding stage and later stages. this is for example the case in "filter + collect" operations, where the folding stage creates several vectors which are then concatenated together to form the final vector. if you wait too long between the initial vector production and its reuse in the concatenation, you lose the benefits of the caches.

in rayon, collecting from a filtered iterator is done by folding elements into vectors, mapping each vector into a one-element list, reducing by concatenating the lists, and then looping sequentially over the final list to build the result.
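spelled out with rayon's public adaptors (a sketch of the strategy, not the actual internal code path), this looks like:

```rust
use rayon::prelude::*;
use std::collections::LinkedList;

// sketch of the filtered-collect strategy, using public adaptors:
fn filter_collect(v: &[u32]) -> Vec<u32> {
    let lists: LinkedList<Vec<u32>> = v
        .par_iter()
        .filter(|&&x| x % 2 == 0)
        .fold(Vec::new, |mut acc, &x| { acc.push(x); acc }) // fold into vectors
        .map(|vec| std::iter::once(vec).collect())          // one list per vector
        .reduce(LinkedList::new, |mut l, mut r| { l.append(&mut r); l }); // concat
    lists.into_iter().flatten().collect() // final sequential loop
}
```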

"by_blocks_iter" is doing exactly that except that the list of folded results is built piecewise.

In this case it is best to select block sizes such that the produced data fits in the cache, for example.

all reductions between the folded results are executed sequentially, so you are sure every leftmost reduction has completed before folding starts further to the right.
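the intended effect can be sketched by hand with plain rayon; the block size here is an assumed, machine-dependent tuning knob (ideally chosen so each block's output fits in cache):

```rust
use rayon::prelude::*;

// hand-rolled sketch of blockwise "filter + collect": each block is
// filtered in parallel and its output is appended to the result right
// away, while it is still warm in cache.
fn filter_collect_by_blocks(data: &[u32], block_size: usize) -> Vec<u32> {
    let mut out = Vec::new();
    for block in data.chunks(block_size) {
        // parallel work on one block...
        let mut part: Vec<u32> =
            block.par_iter().copied().filter(|&x| x % 2 == 0).collect();
        // ...and its sequential reuse, before moving further right
        out.append(&mut part);
    }
    out
}
```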

In the provided example file there are two different examples:

- with hand-tuned block sizes for my machine.

the gains are not as spectacular as for "find_first", but they are there even for very, very fast computations (see timings.txt) and might actually be larger on older machines (with no prefetcher) or with larger data.

note that this adaptor could also prove useful for out-of-core computations.

a final note on usage: both of these adaptors open the possibility of infinite parallel iterators like "repeat", "repeat_with" or open-ended ranges.

ok, now here are some of my questions. they are intended to ensure that my implementation is correct. (feel free to take a look at them, by the way)

0) what is your opinion on both adaptors?

1) why does UnindexedConsumer extend Consumer and not the opposite? is there any Consumer in rayon which is not unindexed? if so, what does it mean that you can split at i but not split_off_left?

2) why do we need to split consumers? is cloning not enough? which consumer uses the index in split_at?

3) is it ok to discard the reducer obtained from split_at? is there an example in rayon where the reducer behaves differently based on the index?

4) i actually have two implementations of "by_blocks": one using Consumer and the other using UnindexedConsumer. https://github.com/wagnerf42/by_blocks/commit/f4d9e54ed6935b521d27f832cc1c952df1d2f0f7#diff-46b4f952ee6ad3c2fa43328b3e89d475

are both ok? should i prefer one over the other?

nikomatsakis commented 4 years ago

@wagnerf42 this is very cool! to answer a few of your questions:

The role of Consumer. The "indexed" consumer is meant to be a consumer that requires precise indices. It is used when collecting into a vector -- if we know the exact length, and the exact points where the splits occur, we can allocate the vector up front and just divide up the slices. This works great for stuff like vec.par_iter().map(f).collect(). This also answers why we "split" consumers -- the point of the split is the point where we split the target slice.
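Here's a rough illustration of why precise indices pay off there (illustrative only, not rayon's actual internals):

```rust
use rayon::join;

// illustrative sketch, not rayon's internals: with exact lengths known,
// collect can allocate the output once and split it at the same index as
// the producer -- the consumer's split_at mirrors split_at_mut here.
fn map_into<T: Sync, U: Send>(input: &[T], out: &mut [U], f: &(impl Fn(&T) -> U + Sync)) {
    debug_assert_eq!(input.len(), out.len());
    if input.len() <= 1024 {
        // small enough: fill this output slice sequentially
        for (o, i) in out.iter_mut().zip(input) {
            *o = f(i);
        }
    } else {
        // split input and output at the same precise index
        let mid = input.len() / 2;
        let (in_l, in_r) = input.split_at(mid);
        let (out_l, out_r) = out.split_at_mut(mid);
        join(|| map_into(in_l, out_l, f), || map_into(in_r, out_r, f));
    }
}
```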

The reason that UnindexedConsumer: Consumer is that an unindexed consumer can "accept more" -- i.e., it extends the base consumer type (which gets index information) with new methods that do not require precise index information. Of course, if you have precise indices, that's not a problem; they can just be discarded.
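For reference, the relationship looks like this in rayon's iter::plumbing module (abridged):

```rust
// abridged from rayon::iter::plumbing
pub trait Consumer<Item>: Send + Sized {
    type Folder: Folder<Item, Result = Self::Result>;
    type Reducer: Reducer<Self::Result>;
    type Result: Send;
    fn split_at(self, index: usize) -> (Self, Self, Self::Reducer);
    fn into_folder(self) -> Self::Folder;
    fn full(&self) -> bool;
}

pub trait UnindexedConsumer<I>: Consumer<I> {
    fn split_off_left(&self) -> Self; // split anywhere; no index needed
    fn to_reducer(&self) -> Self::Reducer;
}
```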

Regarding the reducer -- I don't remember if that is ok! We certainly prepare for it, in the sense that discarding it is not unsafe, but I think there may be some consumers or other things that misbehave. We'd have to check that out.

> i actually have two implementations of "by_blocks". one using Consumer and the other using UnindexedConsumer.

I feel like we might conceivably want both? (i.e., it'd be more efficient, maybe, if indices are available?) I'll have to read the code more closely to answer better.