
oneAPI Threading Building Blocks (oneTBB)
https://oneapi-src.github.io/oneTBB/
Apache License 2.0

[Feature] non-deterministic `parallel_deterministic_reduce`, or maybe `parallel_pairwise_reduce`? #231

Closed NIC0NIC0NI closed 3 months ago

NIC0NIC0NI commented 4 years ago

Please consider allowing parallel_deterministic_reduce to be used with auto_partitioner and affinity_partitioner. These two partitioners would make it non-deterministic, so a new name such as parallel_pairwise_reduce might be less confusing.

Consider computing the sum 1+2+3+4+5+6+7+8+9+10+11+12 with 2 threads, simple_partitioner, and grain size 2 (i.e. the range splits into 6 chunks):

For sequence merging, ((1+(2+3))+(4+(5+6))) is faster than (((((1+2)+3)+4)+5)+6). To be more specific, assuming that merging two sequences costs the sum of their lengths, ((1+(2+3))+(4+(5+6))) costs 2+2+3+3+6=16 while (((((1+2)+3)+4)+5)+6) costs 2+3+4+5+6=20. For example, in pstl::partition and pstl::stable_partition, replacing parallel_reduce with parallel_deterministic_reduce would result in better performance (to make the comparison fair, I had to replace auto_partitioner with simple_partitioner and set the same grain size for both versions).

For summation of positive floating-point numbers, such as calculating the 1-norm and 2-norm, pairwise summation has better floating-point accuracy even when it is non-deterministic. That means parallel_deterministic_reduce could produce a more accurate result even if auto_partitioner or affinity_partitioner were used.

akukanov commented 4 years ago

Actually, the way you sum up ranges is under your control with parallel_reduce. E.g. for the functional interface parallel_reduce(range, identity, body, reduction), the body accepts the range and the so-far-accumulated value, and shall return that value updated with the data from the range. But nothing forces you to sum directly into that value; instead, you can start from the identity value, e.g.:

auto body = [data, identity, reduction](const tbb::blocked_range<int>& r, double accum) -> double {
    double sum = identity;
    for (int i = r.begin(); i < r.end(); ++i)
        sum = reduction(sum, data[i]);
    return reduction(accum, sum);
};

The only significant difference is that you need to explicitly pass the value of identity to the body (unless it is well known, such as 0 for summation); and even that might not be needed if you can initialize the local accumulator (sum) by reducing the first two items of the processed range.

A similar approach can be used in the implementation of the body for the parallel_reduce(range, body) form of the algorithm.

Thanks for pointing out potential performance improvements in the Parallel STL algorithms; we will look into that.

MikeDvorskiy commented 4 years ago

Kan Liu, Could you please share the code with modified pstl::partition and pstl::stable_partition?
(Or a general reproducer for parallel_reduce, if you have one.) We have our own micro-benchmarks and would like to double-check the results.

NIC0NIC0NI commented 4 years ago

@MikeDvorskiy

I re-checked the results and found some mistakes. The reduction in pstl::partition is not O(n+m) but O(min(n,m)), so it is not accelerated by parallel_deterministic_reduce. I apologize for these mistakes.

As for pstl::stable_partition, the reduction (std::rotate) is O(n+m). parallel_deterministic_reduce would be faster than parallel_reduce when using simple_partitioner with a small grain size, but the difference shrinks as the grain size grows. Since parallel_deterministic_reduce doesn't support auto_partitioner, I could not test that configuration; there is probably little difference under auto_partitioner.

Here is the code

nofuturre commented 3 months ago

@NIC0NIC0NI is this issue still relevant?

nofuturre commented 3 months ago

If anyone encounters this issue in the future, please open a new issue with a link to this one.

NIC0NIC0NI commented 3 months ago

This issue can be closed. Sorry, due to GitHub's 2FA requirement I cannot log in to GitHub from my region now; I am replying via email.
