Closed — NIC0NIC0NI closed this issue 3 months ago
Actually, the way you sum over ranges is under your control with `parallel_reduce`. E.g. in the functional interface `parallel_reduce(range, identity, body, reduction)`, the body accepts a range and the value accumulated so far, and shall return that value updated with the data from the range. But it does not force you to sum directly into that value; instead, you can start from the identity value, e.g.:
```cpp
auto body = [data, identity, reduction](const tbb::blocked_range<int>& r, double accum) -> double {
    double sum = identity;
    for (int i = r.begin(); i < r.end(); ++i)
        sum = reduction(sum, data[i]);
    return reduction(accum, sum);
};
```
The only significant difference is that you need to explicitly pass the value of `identity` to the body (unless it is well known, such as 0 for summation); and even that might not be needed if you can initialize the local accumulator (`sum`) by reducing the first two items of the processed range.
A similar approach can be used when implementing the body for the `parallel_reduce(range, body)` form of the algorithm.
Thanks for informing us about potential performance improvements in the Parallel STL algorithms; we will look into that.
Kan Liu,
Could you please share the code with the modified `pstl::partition` and `pstl::stable_partition`? (Or a general reproducer for `parallel_reduce`, if you have one.)
We have our own micro-benchmarks and would like to double-check the results.
@MikeDvorskiy
I checked the result and found some mistakes. The reduction in `pstl::partition` is not O(n+m) but O(min(n,m)), so it is not accelerated by `parallel_deterministic_reduce`. I apologize for these mistakes.
As for `pstl::stable_partition`, the reduction (`std::rotate`) is O(n+m). `parallel_deterministic_reduce` would be faster than `parallel_reduce` when using `simple_partitioner` with a small grain size, but the difference can be made marginal by a big grain size. Since `parallel_deterministic_reduce` doesn't support `auto_partitioner`, I didn't test it; maybe there's little difference under `auto_partitioner`.
Here is the code
@NIC0NIC0NI is this issue still relevant?
If anyone encounters this issue in the future, please open a new issue with a link to this one.
This issue can be closed. Sorry, due to GitHub's 2FA requirement I cannot log in to GitHub from my region now; I am replying from email.
---- Replied Message ----
From: Karolina, 07/23/2024 15:13
Subject: Re: [oneapi-src/oneTBB] [Feature] non-deterministic parallel_deterministic_reduce, or maybe parallel_pairwise_reduce? (#231)
`parallel_deterministic_reduce` with `auto_partitioner` and `affinity_partitioner`. These two partitioners would make it non-deterministic, but a new name like `parallel_pairwise_reduce` can be less confusing.

While computing the sum `1+2+3+4+5+6+7+8+9+10+11+12` with 2 threads, `simple_partitioner`, and grain size 2:

- `parallel_reduce` may compute `(((((1+2)+3)+4)+5)+6)+(((((7+8)+9)+10)+11)+12)`
- `parallel_deterministic_reduce` may compute `((1+(2+3))+(4+(5+6)))+((7+(8+9))+(10+(11+12)))`

For sequence merging, `((1+(2+3))+(4+(5+6)))` is faster than `(((((1+2)+3)+4)+5)+6)`. To be more specific, `((1+(2+3))+(4+(5+6)))` costs `2+2+3+3+6=16` while `(((((1+2)+3)+4)+5)+6)` costs `2+3+4+5+6=20`. For example, in `pstl::partition` and `pstl::stable_partition`, replacing `parallel_reduce` with `parallel_deterministic_reduce` would result in better performance (to make it correct, I had to replace `auto_partitioner` with `simple_partitioner` and set the grain size equal for the two versions).

For summation of positive floating-point numbers, such as calculating the 1-norm and 2-norm, pairwise summation has better floating-point accuracy, even if it is non-deterministic. That means `parallel_deterministic_reduce` can produce a more accurate result even if `auto_partitioner` or `affinity_partitioner` is used.