Open BillyONeal opened 5 years ago
Should generate
and generate_n
also be parallelized?
Here's how I try to compensate lack of transform
with unseq
#include <vector>
#include <execution>
std::vector<int> v_a(16);
std::vector<int> v_b(16);
std::vector<int> v_c(16);
void sum(int* c, int* a, int* b)
{
std::generate_n(std::execution::unseq, c, 16, [=]() mutable {
return *(a++) + *(b++);
});
}
int main()
{
sum(v_c.data(), v_a.data(), v_b.data());
}
Should
generate
andgenerate_n
also be parallelized?
Your sample program is a great demonstration of why generate
and generate_n
can't be meaningfully parallelized: the function object is almost always stateful, and evaluations almost always order-dependent. I suspect the writer of this program would be very unhappy if the library spawned 16 threads with their own copies of the lambda resulting in all elements of v_c
being equal to v_a[0] + v_b[0]
.
Why does execution policy parameter even exist for them?
Why does execution policy parameter even exist for them?
Because the 'design' process for the parallel algorithms was 'if nobody can come up with a reason why it can't be parallelized, without looking at real implementations, in 10 minutes, it gets a parallel one'. Note that partial_sort
has such an overload too even though it is a heap algorithm and none of the other heap algorithms got parallel overloads.
I'm trying to make this work without #pragma
in my code:
#include <vector>
std::vector<int> v_a(16);
std::vector<int> v_b(16);
std::vector<int> v_c(16);
void sum(int c[], int a[], int b[])
{
#pragma loop(ivdep)
for (int i = 0; i < 16; i++)
{
c[i] = a[i] + b[i];
}
}
int main()
{
sum(v_c.data(), v_a.data(), v_b.data());
}
Note that without a pragma the compilers still emits SIMD version, but goes to it conditionally at runtime.
Now that this is useless:
void sum(int * c, int* a, int * b)
{
std::transform(std::execution::unseq, a, a + 16, b, c, std::plus{});
}
As well as this:
void sum(int* c, int* a, int* b)
{
std::for_each(std::execution::unseq, c, c + 16, [=](int& v) {
v = a[&v - c] + b[&v - c];
});
}
How else I can say it?
I don't believe we have a way to say that without a pragma at this time. It's really on the optimizer team if they want to consume the unseq signal and they have declined to do so as of yet.
Some strategies that can be used:
Just call parallel
transform
:replace_copy
replace_copy_if
Scans:
copy_if
partition_copy
remove_copy
remove_copy_if
unique
unique_copy
Same as serial
nth_element
but call the parallel partition op for large N:nth_element
Predicate tests (like
all_of
):lexicographical_compare
Summary statistics (like
find
/find_end
):min_element
max_element
minmax_element
Divide and conquer:
inplace_merge
stable_partition
Divide range1 into chunks, binary search to find matching range2 chunks, scan:
merge
set_symmetric_difference
set_union
Other:
includes