microsoft / STL

MSVC's implementation of the C++ Standard Library.
Other
10.01k stars 1.47k forks source link

<execution>: Parallelize more algorithms #7

Open BillyONeal opened 4 years ago

BillyONeal commented 4 years ago

Some strategies that can be used:

Just call parallel transform:

Scans:

Same as serial nth_element but call the parallel partition op for large N:

Predicate tests (like all_of):

Summary statistics (like find / find_end):

Divide and conquer:

Divide range1 into chunks, binary search to find matching range2 chunks, scan:

Other:

AlexGuteniev commented 4 years ago

Should generate and generate_n also be parallelized?

Here's how I try to compensate lack of transform with unseq

#include <vector>
#include <execution>

std::vector<int> v_a(16);
std::vector<int> v_b(16);
std::vector<int> v_c(16);

void sum(int* c, int* a, int* b)
{
    std::generate_n(std::execution::unseq, c, 16, [=]() mutable {
        return *(a++) + *(b++);
    });
}

int main()
{
    sum(v_c.data(), v_a.data(), v_b.data());
}
CaseyCarter commented 4 years ago

Should generate and generate_n also be parallelized?

Your sample program is a great demonstration of why generate and generate_n can't be meaningfully parallelized: the function object is almost always stateful, and evaluations almost always order-dependent. I suspect the writer of this program would be very unhappy if the library spawned 16 threads with their own copies of the lambda resulting in all elements of v_c being equal to v_a[0] + v_b[0].

AlexGuteniev commented 4 years ago

Why does execution policy parameter even exist for them?

BillyONeal commented 4 years ago

Why does execution policy parameter even exist for them?

Because the 'design' process for the parallel algorithms was 'if nobody can come up with a reason why it can't be parallelized, without looking at real implementations, in 10 minutes, it gets a parallel one'. Note that partial_sort has such an overload too even though it is a heap algorithm and none of the other heap algorithms got parallel overloads.

AlexGuteniev commented 4 years ago

I'm trying to make this work without #pragma in my code:

#include <vector>

std::vector<int> v_a(16);
std::vector<int> v_b(16);
std::vector<int> v_c(16);

void sum(int c[], int a[], int b[])
{
#pragma loop(ivdep)
    for (int i = 0; i < 16; i++)
    {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    sum(v_c.data(), v_a.data(), v_b.data());
}

Note that without a pragma the compilers still emits SIMD version, but goes to it conditionally at runtime.

Now that this is useless:

void sum(int * c, int* a, int * b)
{
   std::transform(std::execution::unseq, a, a + 16, b, c, std::plus{});
}

As well as this:

void sum(int* c, int* a, int* b)
{
    std::for_each(std::execution::unseq, c, c + 16, [=](int& v) {
        v = a[&v - c] + b[&v - c];
    });
}

How else I can say it?

BillyONeal commented 4 years ago

I don't believe we have a way to say that without a pragma at this time. It's really on the optimizer team if they want to consume the unseq signal and they have declined to do so as of yet.