uwhpsc-2016 / lectures

Notes, slides, and code from the in-class lectures.
7 stars 21 forks source link

Example manual chunking code #14

Open hughkf opened 8 years ago

hughkf commented 8 years ago

Here is my code to calculate pi, adapted from one of the secondary references ("Introduction to OpenMP" by Tim Mattson on Youtube).

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 10
#define NUM_STEPS 100000

void main()
{
    int i, nthreads;
    double pi = 0.0, sum[NUM_THREADS], step = 1.0 / (double)NUM_STEPS;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel
    {
        int i, id, nthrds;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;

        /* round-robin with chunk = NUM_STEPS / NUM_THREADS */
        for (i = id, sum[id] = 0.0; i < NUM_STEPS; i += nthrds)
        {
            sum[id] += 4.0 / (1.0 + (i + 0.5) * step* (i + 0.5) * step);
        }
    }
    for (i = 0; i < nthreads; i++) pi += sum[i] * step;
    printf("pi: %f\n", pi);
}
cswiercz commented 8 years ago

Looks good! Thank you for providing this example! (That video series is great an, actually, I will be working through this example in class for Lecture 12. So...you're already one lecture ahead!)

I wrote a similar "manual chunking" code during Lecture 11 where the loop step size was the number of threads. I like the name "round-robin" for this kind of array access.

Though...not to jump into picking apart your code...the round-robin access pattern is not cache efficient within each thread if you're using it to access elements of an array: thread id will require the elements

a[id], a[id + nthreads], a[id + 2*nthreads], a[id + 3*nthreads], ...

If nthreads is large enough then each access will result in a cache miss!

hughkf commented 8 years ago

In that case, we need to make the chunks contiguous, right? This following, updated code attempts to avoid cache misses by making the iteration chunks contiguous.

#include <omp.h>
#include <stdio.h>

#define NUM_THREADS 10
#define NUM_STEPS 100000
#define CHUNK NUM_STEPS / NUM_THREADS

void main()
{
    int i, nthreads;
    double pi = 0.0, sum[NUM_THREADS], step = 1.0 / (double)NUM_STEPS;

    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel
    {
        int i, id, nthrds;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        if (id == 0) nthreads = nthrds;

        /* continguous with chunk = NUM_STEPS / NUM_THREADS */
        for (i = id * CHUNK, sum[id] = 0.0; i < (id + 1) * CHUNK; i++)
        {
            sum[id] += 4.0 / (1.0 + (i + 0.5) * step* (i + 0.5) * step);
        }
    }
    for (i = 0; i < nthreads; i++) pi += sum[i] * step;
    printf("pi: %f\n", pi);
}
hughkf commented 8 years ago

Although, this only works if NUM_STEPS % NUM_THREADS == 0.