Problem

Given an omp target region with nested SPMD regions, we launch a parent kernel for the target region and child kernels for enclosed SPMD regions.

In the following example, k and k2 are implicitly shared in the second and third parallel for regions. They need to be passed to the child kernel. Please ignore the data race for sum and sum2 for simplicity.

void foo(int n) {
  int sum = 10000;
  int sum2 = 10;
#pragma omp target teams map(tofrom : sum) num_teams(n)
  { // target region starts
#pragma omp parallel for
    for (int i = 1; i <= 80; i++) {
      sum2 += i;
    }

    int k = 10;
#pragma omp parallel for
    for (int i = 1; i <= 100; i++)
      sum += i + k;

    int *k2 = &k;
    for (int i = 0; i < 10; i++)
      k += i;

#pragma omp parallel for
    for (int i = 1; i <= 100; i++)
      sum += i + *k2;
  } // target region ends

  printf("sum = %d\n", sum);
}

Solution discussion for REX

One solution is that REX declares an array for all shared variables. The array size is the number of threads in the enclosing region. In this case, it's n. Then each thread accesses the original k with the corresponding thread id in this global array. According to the OpenMP spec, the number of teams could be determined at runtime. However, to declare variables in the global scope, n has to be a constant. Therefore, this solution doesn't always work.

Another solution is that REX creates a local copy of k, say k_p__, on the heap using the CUDA version malloc. Then we pass the local copy k_p__ to the child kernel. However, this approach may not work for pointers. For example, k2 points to a local memory address. Creating a copy of k2 on the heap doesn't help. We need to track back to the original variable, in this case, k.

This approach is probably difficult because: it could be a long path between associating k2 with k and passing k2 to the child kernel. We have to create the heap copy of k from the very beginning and replace all the usage of k with its heap copy k_p__. In this way, we guarantee k2 always holds the expected value.

Solution in LLVM

LLVM doesn't use dynamic parallelism, all threads are launched at the beginning. A state machine controls which part of them joins the computing. At the LLVM IR level, there is no difference between shared and purely local variables. Generic alloca instruction handles both types of variables. However, the backend will generate different codes for them.

For purely local variables, they are allocated on the local stack. For shared variables, they are allocated in the shared memory. In the examples above, given the same int k = 10;, if k is not used in any SPMD region, it's on the stack. Otherwise, it's created in the shared memory. If the size of shared variables is too large to put in the shared memory, they will be created on the global memory instead.

Compare two solutions for REX

Since only data in the global memory are allowed to pass to the child kernel, we can't use shared memory as LLVM does. Global heap is the only choice. The question is whether we should:

create a local copy on the heap and synchronize the copy with the original variable on the stack;
or modify the original declaration to use the heap directly instead of the stack.

Solution 1 would be more resource-efficient because the heap is used only if necessary. Solution 2 would be easier to implement because we don't need to worry about the synchronization between the original variable and copies.

For the example above, the transformed code should be like this:

Solution 1

__global__ void outlined_target_teams(int *sum_p__, int *sum2_p__) {

  outlined_func_parallel_1<<<1, 128>>>(sum2_p__);

  int k = 10;
  {
    int *local_k_p__ = (int *)malloc(sizeof(int));
    *local_k_p__ = k;
    outlined_func_parallel_2<<<1, 128>>>(sum_p__, local_k_p__);
    k = *local_k_p__;
    free(local_k_p__);
  }

  int *local_k_p__ = (int *)malloc(sizeof(int));
  *local_k_p__ = k;
  int *k2 = &(*local_k_p__);
  for (int i = 0; i < 10; i++)
    *local_k_p__ += i;

  {
    outlined_func_parallel_3<<<1, 128>>>(sum_p__, k2);
  }
  k = *local_k_p__;
  k2 = &k;
  free(local_k_p__);

  printf("sum = %d\n", sum);
}

In this example, the transformed code above should be OK. However, we were using the local copy on the heap for computing all the time. How should we synchronize that to the original k and other involved pointers? There could be high risks in creating dangling pointers. Some other pointers not shared in the SPMD region are still pointing to the heap after the local copy is freed.

Solution 2

__global__ void outlined_target_teams(int *sum_p__, int *sum2_p__) {

  outlined_func_parallel_1<<<1, 128>>>(sum2_p__);

  int *k_p__ = (int *)malloc(sizeof(int));
  *k_p__ = 10;

  outlined_func_parallel_2<<<1, 128>>>(sum_p__, k_p__);

  int *k2 = &(*k_p__);
  for (int i = 0; i < 10; i++)
    *k_p__ += i;

  outlined_func_parallel_3<<<1, 128>>>(sum_p__, k2);

  printf("sum = %d\n", sum);
  free(k_p__);
}

Thoughts

To me, solution 2 seems better. It does change the semantics of the original code (stack declaration to heap declaration). The performance is worse due to using global memory instead of local memory. However, it's easier for compiler transformation and more robust than solution 1.

passlab / rexompiler

Data sharing in dynamic parallelism #153