Given a reduction example:
int sum = 10000;
#pragma omp target parallel for reduction(+ : sum)
for (int i = 1; i <= 100; i++)
sum += i;
The original ROSE approach is as follows: sum is reduced within each block, and all the threads in a block produce one value in a temporary device array. It doesn't quite fit REX because:
There could be three solutions:
1. Use OpenMP runtime APIs:
1.1 Similar to the ROSE approach, we can use omp_target_alloc to allocate a memory block for reduction, but its size should be num_of_blocks * num_threads_per_block * size_of_data_type. We will complete the reduction entirely on the device.
1.2 After executing the kernel, we use omp_target_memcpy to copy the reduced final result from the device to the host.
1.3 At last, omp_target_free releases the temporary memory block on the device.
For this solution, we must add these runtime APIs into the REX header rex_kmp.h.
int *__reduce_sum = omp_target_alloc(256 * 128 * sizeof(int), device_id);
... // execute the generated kernel
omp_target_memcpy(&sum, __reduce_sum, sizeof(int), 0, 0, host_id, device_id); // only copy the first element back
omp_target_free(__reduce_sum, device_id);
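For reference, a minimal end-to-end sketch of this flow is below. The grid shape (256 blocks x 128 threads), the two-phase device-side combination, and the helper name reduce_on_device are illustrative assumptions, not the actual REX-generated code.

#include <omp.h>

int reduce_on_device(void) {
    int sum = 10000;
    int device_id = omp_get_default_device();
    int host_id = omp_get_initial_device();
    int num_slots = 256 * 128; // assumed num_of_blocks * num_threads_per_block

    int *__reduce_sum = (int *)omp_target_alloc(num_slots * sizeof(int), device_id);

    // Phase 1: each thread accumulates a partial sum into its own slot.
    #pragma omp target teams distribute parallel for is_device_ptr(__reduce_sum)
    for (int t = 0; t < num_slots; t++) {
        int partial = 0;
        for (int i = t + 1; i <= 100; i += num_slots)
            partial += i;
        __reduce_sum[t] = partial;
    }

    // Phase 2: combine the partial sums into __reduce_sum[0], still on the device.
    #pragma omp target is_device_ptr(__reduce_sum)
    for (int t = 1; t < num_slots; t++)
        __reduce_sum[0] += __reduce_sum[t];

    // Copy only the first element back and fold it into the host value of sum.
    int device_result = 0;
    omp_target_memcpy(&device_result, __reduce_sum, sizeof(int), 0, 0, host_id, device_id);
    sum += device_result;

    omp_target_free(__reduce_sum, device_id);
    return sum; // 10000 + 5050
}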
2. Use LLVM runtime APIs:
We can also avoid introducing new APIs and rely on existing ones instead.
2.1 We use a regular malloc to create the temporary memory block on the host, e.g. __reduce_sum[0:num_of_blocks * num_threads_per_block * size_of_data_type].
2.2 Add __reduce_sum as if it were in a map(from) clause. Then a corresponding memory block will be allocated on the device.
2.3 Execute the kernel and reduce sum to __reduce_sum[0].
2.4 When leaving the omp target region, __reduce_sum will be transferred back to the host, and its very first element is the final result.
For this solution, the data transfer causes extra overhead because we only need the first element of __reduce_sum instead of the whole array.
int *__reduce_sum = malloc(256 * 128 * sizeof(int));
// add a map(from) clause
#pragma omp target parallel for reduction(+ : sum) map(from: __reduce_sum[0: 256 * 128])
...
sum = __reduce_sum[0];
free(__reduce_sum);
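A possible compilable sketch of this variant is shown below. The per-thread slot indexing, the device-side combination step, and the 256 * 128 sizing are assumptions about what the generated code could look like, not REX output.

#include <omp.h>
#include <stdlib.h>

#define NUM_SLOTS (256 * 128) // assumed num_of_blocks * num_threads_per_block

int reduce_with_map_from(void) {
    int sum = 10000;
    int *__reduce_sum = malloc(NUM_SLOTS * sizeof(int));

    #pragma omp target map(from: __reduce_sum[0:NUM_SLOTS])
    {
        int nthreads_used = 1;
        #pragma omp parallel shared(nthreads_used)
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            #pragma omp single
            nthreads_used = nthreads;

            // each thread writes its partial sum into its own slot
            int partial = 0;
            for (int i = tid + 1; i <= 100; i += nthreads)
                partial += i;
            __reduce_sum[tid] = partial;
        }
        // combine the partials into __reduce_sum[0] before leaving the region
        for (int t = 1; t < nthreads_used; t++)
            __reduce_sum[0] += __reduce_sum[t];
    }
    // The whole NUM_SLOTS-element array is copied back here, although
    // only __reduce_sum[0] is needed -- this is the extra overhead.
    sum += __reduce_sum[0];
    free(__reduce_sum);
    return sum; // 10000 + 5050
}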
3. Alternative to solution 2:
To optimize the data transfer, we can create a target data region to enclose the original offloading region. We add a map(alloc) clause to create __reduce_sum on the device. After executing the kernel, since it is only an allocation, no data will be transferred at all. Before leaving the target data region, we use target update to manually copy the final result __reduce_sum[0] back to the host. It seems a little better, but initiating another GPU operation may incur higher overhead than solution 2. Furthermore, the added directives/APIs may conflict with existing ones, e.g. there may already be an enclosing target data region.
int *__reduce_sum = malloc(256 * 128 * sizeof(int));
#pragma omp target data map(alloc:__reduce_sum[0: 256 * 128])
{
... // Execute the kernel
#pragma omp target update from(__reduce_sum[0:1])
}
sum = __reduce_sum[0];
free(__reduce_sum);
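Filling in the elided kernel, a compact compilable version might look like the following. The serial device-side loop merely stands in for the real generated reduction kernel and is a placeholder assumption; the point is that the inner target construct reuses the buffer created by the enclosing target data region, and target update moves exactly one element.

#include <omp.h>
#include <stdlib.h>

int reduce_with_target_update(void) {
    int sum = 10000;
    int *__reduce_sum = malloc(256 * 128 * sizeof(int));

    #pragma omp target data map(alloc: __reduce_sum[0:256 * 128])
    {
        // Placeholder for the generated kernel: it leaves the combined
        // result in __reduce_sum[0] on the device.
        #pragma omp target
        {
            __reduce_sum[0] = 0;
            for (int i = 1; i <= 100; i++)
                __reduce_sum[0] += i;
        }
        // Copy only the first element back before the data region ends.
        #pragma omp target update from(__reduce_sum[0:1])
    }
    sum += __reduce_sum[0];
    free(__reduce_sum);
    return sum; // 10000 + 5050
}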
Implementation:
Currently, the reduction clause can only hold one variable, e.g. reduction(+ : a) works but reduction(+ : a, b) does not. We may address this problem later.
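As a possible interim workaround (my assumption, not something REX already provides), each variable can be listed in its own reduction clause, so that every clause still holds a single variable:

int a = 0, b = 0;
// one reduction clause per variable instead of reduction(+ : a, b)
#pragma omp target parallel for reduction(+ : a) reduction(+ : b)
for (int i = 1; i <= 100; i++) {
    a += i;
    b += 2 * i;
}
// a == 5050, b == 10100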
After migrating to UPIR, the transformation for reduction is broken since its implementation is still based on the old IR. First, we need to implement the replacement for transOmpVariables, e.g. transUpirVariables.
Reduction on GPU also triggers an implicit data mapping as if the variable were in a map(tofrom) clause, even if no map clause is specified for the reduction variable.
We hold the development based on UPIR; however, the transformation of reduction on GPU still needs to be updated.
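To illustrate that implicit mapping (standard OpenMP behavior for reduction variables on a target construct, not something REX-specific), the original example behaves as if map(tofrom: sum) had been written: the host value travels to the device and the reduced result comes back.

int sum = 10000;
// no explicit map clause for sum; it is treated as if map(tofrom: sum) were given
#pragma omp target parallel for reduction(+ : sum)
for (int i = 1; i <= 100; i++)
    sum += i;
// after the region, sum == 10000 + 5050 == 15050 on the host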