Given a reduction example:
int sum = 10000;
#pragma omp target parallel for reduction(+ : sum)
for (int i = 1; i <= 100; i++)
sum += i;
The original ROSE approach is as follows: sum is reduced within each block, and all the threads in a block produce one value in a temporary device array. It doesn't quite fit REX because:
There could be three solutions:
1. Use OpenMP runtime APIs:
1.1 Similar to the ROSE approach, we can use omp_target_alloc to allocate a memory block for reduction, but its size should be num_of_blocks * num_threads_per_block * size_of_data_type. We will complete the reduction entirely on the device.
1.2 After executing the kernel, we use omp_target_memcpy to copy the reduced final result from the device to the host.
1.3 At last, omp_target_free releases the temporary memory block on the device.
For this solution, we must add these runtime APIs into the REX header rex_kmp.h.
int *__reduce_sum = omp_target_alloc(256 * 128 * sizeof(int), device_id);
... // execute the generated kernel
omp_target_memcpy(&sum, __reduce_sum, sizeof(int), 0, 0, host_id, device_id); // only copy the first element back
omp_target_free(__reduce_sum, device_id);
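For reference, a minimal end-to-end sketch of this flow is below. The grid shape (256 blocks x 128 threads), the two-phase device-side combination, and the helper name reduce_on_device are illustrative assumptions, not the actual REX-generated code.

#include <omp.h>

int reduce_on_device(void) {
    int sum = 10000;
    int device_id = omp_get_default_device();
    int host_id = omp_get_initial_device();
    int num_slots = 256 * 128; // assumed num_of_blocks * num_threads_per_block

    int *__reduce_sum = (int *)omp_target_alloc(num_slots * sizeof(int), device_id);

    // Phase 1: each thread accumulates a partial sum into its own slot.
    #pragma omp target teams distribute parallel for is_device_ptr(__reduce_sum)
    for (int t = 0; t < num_slots; t++) {
        int partial = 0;
        for (int i = t + 1; i <= 100; i += num_slots)
            partial += i;
        __reduce_sum[t] = partial;
    }

    // Phase 2: combine the partial sums into __reduce_sum[0], still on the device.
    #pragma omp target is_device_ptr(__reduce_sum)
    for (int t = 1; t < num_slots; t++)
        __reduce_sum[0] += __reduce_sum[t];

    // Copy only the first element back and fold it into the host value of sum.
    int device_result = 0;
    omp_target_memcpy(&device_result, __reduce_sum, sizeof(int), 0, 0, host_id, device_id);
    sum += device_result;

    omp_target_free(__reduce_sum, device_id);
    return sum; // 10000 + 5050
}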
2. Use LLVM runtime APIs:
We can also avoid introducing new APIs and rely on existing ones instead.
2.1 We use a regular malloc to create the temporary memory block on the host, e.g. __reduce_sum[0:num_of_blocks * num_threads_per_block * size_of_data_type].
2.2 Add __reduce_sum as if it were in a map(from) clause. Then a corresponding memory block will be allocated on the device.
2.3 Execute the kernel and reduce sum to __reduce_sum[0].
2.4 When leaving the omp target region, __reduce_sum will be transferred back to the host, and its very first element is the final result.
For this solution, the data transfer causes extra overhead because we only need the first element of __reduce_sum instead of the whole array.
int *__reduce_sum = malloc(256 * 128 * sizeof(int));
// add a map(from) clause
#pragma omp target parallel for reduction(+ : sum) map(from: __reduce_sum[0: 256 * 128])
...
sum = __reduce_sum[0];
free(__reduce_sum);
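A possible compilable sketch of this variant is shown below. The per-thread slot indexing, the device-side combination step, and the 256 * 128 sizing are assumptions about what the generated code could look like, not REX output.

#include <omp.h>
#include <stdlib.h>

#define NUM_SLOTS (256 * 128) // assumed num_of_blocks * num_threads_per_block

int reduce_with_map_from(void) {
    int sum = 10000;
    int *__reduce_sum = malloc(NUM_SLOTS * sizeof(int));

    #pragma omp target map(from: __reduce_sum[0:NUM_SLOTS])
    {
        int nthreads_used = 1;
        #pragma omp parallel shared(nthreads_used)
        {
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            #pragma omp single
            nthreads_used = nthreads;

            // each thread writes its partial sum into its own slot
            int partial = 0;
            for (int i = tid + 1; i <= 100; i += nthreads)
                partial += i;
            __reduce_sum[tid] = partial;
        }
        // combine the partials into __reduce_sum[0] before leaving the region
        for (int t = 1; t < nthreads_used; t++)
            __reduce_sum[0] += __reduce_sum[t];
    }
    // The whole NUM_SLOTS-element array is copied back here, although
    // only __reduce_sum[0] is needed -- this is the extra overhead.
    sum += __reduce_sum[0];
    free(__reduce_sum);
    return sum; // 10000 + 5050
}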
3. Alternative to solution 2:
To optimize the data transfer, we can create a target data region to enclose the original offloading region. We add a map(alloc) clause to create __reduce_sum on the device. After executing the kernel, since it is only an allocation, no data will be transferred at all. Before leaving the target data region, we use target update to manually copy the final result __reduce_sum[0] back to the host. It seems a little better, but initiating another GPU operation may incur higher overhead than solution 2. Furthermore, the added directives/APIs may conflict with existing ones, e.g. there may already be an enclosing target data region.
int *__reduce_sum = malloc(256 * 128 * sizeof(int));
#pragma omp target data map(alloc:__reduce_sum[0: 256 * 128])
{
... // Execute the kernel
#pragma omp target update from(__reduce_sum[0:1])
}
sum = __reduce_sum[0];
free(__reduce_sum);
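Filling in the elided kernel, a compact compilable version might look like the following. The serial device-side loop merely stands in for the real generated reduction kernel and is a placeholder assumption; the point is that the inner target construct reuses the buffer created by the enclosing target data region, and target update moves exactly one element.

#include <omp.h>
#include <stdlib.h>

int reduce_with_target_update(void) {
    int sum = 10000;
    int *__reduce_sum = malloc(256 * 128 * sizeof(int));

    #pragma omp target data map(alloc: __reduce_sum[0:256 * 128])
    {
        // Placeholder for the generated kernel: it leaves the combined
        // result in __reduce_sum[0] on the device.
        #pragma omp target
        {
            __reduce_sum[0] = 0;
            for (int i = 1; i <= 100; i++)
                __reduce_sum[0] += i;
        }
        // Copy only the first element back before the data region ends.
        #pragma omp target update from(__reduce_sum[0:1])
    }
    sum += __reduce_sum[0];
    free(__reduce_sum);
    return sum; // 10000 + 5050
}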
Implementation:
Currently, the reduction clause can only hold one variable, e.g. reduction(+ : a) works but reduction(+ : a, b) does not. We may address this problem later.
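As a possible interim workaround (my assumption, not something REX already provides), each variable can be listed in its own reduction clause, so that every clause still holds a single variable:

int a = 0, b = 0;
// one reduction clause per variable instead of reduction(+ : a, b)
#pragma omp target parallel for reduction(+ : a) reduction(+ : b)
for (int i = 1; i <= 100; i++) {
    a += i;
    b += 2 * i;
}
// a == 5050, b == 10100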
After migrating to UPIR, the transformation for reduction is broken since its implementation is still based on the old IR. First, we need to implement the replacement for transOmpVariables, e.g. transUpirVariables.
Reduction on GPU also triggers an implicit data mapping as if the variable were in a map(tofrom) clause, even if no map clause is specified for the reduction variable.
We hold the development based on UPIR; however, the transformation of reduction on GPU still needs to be updated.
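To illustrate that implicit mapping (standard OpenMP behavior for reduction variables on a target construct, not something REX-specific), the original example behaves as if map(tofrom: sum) had been written: the host value travels to the device and the reduced result comes back.

int sum = 10000;
// no explicit map clause for sum; it is treated as if map(tofrom: sum) were given
#pragma omp target parallel for reduction(+ : sum)
for (int i = 1; i <= 100; i++)
    sum += i;
// after the region, sum == 10000 + 5050 == 15050 on the host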