computablee opened this issue 1 year ago
Hello, and thank you for reporting this bug. Could you check if the issue has been fixed in the latest master?
As of the current master, the issue appears to persist.
It is possible to get the desired behavior by calling GPUTransform in the following manner:
sdfg.apply_gpu_transformations(sequential_innermaps=False)
The generated code should be:
__global__ void __launch_bounds__(1024) reproduce_19_0_1_0(float * __restrict__ value) {
    {
        {
            int j = (32 * blockIdx.x);
            int i = (32 * blockIdx.y);
            __shared__ float smem[1024];
            {
                {
                    {
                        int __i1 = threadIdx.x;
                        int __i0 = threadIdx.y;
                        {
                            {
                                {
                                    float __out;
                                    ///////////////////
                                    // Tasklet code (assign_21_8)
                                    __out = 0.0;
                                    ///////////////////
                                    smem[((32 * __i0) + __i1)] = __out;
                                }
                            }
                        }
                    }
                }
            }
            __syncthreads();
            {
                ...
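
Purely as an illustration of how this workaround is applied (the program name, array size, and data below are assumptions and not taken from the issue), a minimal DaCe sketch might look like this:

# Illustrative sketch only; program name, shapes, and data are assumptions.
import dace
import numpy as np

N = 1024

@dace.program
def example(value: dace.float32[N, N]):
    value[:] = 0.0

if __name__ == '__main__':
    sdfg = example.to_sdfg()
    # Workaround from the discussion above: keep inner maps parallel
    # (GPU_ThreadBlock) rather than rendering them as sequential loops.
    sdfg.apply_gpu_transformations(sequential_innermaps=False)

    out = np.ones((N, N), dtype=np.float32)
    sdfg(value=out)  # running requires a CUDA-capable GPU and toolkit
    assert np.allclose(out, 0.0)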
That makes sense. I suppose I'd argue that this should be the default behavior, though, and that you shouldn't have to "opt in" with sequential_innermaps=False.
It is a bit more complicated than that: the primary target of the GPU transformation is generic, CPU-oriented SDFGs. Such SDFGs can contain arbitrary Map nests in which the inner Maps do not necessarily match any thread-block size implied by the outer Maps. This is why the default behavior is to render inner Maps sequentially, which unfortunately does not always work well when the SDFG is already partially GPU-transformed.
The bug that you describe above is clear, and we will look into fixing it. Thank you for reporting it!
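To make the point concrete with a hypothetical example (not taken from the issue): in a generic SDFG like the sketch below, the inner Map's extent K is a free symbol with no relation to any thread-block size, so the default transformation renders it as a sequential loop inside each GPU thread.

# Hypothetical generic (CPU-style) SDFG with an arbitrary Map nest.
import dace

N = dace.symbol('N')
K = dace.symbol('K')

@dace.program
def nested(a: dace.float64[N, K], b: dace.float64[N, K]):
    for i in dace.map[0:N]:      # outer Map: mapped to GPU_Device by the transformation
        for k in dace.map[0:K]:  # inner Map: extent K is unrelated to any block size
            b[i, k] = a[i, k] + 1.0

# With the default sequential_innermaps=True, the inner Map above is emitted as a
# sequential loop per GPU thread after nested.to_sdfg().apply_gpu_transformations().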
Describe the bug
When running code on a GPU, if you have a block of shared memory and broadcast a value to it, the generated CUDA performs the assignment serially. In my reproducer the performance and behavior happen to be identical either way, but I have encountered a case where this bug led to slower code that was also incorrect.
To Reproduce
Steps to reproduce the behavior:
This code outputs the following CUDA for smem[:] = 0:

Expected behavior
Since the outermost loop is a GPU_Device loop, the broadcast should be a GPU_ThreadBlock loop. Using the following Python code:

The correct CUDA is generated:
Desktop (please complete the following information):