spcl / dace

DaCe - Data Centric Parallel Programming
http://dace.is/fast
BSD 3-Clause "New" or "Revised" License

Broadcasts to Shared Memory on GPU Runs in Serial #1284

Open computablee opened 1 year ago

computablee commented 1 year ago

Describe the bug When running code on a GPU, if you have a block of shared memory and broadcast a value to it, the generated CUDA performs the assignment in serial. In my reproducer the performance and behavior are identical to the correct version, but I've encountered a case where this bug led to code that was both slower and incorrect.

To Reproduce Steps to reproduce the behavior:

import dace

# Assumptions: I, J, I1, J1 are dace symbols defined beforehand, and
# find_map_by_param is the MapEntry-lookup helper used in the DaCe tutorials.

@dace.program
def reproduce():
    value: dace.float32 = 0

    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)
        smem[:] = 0.0

        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]

    return value

sdfg = reproduce.to_sdfg()
sdfg.apply_gpu_transformations()

blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

print(sdfg())

This code generates the following CUDA for the broadcast smem[:] = 0.0:

__shared__ float smem[1024];
{
    for (auto __i0 = 0; __i0 < 32; __i0 += 1) {
        for (auto __i1 = 0; __i1 < 32; __i1 += 1) {
            {
                float __out;

                ///////////////////
                // Tasklet code (assign_13_8)
                __out = 0.0;
                ///////////////////

                smem[((32 * __i0) + __i1)] = __out;
            }
        }
    }
}
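For illustration, the difference can be modeled in plain Python, independently of DaCe and CUDA. In the generated code above, every thread executes the full nested loop and writes all 1024 elements of smem, whereas the expected schedule has each thread of a 32x32 block write exactly one element. A minimal sketch (the function names are hypothetical; only the flattened index 32 * __i0 + __i1 mirrors the generated code):

```python
# Plain-Python model of the two broadcast schedules (no DaCe/CUDA required).

BLOCK_DIM = 32  # a 32x32 thread block covering the 1024-element smem array

def serial_broadcast():
    """The buggy codegen: one thread walks the whole nested loop."""
    writes = []
    for i0 in range(BLOCK_DIM):
        for i1 in range(BLOCK_DIM):
            writes.append(32 * i0 + i1)  # smem[(32 * __i0) + __i1] = 0.0
    return writes

def threadblock_broadcast():
    """The expected codegen: thread (ty, tx) writes exactly one element."""
    return [32 * ty + tx for ty in range(BLOCK_DIM) for tx in range(BLOCK_DIM)]

# Both schedules touch the same 1024 indices, so results agree here...
assert sorted(serial_broadcast()) == sorted(threadblock_broadcast())
# ...but the serial version issues 1024 writes from a single thread, while the
# parallel version issues one write per thread, which is why the bug can cost
# performance (and correctness, once threads race on redundant writes).
```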

Expected behavior Since the outermost map is scheduled as GPU_Device, the broadcast should be generated as a GPU_ThreadBlock loop. With the following Python code:

@dace.program
def correct():
    value: dace.float32 = 0

    for i, j in dace.map[0:I:I1, 0:J:J1]:
        smem = dace.ndarray((I1, J1), dtype=dace.float32, storage=dace.StorageType.GPU_Shared)

        for k, l in dace.map[0:I1, 0:J1]:
            smem[k, l] = 0.0

        for i1, j1 in dace.map[0:I1, 0:J1]:
            value += smem[i1, j1]

    return value

sdfg = correct.to_sdfg()
sdfg.apply_gpu_transformations()

blockentry = find_map_by_param(sdfg, 'j1')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

blockentry = find_map_by_param(sdfg, 'l')
blockentry.map.schedule = dace.ScheduleType.GPU_ThreadBlock

print(sdfg())

The correct CUDA is generated:

 __shared__ float smem[1024];
{
    {
        {
            int l = threadIdx.x;
            int k = threadIdx.y;
            {
                {
                    {
                        float __out;

                        ///////////////////
                        // Tasklet code (assign_28_12)
                        __out = 0.0;
                        ///////////////////

                        smem[((32 * k) + l)] = __out;
                    }
                }
            }
        }
    }
}


alexnick83 commented 1 year ago

Hello, and thank you for reporting this bug. Could you check if the issue has been fixed in the latest master?

computablee commented 1 year ago

As of the current master, the issue appears to persist.

alexnick83 commented 1 year ago

It is possible to get the desired behavior by calling GPUTransform in the following manner:

sdfg.apply_gpu_transformations(sequential_innermaps=False)

The generated code should be:

__global__ void __launch_bounds__(1024) reproduce_19_0_1_0(float * __restrict__ value) {
    {
        {
            int j = (32 * blockIdx.x);
            int i = (32 * blockIdx.y);
            __shared__ float smem[1024];
            {
                {
                    {
                        int __i1 = threadIdx.x;
                        int __i0 = threadIdx.y;
                        {
                            {
                                {
                                    float __out;

                                    ///////////////////
                                    // Tasklet code (assign_21_8)
                                    __out = 0.0;
                                    ///////////////////

                                    smem[((32 * __i0) + __i1)] = __out;
                                }
                            }
                        }
                    }
                }
            }
            __syncthreads();
            {
            ...
computablee commented 1 year ago

That makes sense. I would have figured, though, that this should be the default behavior, and that you shouldn't have to "opt in" with sequential_innermaps=False.

alexnick83 commented 1 year ago

That makes sense. I would have figured, though, that this should be the default behavior, and that you shouldn't have to "opt in" with sequential_innermaps=False.

It is a bit more complicated because the primary target for the GPU transformation is generic/CPU-based SDFGs. Such SDFGs will have arbitrary Map nests, where the inner Maps may not match any thread-block sizes implied by the outer Maps. This is why the default behavior is to render inner Maps sequentially, which unfortunately doesn't always work well in cases where the SDFG is already partially GPU-transformed.
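The concern can be sketched without DaCe. If an inner map's extent does not match the thread-block size implied by the outer map, a naive one-thread-per-iteration mapping silently drops iterations, whereas rendering the inner map sequentially inside each thread always covers the full range (at the cost of redundancy). A hypothetical sketch with an inner map of 40 iterations inside a 32-thread block (the sizes are invented for illustration):

```python
# Why sequential inner maps are the safe default: a DaCe-free sketch.

BLOCK_DIM = 32      # hypothetical threads per block
INNER_EXTENT = 40   # inner-map extent that does NOT match the block size

# Naive parallelization: thread t handles iteration t only.
naive_covered = set(range(BLOCK_DIM))
missed = set(range(INNER_EXTENT)) - naive_covered
assert missed == set(range(32, 40))  # iterations 32..39 never execute

# Sequential rendering: every thread runs the full inner loop itself,
# so all iterations are covered regardless of the block size.
sequential_covered = set()
for t in range(BLOCK_DIM):           # each thread...
    for it in range(INNER_EXTENT):   # ...executes the whole inner map
        sequential_covered.add(it)
assert sequential_covered == set(range(INNER_EXTENT))
```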

The bug that you describe above is clear, and we will look into fixing it. Thank you for reporting it!