gpgpu-sim / gpgpu-sim_distribution

GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.

Incorrect results produced by warp shuffles in gpgpu-sim #230

Open ueqri opened 3 years ago

ueqri commented 3 years ago

Hello everyone,

I am opening this issue to ask for suggestions on how to resolve the incorrect results produced by warp primitives when running in gpgpu-sim.

Code snippet: The minimal reproducer comes from the official CUDA C Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-examples-broadcast.

#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)        // Note unused variable for
        value = arg;        // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);   // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();

    return 0;
}

Build environment: I used the image jonghyun1215/gpgpu:gpgpusim4 from Docker Hub, with GCC 7.5, gpgpu-sim 4.0.0 (commit 90ec33997, the latest commit on the dev branch at the time), and CUDA 10.1.

Situation: The sample code should not print any failure message, and it indeed produced the expected (silent) output when tested on a real GPU. But when run through gpgpu-sim, whether in performance simulation or functional simulation mode, the results are wrong. :confused:
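For completeness, I switched between the two modes with the mode option in my gpgpusim.config (option name from memory, so please double-check against your copy of the config):

```
# gpgpusim.config excerpt
-gpgpu_ptx_sim_mode 1   # 1 = pure functional simulation, 0 = performance simulation
```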

Investigation: I tried the other warp-shuffle samples in that tutorial, such as `__shfl_down_sync` and `__shfl_xor_sync`, and the incorrect results persist. For comparison, I also wrote a simple reduction in two ways: 1) with shared memory, and 2) with warp shuffles. The shared-memory version produces exactly the correct result, but the warp-shuffle version does not, which confused me a lot. I therefore suspect there are bugs in the implementation of warp primitives in gpgpu-sim.
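For reference, the two reduction variants I compared look roughly like this (a minimal sketch rather than my exact code; both kernels assume a single block of 32 threads summing 32 ints):

```cuda
#include <stdio.h>

// Variant 1: classic shared-memory tree reduction within one warp-sized block.
__global__ void reduce_shared(const int *in, int *out) {
    __shared__ int sdata[32];
    int tid = threadIdx.x;
    sdata[tid] = in[tid];
    __syncthreads();
    // Halve the number of active threads each step.
    for (int s = 16; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Variant 2: the same sum using warp shuffles, no shared memory.
__global__ void reduce_shfl(const int *in, int *out) {
    int val = in[threadIdx.x];
    // Each step pulls the value from the lane `offset` positions higher.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    if (threadIdx.x == 0) *out = val;  // lane 0 now holds the warp sum
}
```

On real hardware both kernels return the same sum; under gpgpu-sim only the shared-memory variant did for me.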

Possible parts: To locate the relevant part of the gpgpu-sim codebase, I searched for the shfl PTX opcode and found the implementation here: link. Not being very experienced with the gpgpu-sim code, I have been stuck at this step for a few days.

I would appreciate it sincerely if you could help me with this trouble. Thanks for your consideration! :relaxed:

ueqri commented 3 years ago

Thanks to @mkhairy for the reply to this issue :relaxed:, please see here: https://groups.google.com/g/accel-sim/c/SxtFMYrshXg/m/pTYTsZesAQAJ