CUDA on Windows: COMDAT folding may cause wrong kernel to be launched #88883

Closed mkuron closed 2 months ago

mkuron commented 3 months ago

Summary

CUDA's kernel launch mechanism requires each kernel's device stub function to have a unique address. When targeting Windows, the linker defaults to performing identical COMDAT folding (ICF). Since the device stubs tend to be identical when the kernel arguments are the same, this deduplication can lead to non-unique addresses. As a result, attempting to launch any of the kernels whose device stubs were folded into one will always launch the first of those kernels.
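To make the failure mode concrete, here is a rough sketch of the host-side launch path (the stub name and body are illustrative, not the exact code Clang generates): the stub's own address identifies the kernel to the runtime, so once ICF folds two byte-identical stubs, every launch through either of them resolves to whichever kernel was registered first.

#include <cuda_runtime.h>

// Hedged sketch, not the actual compiler-generated stub.
void __device_stub__kernelA(int n) {
    void *args[] = { &n };
    // The address of this stub is what was registered for kernelA and is
    // what identifies the kernel to launch. If ICF merges this stub with
    // another byte-identical one, both kernels launch through the same
    // address and the runtime picks the first registered kernel.
    cudaLaunchKernel((const void *)&__device_stub__kernelA,
                     dim3(1), dim3(1), args, 0, /*stream=*/nullptr);
}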

Version history

@kpyzhov originally fixed this problem in 15e678e8438c56703c775fcb45afb25e09b31cff (Clang 9.0) by removing the COMDAT attribute from kernels and device stubs. @yxsamliu then introduced a regression in 80072fde61d40a4e8a9da673476730d34a483fa2 (Clang 14.0) that allows COMDAT attributes on device stubs again.

Minimal working example

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

template<typename F>
__global__
void forEach(F f, int n) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) {
        f(i);
    }
}

void check(cudaError_t err, unsigned line) {
    if (err != cudaSuccess) {
        printf("CUDA error in line %d: %s\n", line, cudaGetErrorName(err));
        exit(1);
    }
}
#define CHECK(err) check(err, __LINE__)

struct Test {
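  // bad() is not called when the binary is run without arguments; it only
  // exists to instantiate a second forEach whose device stub is identical
  // to the one used by good().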
  void bad() {
    float *data = nullptr;

    forEach<<<1,1>>>([data] __host__ __device__ (int i) {
#ifdef __CUDA_ARCH__
        __trap(); 
#endif
    }, 1);
  }

  void good() {
    float *data;
    CHECK(cudaMalloc(reinterpret_cast<void**>(&data), sizeof(float)));
    CHECK(cudaMemset(data, 0xff, sizeof(float)));

    forEach<<<1,1>>>([data] __host__ __device__ (int i) {
        data[i] = 1.0f;
    }, 1);
    CHECK(cudaDeviceSynchronize());

    float host;
    CHECK(cudaMemcpy(&host, data, sizeof(float), cudaMemcpyDeviceToHost));
    if (host == 1.0f) {
        printf("success\n");
    } else {
        printf("failure\n");
    }

    CHECK(cudaFree(data));
  }
};

int main(int argc, char** argv)
{
    Test t;
    if (argc > 10) {
        t.bad();
    } else {
        t.good();
    }
    return 0;
}

Reproduction

Compile the above code with clang.exe --target=x86_64-pc-windows-msvc -std=c++17 -O3 --cuda-gpu-arch=sm_75 -lcudart -fuse-ld=lld-link.exe -o test.exe test.cu and then execute the binary without any command-line arguments. Observe a cudaErrorLaunchFailure coming from cudaDeviceSynchronize(). The issue can also be observed without -fuse-ld=lld-link.exe.

Now, add -Wl,/OPT:NOICF and observe that the error goes away. This is a fine workaround for the present bug, but obviously ICF may be desirable outside of CUDA device stubs.

For comparison, compile with NVCC (nvcc.exe -std=c++17 -O3 -gencode arch=compute_75,code=sm_75 -cudart shared -extended-lambda -o test_nvcc.exe test.cu) and observe that it runs successfully.

I've also posted the code to Godbolt: https://cuda.godbolt.org/z/15s87KPz6. There you can see that Clang trunk and 8.0 generate the device stub as define linkonce_odr dso_local ... comdat, while Clang 11.0 generates it as define linkonce_odr dso_local ....

yxsamliu commented 3 months ago

Thanks. I will take a look.

mkuron commented 3 months ago

I just came across #57178, which is the same problem. However, that report did not recognize that this is Clang's fault and not the linker's (because Clang can choose not to generate comdat attributes on device stubs), that it's a regression introduced in Clang 14, or that the issue does not occur with NVCC. So I think this is a valid bug report even though the other issue got closed.

yxsamliu commented 3 months ago

It seems the issue is CUDA specific since the same test ported to HIP passes.

I can disable comdat for CUDA only; however, this may cause duplicate symbols when the same template instantiation happens in different TUs in -fgpu-rdc mode. At least, I think we should disable it for CUDA in the default -fno-gpu-rdc case, since that should not cause other issues. @Artem-B @rnk
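For illustration, a rough sketch of that scenario (file and kernel names are hypothetical): the same kernel template is instantiated in two translation units, so each object file carries its own copy of the stub, and the comdat is what lets the linker deduplicate them instead of reporting duplicate symbols.

// a.cu (hypothetical TU 1)
template<typename T> __global__ void kernel(T x) {}
void useA() { kernel<<<1, 1>>>(1); }  // instantiates kernel<int> and its stub

// b.cu (hypothetical TU 2)
template<typename T> __global__ void kernel(T x) {}
void useB() { kernel<<<1, 1>>>(2); }  // instantiates kernel<int> and its stub again

// With comdat, the linker keeps a single copy of the identical stubs; without
// it, linking a.o and b.o may fail with duplicate-symbol errors.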

mkuron commented 3 months ago

It seems the issue is CUDA specific since the same test ported to HIP passes.

Looking at https://cuda.godbolt.org/z/jx59oeYTh, it seems like HIP also has comdat attributes on its device stubs and the assembly for the stubs for the two kernels looks identical, so I think HIP should be susceptible to the same issue.

I can disable comdat for CUDA only, however, this may cause duplicate symbols when the same template instantiation happens in different TU's in -fgpu-rdc mode.

Good point, though I would still prefer a solution that also applies to -fgpu-rdc mode. Is there a way to restrict COMDAT folding to only symbols with identical names? How does NVCC solve this problem?

Artem-B commented 3 months ago

I just came across #57178, which is the same problem. However, that ticket did not realize that this is Clang's fault and not the linker's (because Clang can choose not to generate comdat attributes on device stubs),

Can you elaborate? clang does not seem to generate comdat on kernel stubs for CUDA: https://cuda.godbolt.org/z/dWqe65rK4

ICF issues on Windows, IIRC, were caused by the stubs' code for different kernels being identical, which the Microsoft linker then decided to fold into one function.

mkuron commented 3 months ago

Can you elaborate? clang does not seem to generate comdat on kernel stubs for CUDA

It doesn't always do so. My sample, https://cuda.godbolt.org/z/15s87KPz6, has define linkonce_odr dso_local ... comdat, while yours has define dso_local .... If I add static to the beginning of line 6 of my sample, I get define internal ... (without comdat).

Artem-B commented 3 months ago

For linking & windows we may need @rnk and @MaskRay expertise here.

rnk commented 3 months ago

Frankly, CUDA is just incompatible with ICF. Clang is putting comdats in the right place according to normal C++ language rules. Also, doesn't ICF fire even on non-comdat functions if /Gy / -ffunction-sections is used, which normally happens via /O2?

Feel free to go back to removing comdat from CUDA kernel wrappers, but you're just trading one kind of buggy behavior (ICF merged kernel stubs) for another kind of buggy behavior (duplicate symbol errors from vague linkage template instantiations). A more complete solution would be to incorporate an ICF-blocking device into every kernel stub, like incrementing a static volatile local variable or something.

Artem-B commented 3 months ago

incorporate an ICF-blocking device into every kernel stub

This should be doable. I'm curious what NVIDIA does. They should have the same problem.

yxsamliu commented 3 months ago

It seems the issue is CUDA specific since the same test ported to HIP passes.

Looking at https://cuda.godbolt.org/z/jx59oeYTh, it seems like HIP also has comdat attributes on its device stubs and the assembly for the stubs for the two kernels looks identical, so I think HIP should be susceptible to the same issue.

I debugged the HIP program with MSVC and I can see that the two kernel stub functions have different addresses:

__device_stub__forEach<`Test::good'::`1'::<lambda_1> > (07FF62EAC322Eh)

__device_stub__forEach<`Test::bad'::`1'::<lambda_1> > (07FF62EAC3102h)

Also, HIP registers the kernels with kernel symbol variables instead of kernel stub functions, and these variables also have different addresses for the two kernels:

forEach<`Test::good'::`1'::<lambda_1> > (07FF62EB43068h)

forEach<`Test::bad'::`1'::<lambda_1> > (07FF62EB43058h)

Basically, when a kernel is launched, its stub function is called, which passes its kernel symbol to hipLaunchKernel; that symbol is then used to find the real kernel address in the device executable. Since both the kernel stub functions and the kernel symbols are distinct, HIP is able to find the correct kernel address in the device executable.

Both the kernel stub functions and the kernel symbol variables are in comdat. The mangled names are different for the two kernels. I did not see the linker merge them.
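Roughly, the pattern looks like this (identifiers are illustrative, not the exact compiler-generated names):

#include <hip/hip_runtime.h>

char forEach_good_symbol;  // mutable kernel symbol variable, registered with the HIP runtime

void __device_stub__forEach_good(int n) {
    void *args[] = { &n };
    // Each stub references its own kernel symbol, so no two stub bodies are
    // byte-identical, and the symbol's address is what the runtime uses to
    // look up the real device kernel.
    hipLaunchKernel(&forEach_good_symbol, dim3(1), dim3(1), args, 0, /*stream=*/nullptr);
}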

yxsamliu commented 3 months ago

I just realized that ICF merges symbols by content, not by name. I guess they are not merged for HIP because their contents differ, whereas for CUDA the kernel stubs may be identical.

yxsamliu commented 3 months ago

incorporate an ICF-blocking device into every kernel stub

This should be doable. I'm curious what NVIDIA does. They should have the same problem.

To prevent ICF from merging two kernel stubs, we need to make them different and make sure the difference won't be optimized out. One thing I can think of is to create a global variable containing the mangled name of the kernel and write to that variable. However, this will have a slight overhead.

rnk commented 3 months ago

Also, since HIP registers the kernels with kernel symbol variables instead of kernel stub functions, which also have different addresses for the two different kernels

Yes, exactly. ICF doesn't apply to mutable data; it only applies to read-only data and code, so using kernel symbol variables would make a big difference. As you suggest, you can achieve the same effect by having the CUDA kernel stubs touch some mutable variables, and that will block ICF. Inserting code equivalent to static volatile int block_icf; ++block_icf; should be sufficient.
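A minimal sketch of what that could look like in the emitted stubs (stub names are illustrative; the launch code is elided):

void __device_stub__kernelA(int n) {
    static volatile int block_icf;  // distinct object per stub
    ++block_icf;                    // volatile access can't be optimized out
    // ... existing launch code for kernelA ...
}

void __device_stub__kernelB(int n) {
    static volatile int block_icf;  // a different object than kernelA's
    ++block_icf;
    // ... existing launch code for kernelB ...
}

// The two bodies now reference different mutable variables, so they are not
// byte-identical and ICF will not fold them.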

Artem-B commented 3 months ago

Would it be sufficient to add some unique bytes (e.g. the function name) in the body of the function (i.e. in .text), after the return, as inline asm?

yxsamliu commented 3 months ago

Would it be sufficient to add some unique bytes (e.g. function name) in the body of the function (i.e. in .text) after return as an inline asm?

After the return probably won't work, but jumping over the bytes does work: https://godbolt.org/z/6es59shv1

However, inline asm is target-specific.
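
For reference, a rough sketch of that approach on the host side (x86-64, GNU inline-asm syntax; the stub name and the embedded string are illustrative):

void __device_stub__kernelA(int n) {
    // Embed unique bytes (e.g. this stub's mangled name) in .text and jump
    // over them, so no two stub bodies are byte-identical.
    asm volatile(
        "jmp 1f\n\t"
        ".ascii \"_Z7kernelAi\"\n\t"
        "1:");
    // ... existing launch code ...
}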
