Closed mkuron closed 2 months ago
Thanks. I will take a look.
I just came across #57178, which is the same problem. However, that ticket did not realize that this is Clang's fault and not the linker's (because Clang can choose not to generate comdat
attributes on device stubs), that it's a regression introduced in Clang 14, or that the issues does not occur with NVCC. So I think it's a valid bug report even though the other issue got closed.
It seems the issue is CUDA specific since the same test ported to HIP passes.
I can disable comdat for CUDA only, however, this may cause duplicate symbols when the same template instantiation happens in different TU's in -fgpu-rdc mode. At least, I think we should disable for CUDA for the default -fno-gpu-rdc case since it should not cause other issues. @Artem-B @rnk
It seems the issue is CUDA specific since the same test ported to HIP passes.
Looking at https://cuda.godbolt.org/z/jx59oeYTh, it seems like HIP also has comdat
attributes on its device stubs and the assembly for the stubs for the two kernels looks identical, so I think HIP should be susceptible to the same issue.
I can disable comdat for CUDA only, however, this may cause duplicate symbols when the same template instantiation happens in different TU's in -fgpu-rdc mode.
Good point, though I would still prefer a solution that also applies to -fgpu-rdc
mode. Is there a way to restrict COMDAT folding to only symbols with identical names? How does NVCC solve this problem?
I just came across #57178, which is the same problem. However, that ticket did not realize that this is Clang's fault and not the linker's (because Clang can choose not to generate
comdat
attributes on device stubs),
Can you elaborate? clang does not seem to generate comdat on kernel stubs for CUDA: https://cuda.godbolt.org/z/dWqe65rK4
ICF issues on windows, IIRC, were caused by the stubs' code for different kernels being identical which microsoft linker decided to fold into one function.
Can you elaborate? clang does not seem to generate comdat on kernel stubs for CUDA
It doesn't always do. My sample, https://cuda.godbolt.org/z/15s87KPz6 has define linkonce_odr dso_local ... comdat
, yours has define dso_local ...
. If I add static
to the beginning of line 6 of my sample, I get define internal ...
(without comdat
).
For linking & windows we may need @rnk and @MaskRay expertise here.
Frankly, CUDA is just incompatible with ICF. Clang is putting comdats in the right place according to normal C++ language rules. Also, doesn't ICF fire even on non-comdat functions if /Gy
/ -ffunction-sections
is used, which normally happens via /O2
?
Feel free to go back to removing comdat from CUDA kernel wrappers, but you're just trading one kind of buggy behavior (ICF merged kernel stubs) for another kind of buggy behavior (duplicate symbol errors from vague linkage template instantiations). A more complete solution would be to incorporate an ICF-blocking device into every kernel stub, like incrementing a static volatile local variable or something.
incorporate an ICF-blocking device into every kernel stub
This should be doable. I'm curious what NVIDIA does. They should have the same problem.
It seems the issue is CUDA specific since the same test ported to HIP passes.
Looking at https://cuda.godbolt.org/z/jx59oeYTh, it seems like HIP also has
comdat
attributes on its device stubs and the assembly for the stubs for the two kernels looks identical, so I think HIP should be susceptible to the same issue.
I debugged the HIP program with MSVC and I can see the two kernel stub functions have different address:
__device_stub__forEach<`Test::good'::`1'::<lambda_1> > (07FF62EAC322Eh)
__device_stub__forEach<`Test::bad'::`1'::<lambda_1> > (07FF62EAC3102h)
Also, since HIP registers the kernels with kernel symbol variables instead of kernel stub functions, which also have different addresses for the two different kernels:
forEach<`Test::good'::`1'::<lambda_1> > (07FF62EB43068h)
forEach<`Test::bad'::`1'::<lambda_1> > (07FF62EB43058h)
Basically, when a kernel is launched, its stub function is called, in which its kernel symbol is passed to hipLaunchKernel, which is used to find the real kernel address in device executable. Since both kernel stub functions and kernel symbols are different, HIP is able to find the correct kernel address in device excutable.
both kernel stub functions and kernel symbol variables are in comdat. The mangled names are different for the two kernels. I did not see the linker merge them.
It seems the issue is CUDA specific since the same test ported to HIP passes.
Looking at https://cuda.godbolt.org/z/jx59oeYTh, it seems like HIP also has
comdat
attributes on its device stubs and the assembly for the stubs for the two kernels looks identical, so I think HIP should be susceptible to the same issue.I debugged the HIP program with MSVC and I can see the two kernel stub functions have different address:
__device_stub__forEach<`Test::good'::`1'::<lambda_1> > (07FF62EAC322Eh) __device_stub__forEach<`Test::bad'::`1'::<lambda_1> > (07FF62EAC3102h)
Also, since HIP registers the kernels with kernel symbol variables instead of kernel stub functions, which also have different addresses for the two different kernels:
forEach<`Test::good'::`1'::<lambda_1> > (07FF62EB43068h) forEach<`Test::bad'::`1'::<lambda_1> > (07FF62EB43058h)
Basically, when a kernel is launched, its stub function is called, in which its kernel symbol is passed to hipLaunchKernel, which is used to find the real kernel address in device executable. Since both kernel stub functions and kernel symbols are different, HIP is able to find the correct kernel address in device excutable.
both kernel stub functions and kernel symbol variables are in comdat. The mangled names are different for the two kernels. I did not see the linker merge them.
I just realized ICF merges symbols by content, not by name. I guess they are not merged for HIP because they are different for HIP. Whereas for CUDA the kernel stub may be the same.
incorporate an ICF-blocking device into every kernel stub
This should be doable. I'm curious what NVIDIA does. They should have the same problem.
To prevent ICF merge two kernel stubs, we need to make them different and make sure the difference won't be optimized out. One thing I can think of, is to create a global variable containing the mangled name of the kernel, and write to that variable. However, this will have slight overhead.
Also, since HIP registers the kernels with kernel symbol variables instead of kernel stub functions, which also have different addresses for the two different kernels
Yes, exactly, ICF doesn't apply to mutable data, it only applies to readonly data and code, so using kernel stub variables would make a big difference. As you suggest, you can achieve the same effect by having the CUDA kernel stubs touch some mutable variables, and that will block ICF. Inserting code equivalent to static volatile int block_icf; ++block_icf;
should be sufficient.
Would it be sufficient to add some unique bytes (e.g. function name) in the body of the function (i.e. in .text) after return as an inline asm?
Would it be sufficient to add some unique bytes (e.g. function name) in the body of the function (i.e. in .text) after return as an inline asm?
After return probably won't work, but jump over the bytes works https://godbolt.org/z/6es59shv1
However, inline asm is target-specific.
@llvm/issue-subscribers-clang-codegen
Author: Michael Kuron (mkuron)
Summary
CUDA's kernel launch mechanism requires each kernel's device stub function to have a unique address. When targeting Windows, the linker defaults to performing identical COMDAT folding (ICF). Since the device stubs tend to be identical when the kernel arguments are the same, this deduplication can lead to non-unique addresses. In the end, attempting to invoke any of the kernels whose device stubs were folded into one will always invoke the first of these kernels.
Version history
@kpyzhov originally fixed this problem in 15e678e8438c56703c775fcb45afb25e09b31cff (Clang 9.0) by removing the COMDAT attribute from kernels and device stubs. @yxsamliu then introduced a regression in 80072fde61d40a4e8a9da673476730d34a483fa2 (Clang 14.0) that reallows COMDAT attributes on device stubs.
Minimal working example
Reproduction
Compile the above code with
clang.exe --target=x86_64-pc-windows-msvc -std=c++17 -O3 --cuda-gpu-arch=sm_75 -lcudart -fuse-ld=lld-link.exe -o test.exe test.cu
and then execute the binary without any command-line arguments. Observe acudaErrorLaunchFailure
coming fromcudaDeviceSynchronize()
. The issue can also be observed without-fuse-ld=lld-link.exe
.Now, add
-Wl,/OPT:NOICF
and observe that the error goes away. This is a fine workaround for the present bug, but obviously ICF may be desirable outside of CUDA device stubs.For comparison, compile with NVCC (
nvcc.exe -std=c++17 -O3 -gencode arch=compute_75,code=sm_75 -cudart shared -extended-lambda -o test_nvcc.exe test.cu
) and observe that it runs successfully.I've also posted the code to Godbolt: https://cuda.godbolt.org/z/15s87KPz6. Here you can see that Clang trunk and 8.0 generates the device stub as
define linkonce_odr dso_local ... comdat
, while Clang 11.0 generates it asdefine linkonce_odr dso_local ...
.