yanan-sjh opened 2 months ago
Reproducer on godbolt: https://godbolt.org/z/MzGbefhnK

Dynamic stack allocation support was added to LLVM in https://github.com/llvm/llvm-project/pull/84585, but GPU-side alloca support has not been plumbed through to use it for lowering the @llvm.stacksave intrinsic yet.
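For reference, a minimal device-side pattern that triggers this (a sketch, not the reporter's code; any runtime-sized local array should do, since clang brackets VLA scopes with @llvm.stacksave / @llvm.stackrestore):

```cuda
// Hypothetical reproducer sketch: a variable-length array (a GNU extension
// clang accepts) in device code makes clang emit @llvm.stacksave /
// @llvm.stackrestore around the scope, which the NVPTX backend cannot
// lower yet.
__global__ void kernel(int n, int *out) {
  int scratch[n];  // runtime-sized alloca -> stacksave/stackrestore
  for (int i = 0; i < n; ++i)
    scratch[i] = i;
  out[threadIdx.x] = scratch[n - 1];
}
```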
Okay, I understand; thank you for your reply. By the way, I have a question: when compiling some CUDA code with clang 18, the build gets stuck at the following point and runs for a long time without producing any result. Is this normal?
15 warnings generated when compiling for sm_89.
ptxas warning : Value of threads per SM for entry _Z5entrydPdS_PimPxi is out of range. .minnctapersm will be ignored
I am using a CUDA code generator to create some CUDA programs (which are usually quite complex, but nvcc can handle them normally). However, when I compile them with clang++-18, I encounter several problems. If you're interested, I can simplify these programs and share them with you.
If you run the clang compilation with -v, you should see which stage of the compilation gets stuck. Considering that there's a ptxas warning, I suspect it's ptxas, which means there's probably not much we can do other than tweak compilation options and see whether that avoids a particular PTX pattern ptxas may be unhappy about. It's hard to tell what exactly the problem is.
That said, I'd start with the warning about .minnctapersm. I suspect something in the source code passed an out-of-bounds value to __launch_bounds__. It may or may not have anything to do with the slow compilation, but it would be good to get rid of the issue so it does not complicate things further.
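As an illustration of the bounds in question (the values here are hypothetical; the valid range for the second argument depends on how many blocks the target SM can actually schedule):

```cuda
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// The second argument is emitted as .minnctapersm in PTX; if it exceeds
// what the target SM can schedule, ptxas warns and ignores it -- the
// warning quoted above.
__global__ void __launch_bounds__(256, 4) entry(double *p) {
  p[blockIdx.x * blockDim.x + threadIdx.x] = 0.0;
}
```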
@Artem-B I have followed your advice and adjusted the __launch_bounds__ parameters. The ptxas warning has disappeared, but the compilation time for my CUDA program remains quite long. Could you please explain what factors influence the duration of the ptxas phase? Thank you.
15 warnings generated when compiling for host.
"/usr/bin/ld" -z relro --hash-style=gnu --build-id --eh-frame-hdr -m elf_x86_64 -pie -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o clang.bin /usr/lib/x86_64-linux-gnu/Scrt1.o /usr/lib/x86_64-linux-gnu/crti.o /usr/bin/../lib/gcc/x86_64-linux-gnu/13/crtbeginS.o -L/usr/local/cuda/lib64 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/13 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/lib -L/usr/lib /tmp/host-64c66e.o /tmp/kernel-05203d.o -lcudart_static -ldl -lrt -lstdc++ -lm -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/bin/../lib/gcc/x86_64-linux-gnu/13/crtendS.o /usr/lib/x86_64-linux-gnu/crtn.o
clang++-18 -v host.cu kernel.cu -o clang.bin --cuda-gpu-arch=sm_89 -ldl -lr 364.43s user 0.91s system 99% cpu 6:05.46 total
> Could you please explain what factors influence the duration of the ptxas phase?
I have as much visibility into ptxas as everybody else outside of NVIDIA -- none. I can't even give you a good guess, never mind explain what exactly slows ptxas down. It's known to happen now and then, but so far pretty much all of the cases I did happen to look at closely each had their own unique root cause. Sometimes it was specific to the ptxas version (try different CUDA versions?), sometimes it didn't like the loop structures clang generated (tweaking some LLVM parameters helped in that particular case), and sometimes the user tried to compile code with an incredibly large number of small functions that ended up in PTX (putting them into an anonymous namespace allowed them to be eliminated before they made it to PTX).

Obviously, the absolute size of the PTX input would be a factor, but other than that I cannot tell what may be happening in your case.
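To illustrate the anonymous-namespace point above (a sketch, not the reporter's actual code): giving helper device functions internal linkage lets LLVM inline or drop them before they reach PTX.

```cuda
// Helpers with external linkage have to be kept in the emitted PTX.
// Wrapping them in an anonymous namespace gives them internal linkage,
// so unused or fully-inlined ones can be eliminated before ptxas ever
// sees them, shrinking the PTX that ptxas has to chew through.
namespace {
__device__ double helper(double x) { return x * x + 1.0; }
}  // namespace

__global__ void kernel(double *p, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) p[i] = helper(p[i]);
}
```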
Thank you very much for your response. I will try to identify the cause of the issue by reducing the length of the code.
I encountered an error while compiling CUDA code using clang++-18. The error message is as follows:

The kernel code is as follows; it compiles successfully with nvcc.

The CUDA version is 12.1. The program consists of two parts, kernel.cu and host.cu, and it is compiled using the following command. All files are attached.