llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

An error occurred in the backend when compiling CUDA code with clang18 #97898

Open yanan-sjh opened 2 months ago

yanan-sjh commented 2 months ago

I encountered an error while compiling CUDA code using clang++-18. The error message is as follows:

fatal error: error in backend: Cannot select: 0x55722baf9d20: i64,ch = stacksave 0x55722baf9cb0
In function: _Z5entrydPdS_PimP4int2i
clang++-18: error: clang frontend command failed with exit code 70 (use -v to see invocation)
Ubuntu clang version 18.1.0 (++20240220094926+390dcd4cbbf5-1~exp1~20240220214944.50)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
clang++-18: note: diagnostic msg: 
********************

PLEASE ATTACH THE FOLLOWING FILES TO THE BUG REPORT:
Preprocessed source(s) and associated run script(s) are located at:
clang++-18: note: diagnostic msg: /tmp/host-b24319.cu
clang++-18: note: diagnostic msg: /tmp/host-sm_75-8d8351.cu
clang++-18: note: diagnostic msg: /tmp/host-b24319.sh
clang++-18: note: diagnostic msg: 

The kernel code is as follows; the same code compiles successfully with nvcc.

#define MAX_THREAD_PER_BLOCK 899
#define MIN_BLOCK_PER_MULTIPROCESSOR 8
static const int SIZE = 2448;
static const int FN = 5;

__host__ void host_callee(ulonglong* param_15, ulonglong* param_17) { }
__device__ __forceinline__ void func_53(const int deep) {
    if ((deep) > (1)) {
    }
    const uint len1 = 1U;
    uchar* p1 = (uchar*)alloca(sizeof(uchar) * (len1));
}

__global__ __launch_bounds__(MAX_THREAD_PER_BLOCK, MIN_BLOCK_PER_MULTIPROCESSOR) void entry(const double input, double* result, double* oval, int* fence, ulong fptr, int2* ldst, const int deep) {
    func_53((1));
}

The CUDA version is 12.1. The program consists of two parts, kernel.cu and host.cu, and it is compiled using the following command:

clang++-18 -v -x cuda -I/usr/local/cuda-12.1/lib64 -I/usr/local/cuda-12.1/targets/x86_64-linux/include/ --cuda-path=/usr/local/cuda-12.1/ -c host.cu --cuda-gpu-arch=sm_75

All files are attached.

[Uploading fatal_error.zip…]()

Artem-B commented 2 months ago

Reproducer on godbolt. https://godbolt.org/z/MzGbefhnK

Dynamic stack allocation support was added to LLVM in https://github.com/llvm/llvm-project/pull/84585, but GPU-side alloca support has not yet been plumbed through to lower the @llvm.stacksave intrinsic.
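As a side note, a possible workaround sketch (my own suggestion, not part of the fix Artem-B describes): when the allocation size is a compile-time constant, as `len1` is in the reproducer, a fixed-size local array avoids `alloca` entirely, so no `@llvm.stacksave` is emitted for the NVPTX backend to choke on.

```cuda
// Hypothetical rewrite of the reproducer's helper: a fixed-size local
// array uses static stack space, so the frontend never emits the
// @llvm.stacksave intrinsic that the NVPTX backend cannot select.
// This only applies when the size is known at compile time.
__device__ __forceinline__ void func_53_fixed(const int deep) {
    if (deep > 1) {
    }
    constexpr unsigned len1 = 1U;
    unsigned char p1[len1];  // static allocation instead of alloca()
    (void)p1;                // silence unused-variable warnings
}
```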

yanan-sjh commented 2 months ago

> Reproducer on godbolt. https://godbolt.org/z/MzGbefhnK
>
> Dynamic stack allocation support has been added in LLVM in #84585 but GPU-side alloca support has not been plumbed through to use it for lowering @llvm.stacksave intrinsic yet.

Okay, I understand, thank you for your reply. By the way, I have a question: when compiling some CUDA code with clang 18, the compilation gets stuck at the following point and runs for a long time without producing any result. Is this normal?

15 warnings generated when compiling for sm_89.
ptxas warning : Value of threads per SM for entry _Z5entrydPdS_PimPxi is out of range. .minnctapersm will be ignored

I am using a CUDA code generator to create some CUDA programs (which are usually quite complex, but nvcc can handle them normally). However, when I compile them with clang++-18, I encounter several problems. If you're interested, I can simplify these programs and share them with you.

Artem-B commented 2 months ago

If you run the clang compilation with -v, you should see which stage of the compilation gets stuck. Given that there's a ptxas warning, I suspect it's ptxas, which means there's probably not much we can do other than tweak compilation options and see whether that avoids the particular PTX pattern ptxas is unhappy about. It's hard to tell what exactly the problem is.

That said, I'd start with the warning about .minnctapersm. I suspect something in the source code passed an out-of-range value to __launch_bounds__. It may or may not have anything to do with the slow compilation, but it would be good to get rid of the issue so it does not complicate things further.
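For context on why the warning fires (my reading of the ptxas message, with hardware limits assumed from NVIDIA's compute-capability tables): the second `__launch_bounds__` argument promises a minimum number of resident blocks per SM, which is only satisfiable when threads-per-block times blocks-per-SM fits under the SM's resident-thread limit (1024 on sm_75, 1536 on sm_89). The reproducer's 899 × 8 = 7192 is far out of range, so `.minnctapersm` is ignored. A hypothetical compile-time guard, with example replacement values:

```cuda
// Hypothetical sanity check: keep threads/block * min blocks/SM within
// the resident-thread limit so ptxas honors .minnctapersm.
constexpr int kMaxThreadsPerSM = 1536;  // sm_89; sm_75 allows 1024
constexpr int kThreadsPerBlock = 128;   // example replacement values
constexpr int kMinBlocksPerSM  = 8;
static_assert(kThreadsPerBlock * kMinBlocksPerSM <= kMaxThreadsPerSM,
              "launch bounds exceed the resident-thread limit; "
              ".minnctapersm would be ignored");

__global__ __launch_bounds__(kThreadsPerBlock, kMinBlocksPerSM)
void entry_fixed() { }
```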

yanan-sjh commented 2 months ago

@Artem-B I have followed your advice and adjusted the __launch_bounds__ parameters. The ptxas warning has disappeared, but the compilation time for my CUDA program remains quite long. Could you please explain what factors influence the duration of the ptxas phase? Thank you.

15 warnings generated when compiling for host.
 "/usr/bin/ld" -z relro --hash-style=gnu --build-id --eh-frame-hdr -m elf_x86_64 -pie -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o clang.bin /usr/lib/x86_64-linux-gnu/Scrt1.o /usr/lib/x86_64-linux-gnu/crti.o /usr/bin/../lib/gcc/x86_64-linux-gnu/13/crtbeginS.o -L/usr/local/cuda/lib64 -L/usr/bin/../lib/gcc/x86_64-linux-gnu/13 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/lib -L/usr/lib /tmp/host-64c66e.o /tmp/kernel-05203d.o -lcudart_static -ldl -lrt -lstdc++ -lm -lgcc_s -lgcc -lpthread -lc -lgcc_s -lgcc /usr/bin/../lib/gcc/x86_64-linux-gnu/13/crtendS.o /usr/lib/x86_64-linux-gnu/crtn.o
clang++-18 -v host.cu kernel.cu -o clang.bin --cuda-gpu-arch=sm_89   -ldl -lr  364.43s user 0.91s system 99% cpu 6:05.46 total
Artem-B commented 1 month ago

> Could you please explain what factors influence the duration of the ptxas phase

I have as much visibility into ptxas as everybody else outside of NVIDIA: none. I can't even give you a good guess, never mind explain what exactly slows ptxas down. It's known to happen now and then, but in pretty much all of the cases I have looked at closely, each had its own unique root cause. Sometimes it was specific to the ptxas version (try different CUDA versions?); sometimes ptxas didn't like the loop structures clang generated (tweaking some LLVM parameters helped in that particular case); sometimes the user compiled code with an incredibly large number of small functions that ended up in PTX (putting them into an anonymous namespace allowed them to be eliminated before they made it to PTX).

Obviously, the absolute size of the PTX input would be a factor, but other than that I cannot tell what may be happening in your case.
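To illustrate the anonymous-namespace point above, a minimal sketch (helper names hypothetical): device helpers used only within one translation unit get internal linkage when placed in an anonymous namespace, so the compiler can inline and eliminate them before they are emitted into PTX and handed to ptxas.

```cuda
// Hypothetical sketch: internal-linkage device helpers can be inlined
// away, so no standalone PTX definitions reach ptxas.
namespace {
__device__ int helper_mul(int a, int b) { return a * b; }
__device__ int helper_add(int a, int b) { return a + b; }
}  // namespace

__global__ void kernel(int* out) {
    // After inlining, neither helper needs its own PTX function body.
    out[0] = helper_add(helper_mul(2, 3), 1);
}
```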

yanan-sjh commented 1 month ago

Thank you very much for your response. I will try to identify the cause of the issue by reducing the length of the code.