Crash when trying to compile `ggml-cuda.cu` from llama.cpp

sin-ack commented 8 months ago

Backtrace:

Call parameter type does not match function signature!
  %StackGuardSlot = alloca ptr, align 8, addrspace(5)
 ptr  call void @llvm.stackprotector(ptr %8, ptr addrspace(5) %StackGuardSlot)
in function _ZL13mul_mat_vec_qILi4ELi32ELi4E10block_q4_0Li2EXadL_ZL17vec_dot_q4_0_q8_1PKvPK10block_q8_1RKiEEEvS2_S2_Pfiiii
LLVM ERROR: Broken function found, compilation aborted!
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: llc ggml-cuda.cu.ll
1.  Running pass 'CallGraph Pass Manager' on module 'ggml-cuda.cu.ll'.
2.  Running pass 'Module Verifier' on function '@_ZL13mul_mat_vec_qILi4ELi32ELi4E10block_q4_0Li2EXadL_ZL17vec_dot_q4_0_q8_1PKvPK10block_q8_1RKiEEEvS2_S2_Pfiiii'
 #0 0x00007f92af85a06e llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xc5a06e)
 #1 0x00007f92af857a2b llvm::sys::RunSignalHandlers() (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xc57a2b)
 #2 0x00007f92af857ba6 (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xc57ba6)
 #3 0x00007f92ae675dc0 (/lib64/libc.so.6+0x3ddc0)
 #4 0x00007f92ae6c5d9c __pthread_kill_implementation /var/tmp/portage/sys-libs/glibc-2.38-r10/work/glibc-2.38/nptl/pthread_kill.c:44:76
 #5 0x00007f92ae675d12 gsignal /var/tmp/portage/sys-libs/glibc-2.38-r10/work/glibc-2.38/signal/../sysdeps/posix/raise.c:27:6
 #6 0x00007f92ae65e4ed abort /var/tmp/portage/sys-libs/glibc-2.38-r10/work/glibc-2.38/stdlib/abort.c:81:7
 #7 0x00007f92af40ffb7 (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0x80ffb7)
 #8 0x00007f92af7921ca (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xb921ca)
 #9 0x00007f92afa55763 (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xe55763)
#10 0x00007f92af9bc3bc llvm::FPPassManager::runOnFunction(llvm::Function&) (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xdbc3bc)
#11 0x00007f92b0f79ca9 (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0x2379ca9)
#12 0x00007f92af9bcd51 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/usr/lib/llvm/17/bin/../lib64/libLLVM-17.so+0xdbcd51)
#13 0x0000560f4c6f11e8 (/usr/lib/llvm/17/bin/llc+0x1b1e8)
#14 0x0000560f4c6e6114 main (/usr/lib/llvm/17/bin/llc+0x10114)
#15 0x00007f92ae65feea __libc_start_call_main /var/tmp/portage/sys-libs/glibc-2.38-r10/work/glibc-2.38/csu/../sysdeps/nptl/libc_start_call_main.h:74:3
#16 0x00007f92ae65ffa5 call_init /var/tmp/portage/sys-libs/glibc-2.38-r10/work/glibc-2.38/csu/../csu/libc-start.c:128:20
#17 0x00007f92ae65ffa5 __libc_start_main /var/tmp/portage/sys-libs/glibc-2.38-r10/work/glibc-2.38/csu/../csu/libc-start.c:347:5
#18 0x0000560f4c6e64e1 _start (/usr/lib/llvm/17/bin/llc+0x104e1)
[1]    15924 IOT instruction  llc ggml-cuda.cu.ll

LLVM IR file: ggml-cuda.cu.ll.gz

The IR was generated using Clang 17.0.6 and hipBLAS 5.7.1, from ggml-cuda.cu in https://github.com/ggerganov/llama.cpp/commit/67be2ce1015d070b3b2cd488bcb041eefb61de72

Command used to generate the IR

`/usr/lib/llvm/17/bin/clang++ -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DGGML_USE_HIPBLAS -DK_QUANTS_PER_ITERATION=2 -D_GNU_SOURCE -D_XOPEN_SOURCE=600 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -I/labs/llama.cpp/. -isystem /usr/include/rocblas --rocm-device-lib-path=/usr/lib/amdgcn/bitcode/ -O3 -DNDEBUG -std=gnu++11 -Wmissing-declarations -Wmissing-noreturn -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -march=native -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -x hip -MD -MT CMakeFiles/ggml.dir/ggml-cuda.cu.o -o ggml-cuda.cu.ll -S /labs/llama.cpp/ggml-cuda.cu -emit-llvm`

clang --version

```
clang version 17.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm/17/bin
Configuration file: /etc/clang/x86_64-pc-linux-gnu-clang.cfg
```

sin-ack commented 8 months ago

Reduced to:

target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-p7:160:256:256:32-p8:128:128-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7:8"
target triple = "amdgcn-amd-amdhsa"

; Function Attrs: sspstrong
define amdgpu_kernel void @_ZL13mul_mat_vec_qILi4ELi32ELi4E10block_q4_0Li2EXadL_ZL17vec_dot_q4_0_q8_1PKvPK10block_q8_1RKiEEEvS2_S2_Pfiiii() #0 {
  %1 = alloca [4 x [2 x float]], i32 0, align 16, addrspace(5)
  call void @llvm.memset.p5.i64(ptr addrspace(5) %1, i8 0, i64 0, i1 false)
  ret void
}

; Function Attrs: nocallback nofree nounwind willreturn memory(argmem: write)
declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg) #1

attributes #0 = { sspstrong }
attributes #1 = { nocallback nofree nounwind willreturn memory(argmem: write) }
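
The reduced case suggests the trigger is the `sspstrong` attribute on an `amdgcn` kernel. As a rough sketch (the helper name `find_ssp_attrs` and the temporary sample file are made up for illustration), one way to check whether a device IR file carries a stack-protector attribute group is:

```shell
# Hypothetical helper: list attribute groups in an LLVM IR file that request a
# stack protector. ssp, sspreq and sspstrong are the LLVM function attributes
# for stack protection.
find_ssp_attrs() {
    grep -nE '^attributes #[0-9]+ = \{.*[[:space:]](ssp|sspreq|sspstrong)[[:space:]]' "$1"
}

# Sample input copied from the reduced test case; point the helper at
# ggml-cuda.cu.ll to check the real file instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
target triple = "amdgcn-amd-amdhsa"
attributes #0 = { sspstrong }
attributes #1 = { nocallback nofree nounwind willreturn memory(argmem: write) }
EOF

find_ssp_attrs "$sample"
```

A match on a module whose target triple is `amdgcn-amd-amdhsa` means the stack protector was requested for device code.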

Artem-B commented 8 months ago

It appears that -fstack-protector somehow got enabled on the GPU side. I'm not sure whether AMDGPU supports it, but I would expect that to be a problem for NVPTX.

Disabling stack protector on the GPU side should avoid the problem.

AngryLoki commented 8 months ago

/usr/lib/llvm/17/bin/clang++ on Gentoo enables -fstack-protector-strong for all targets in /etc/clang/x86_64-pc-linux-gnu-clang.cfg -> gentoo-common.cfg -> gentoo-hardened.cfg.
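
To see which configuration file a given clang picks up, the `Configuration file:` line of `clang --version` can be extracted. This is only a sketch: the helper name `clang_config_file` is hypothetical, and the sample input is the `clang --version` output from the report rather than a live invocation.

```shell
# Hypothetical helper: print the config file path found in `clang --version`
# output supplied on stdin.
clang_config_file() {
    sed -n 's/^Configuration file: //p'
}

# Sample taken from the report; on a real system, pipe
# `/usr/lib/llvm/17/bin/clang++ --version` into the helper instead.
printf '%s\n' \
    'clang version 17.0.6' \
    'Target: x86_64-pc-linux-gnu' \
    'Configuration file: /etc/clang/x86_64-pc-linux-gnu-clang.cfg' |
    clang_config_file
```

The printed file is the starting point for following the chain of included config files described above. Recent clang releases also accept `--no-default-config` to skip default configuration files, which may help confirm that a flag comes from the distribution config rather than from the build system.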

This was previously discussed in https://github.com/llvm/llvm-project/issues/62066 and fixed by https://github.com/llvm/llvm-project/pull/70799 in the 18.1.0 release.

Additionally, on the Gentoo side, multiple patches were added to hipcc and rocm-runtime so that -fno-stack-protector is passed when the user compiles code with the hipcc wrapper or from the ROCm runtime while using Clang-17 (sorry, can't do better than that; Gentoo does not backport patches for LLVM). Just use hipcc; it adds multiple flags, as described in https://wiki.gentoo.org/wiki/HIP#hipcc_.28Clang_wrapper.29

Regarding Clang-18 support in HIP: today I ran a few experiments, and with a few patches it worked, but I ran into huge memory consumption (https://github.com/llvm/llvm-project/issues/86332), which looks like a blocker... So Gentoo will probably stay on LLVM-17 for hipcc for the near future.

JonChesterfield commented 7 months ago

I don't believe amdgpu has stack-protector either. I would guess the desired behaviour of -x cuda -fstack-protector would be to enable the stack protector on the x64 code and do nothing on the gpu code, at least until such time as that's implemented on the gpu. Maybe emit a warning in the meantime.

Do we have a general-purpose way to pass one argument to the host clang invocation and a different argument to the device invocation? OpenMP has/had some means of doing that which worked in some cases.

Artem-B commented 7 months ago

We do not have a consistent way to handle arguments that don't have the same level of support on the host and the GPU. So far, in the most commonly encountered cases (e.g. sanitizers), we've been filtering out such arguments on a case-by-case basis, and that's not ideal.

We do have -Xarch_host and -Xarch_device, which may be used to override top-level flags, but they do not always work if the top-level flags get converted into a set of different cc1 arguments.
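
As an illustration of the `-Xarch_device` approach, the sketch below prints a command line that would leave the host stack protector alone and request `-fno-stack-protector` for the device only. It rests on an assumption: that this clang forwards `-fno-stack-protector` through `-Xarch_device`, which, per the caveat above, is not guaranteed. The helper name `device_ssp_flags` and the `CLANG`/`SRC` variables are placeholders, and the `-D`, `-I` and `--rocm-device-lib-path` options from the original command are omitted for brevity.

```shell
# Sketch only: flags intended to disable the stack protector for device code
# while leaving host code untouched.
# Assumption: clang forwards -fno-stack-protector via -Xarch_device.
device_ssp_flags() {
    printf '%s\n' "-Xarch_device -fno-stack-protector"
}

# Placeholders; the defaults reuse the compiler path and file name from the report.
CLANG=${CLANG:-/usr/lib/llvm/17/bin/clang++}
SRC=${SRC:-ggml-cuda.cu}

# Print the command instead of running it, so the sketch is safe to paste.
echo "$CLANG -x hip $(device_ssp_flags) -S -emit-llvm -o ggml-cuda.cu.ll $SRC"
```

If the flag is honored, the regenerated device IR should no longer carry `sspstrong` on the kernels. A blunter fallback is a plain `-fno-stack-protector`, which disables the protector on the host as well.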

llvm / llvm-project

Crash when trying to compile `ggml-cuda.cu` from llama.cpp #83777