JohannesGaessler opened 7 months ago

I am one of the developers of llama.cpp. The Phoronix article claims that ZLUDA works with llama.cpp, but I cannot get it to work on my RX 6800. Moreover, without a patch llama.cpp fails very early when it tries to enable VMM, which is not supported with HIP/AMD. Even with a patch to disable VMM I am not able to get it to work; the program either crashes when creating CUDA streams with "operation not supported" or when running a CUDA kernel with "named symbol not found". Instructions for llama.cpp would be appreciated.
It's very possible it does not work with recent llama.cpp. I tried llama.cpp a year ago, specifically commit 2322ec223a21625dfe9bd73ee677444a98a24ac9. I'm not staying on top of every application; even for Blender, that Phoronix benchmark was the first time ZLUDA was used with Blender 4.0.
Here's how I built it:
```
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF -DCMAKE_C_FLAGS='-march=native -pthread' -DCMAKE_CXX_FLAGS='-march=native -pthread' -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CUDA_FLAGS="-ccbin=g++-8 -keep" -GNinja -DCMAKE_BUILD_TYPE=Debug -DGGML_DEBUG=10
```
And I ran it with:
```
LD_LIBRARY_PATH="/home/vosen/dev/zluda_private/target/debug:/opt/rocm/lib" bin/./main -m /media/vosen/8A98339298337C2F/Users/vosen/Downloads/LLaMA/7B/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 99999 -t 1 -s 1686943578
```
Please wait a few days. I'll have an article explaining how to produce useful debugging information.
I just found the repo a few days ago and I haven't tried it yet, but I'm very excited to find time to test it out. I also have AMD cards. I think just compiling the latest llama.cpp with make LLAMA_CUBLAS=1 will do, then overriding the environment variables for your specific GPU and following the instructions to use ZLUDA. At least that's as far as I understand how it can work. I'm going to try this weekend.
We should make a Discord group for ZLUDA and llama.cpp.
Why use a locked-down platform when Matrix exists?
I took a look at llama.cpp and it requires minor additions to the compiler to compile the first GPU module. I'm out most of this week, but I'll continue next week.
Or maybe not. Please try this pull request: #102. I'm getting a different text output than on an NVIDIA card. Is it ok?
I ran it like this:
```
LD_LIBRARY_PATH=/home/vosen/dev/zluda_private/target/release ./main -m /home/vosen/Downloads/llama-2-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 99999 -t 1 -s 1686943578
```
I almost got it working on Arch. I didn't expect it to actually work, since Arch packages ROCm 6.0 and CUDA 12.3, but it built fine.
```
% make LLAMA_CUBLAS=1 NVCC=/opt/cuda/bin/nvcc llama-bench
% LD_LIBRARY_PATH="/home/romain/Downloads/ZLUDA/target/release:$LD_LIBRARY_PATH" ./llama-bench -m ../models/LLaMA2-13B-Estopia-Q4_K_S.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: AMD Radeon RX 6750 XT [ZLUDA], compute capability 8.8, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
:0:rocdevice.cpp :2726: 68612894190 us: [pid:171393 tid:0x7f50d238f000] Callback: Queue 0x7f4fd0300000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
```
> I'm getting a different text output than on an NVIDIA card. Is it ok?
There is a binary called `perplexity` which, as the name implies, can be used to calculate the perplexity over a text corpus. If the difference between HIP and ZLUDA is within rounding error, the results should be the same. Note: nowadays a better metric to check is the Kullback-Leibler divergence, but that needs a little more setup.
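For reference, the standard definitions (nothing llama.cpp-specific): perplexity over a corpus of $N$ tokens, and the KL divergence between the token distribution $P$ of the reference backend and $Q$ of the backend under test:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)
\qquad
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{t} P(t)\,\log\frac{P(t)}{Q(t)}
$$

$D_{\mathrm{KL}}$ is zero exactly when both backends produce identical token probabilities, which is what makes it the more sensitive check.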
Wrt. performance: if compute capability is not enough information, then ZLUDA could add a CUDA extension to surface whatever llama.cpp needs, the simplest bit being the underlying HIP device arch name (e.g. "gfx1030" for the Radeon RX 6800 XT). We wouldn't even need dlopen/dlsym, just a single header using already existing CUDA extension capabilities. Would that help? Theoretically of course, if ZLUDA stays alive.
Does this discussion even make sense, since llama.cpp already has an optimized HIP backend?
> Wrt. performance: if compute capability is not enough information, then ZLUDA could add a CUDA extension to surface whatever llama.cpp needs, the simplest bit being the underlying HIP device arch name (e.g. "gfx1030" for the Radeon RX 6800 XT). We wouldn't even need dlopen/dlsym, just a single header using already existing CUDA extension capabilities. Would that help?
Sorry, I don't understand what you mean. In any case, setting CC 7.0 for RDNA3 and CC 6.1 otherwise should get you ~95% of the optimal performance. Notably, the tile sizes for mul_mat_q need to be known at compile time already; it is not possible to change them at runtime.
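To illustrate why, here's a minimal sketch (not llama.cpp's actual code; the kernel name and tile values are made up): the tile sizes are template parameters that size `__shared__` arrays, so every variant has to be instantiated at compile time:

```cpp
// Hedged sketch, not the real mul_mat_q: tile sizes are template parameters
// because they size shared-memory arrays, which must be compile-time constants.
template <int mmq_x, int mmq_y, int nwarps>
__global__ void mul_mat_q(const int * x, const int * y, int * dst) {
    // The tile dimensions are baked into the compiled binary here.
    __shared__ int tile_x[mmq_y][mmq_x];
    // ... load tiles, multiply, accumulate into dst ...
}

// One instantiation per target compute capability (values illustrative):
//   mul_mat_q< 64, 128, 8><<<grid, block>>>(x, y, dst);  // e.g. for CC 6.1
//   mul_mat_q<128, 128, 4><<<grid, block>>>(x, y, dst);  // e.g. for CC 7.0
```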
> Does this discussion even make sense, since llama.cpp already has an optimized HIP backend?
Depends on how ZLUDA performs vs. HIP, I'd say.
Hmmm, I assumed this: "tile sizes are fixed for a given architecture, llama.cpp compiles several variants for whatever architectures were chosen at compile time, and then at run time llama.cpp code chooses the appropriate kernel (mul_mat_q_cc61, mul_mat_q_cc70)". But I had a quick look at the code and that's not really the case: llama.cpp makes the major decisions (BLAS or its own kernels), and the particular kernel is chosen by the driver from the fatbin (and the fatbin choice must match the CC expected by the runtime?).
Because if it were the former, then ZLUDA could expose to llama.cpp a function `CUresult zludaDeviceGetHIPName(char* name, int len, CUdevice dev)` (just a strawman) and llama.cpp could modify its selection to take into account the case of running on CUDA-but-not-really-it-is-ZLUDA.
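A minimal sketch of what consuming that strawman could look like on the llama.cpp side (everything below is hypothetical; only the signature comes from the comment above):

```cpp
// Hypothetical ZLUDA extension: this function does not exist today.
#include <cuda.h>
#include <cstring>

extern "C" CUresult zludaDeviceGetHIPName(char * name, int len, CUdevice dev);

// Hypothetical selection logic that special-cases running under ZLUDA:
static const char * pick_mul_mat_q_variant(CUdevice dev) {
    char arch[32];
    if (zludaDeviceGetHIPName(arch, sizeof(arch), dev) == CUDA_SUCCESS) {
        // Running under ZLUDA: choose tiles by HIP arch name instead of CC.
        if (strcmp(arch, "gfx1030") == 0) {
            return "mul_mat_q_rdna2"; // made-up variant name
        }
    }
    return "mul_mat_q_default"; // plain CUDA path, selected by CC as before
}
```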
> tile sizes are fixed for a given architecture, llama.cpp compiles several variants for whatever architectures were chosen at compile time, and then at run time llama.cpp code chooses the appropriate kernel (mul_mat_q_cc61, mul_mat_q_cc70)
This is how I implemented it at first, but the issue is that this causes the compile time and binary size to increase quadratically with the number of architectures, because for each architecture you would be compiling not only the kernel version that will actually be used but also the kernel versions for all other architectures (with e.g. 5 target architectures you would compile 25 kernel variants instead of 5).
> and the fatbin choice must match the CC expected by the runtime?
Yes, otherwise the results would be incorrect.
That's inconvenient, because I was thinking that ZLUDA could pick the appropriate optimal-CC module from the fatbin (setting aside the mechanism for it).
For example, I'm looking at `ggml_mul_mat_q4_0_q8_1_cuda`. I can see that llama.cpp uses the results of `cudaGetDeviceProperties(...)` to get the CC. Theoretically, would it be a big problem to instead use `cudaFuncGetAttributes(...)` and the `binaryVersion` field?
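A minimal sketch of the difference (the stub kernel is made up; `cudaGetDeviceProperties`, `cudaFuncGetAttributes`, and the `binaryVersion` field are real CUDA runtime API):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a real llama.cpp kernel such as mul_mat_q.
__global__ void mul_mat_q_stub() {}

int main() {
    // What llama.cpp currently queries: the device's compute capability.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device CC: %d.%d\n", prop.major, prop.minor);

    // The alternative: the CC of the binary the runtime actually selected
    // from the fatbin for this particular kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, mul_mat_q_stub);
    printf("binary CC: %d\n", attr.binaryVersion); // major*10 + minor, e.g. 61
    return 0;
}
```

The idea being that ZLUDA could then report, per kernel, the CC of the fatbin entry it actually picked.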
That should work for the CUDA code (and probably better than the current code). The question is what to do for HIP. There does seem to be an equivalent `hipFuncGetAttributes`, but the documentation doesn't explicitly say that it does the same thing.
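For comparison, the HIP-side call would look like this (a sketch; `hipFuncAttributes` does have a `binaryVersion` field, but as noted it is not clearly documented whether it means the same thing on AMD hardware):

```cpp
#include <cstdio>
#include <hip/hip_runtime.h>

__global__ void mul_mat_q_stub() {}

int main() {
    hipFuncAttributes attr;
    hipFuncGetAttributes(&attr, reinterpret_cast<const void *>(mul_mat_q_stub));
    // Whether this mirrors CUDA's binaryVersion semantics is the open question.
    printf("binaryVersion: %d\n", attr.binaryVersion);
    return 0;
}
```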
Sup guys, has somebody already managed to run llama.cpp with ZLUDA? Can you share the steps please? I would like to test it.
I created a PR with some changes for q4_0: https://github.com/ggerganov/llama.cpp/pull/5554. Is this how you imagined it?
Yes, exactly this. Now I'll see what can be done on the ZLUDA side.
One of the contributors to rocBLAS told me to try this in a Docker container. He claims it makes it possible to use all the AMD GPUs in your PC, old ones, new ones, etc. I haven't tested it yet, but here's the gist:
https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea33#file-build-llama-cpp-sh-L3
Any Discord group for llama.cpp with ZLUDA, @userbox020?