ZLUDA for llama.cpp #64

Open JohannesGaessler opened 7 months ago

JohannesGaessler commented 7 months ago

I am one of the developers of llama.cpp. The Phoronix article claims that ZLUDA works with llama.cpp, but I cannot get it to work on my RX 6800. Moreover, without a patch llama.cpp fails very early when it tries to enable VMM, which is not supported with HIP/AMD. Even with a patch to disable VMM I am not able to get it to work: the program crashes either when creating CUDA streams with "operation not supported" or when running a CUDA kernel with "named symbol not found". Instructions for llama.cpp would be appreciated.
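For context, by "patch to disable VMM" I mean something along the lines of guarding the VMM-based pool behind a device capability check. This is not the actual llama.cpp code, just a minimal sketch of that kind of guard using the CUDA driver API:

```cpp
// Sketch only (not the llama.cpp patch): query whether the device supports
// virtual memory management before using a VMM-based pool, and fall back to a
// legacy allocation pool otherwise. ZLUDA/HIP devices are expected to report 0.
#include <cuda.h>
#include <cstdio>

static bool device_supports_vmm(CUdevice dev) {
    int vmm_supported = 0;
    CUresult err = cuDeviceGetAttribute(
        &vmm_supported,
        CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED,
        dev);
    return err == CUDA_SUCCESS && vmm_supported != 0;
}

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    if (device_supports_vmm(dev)) {
        printf("VMM supported: a cuMemCreate/cuMemMap based pool can be used\n");
    } else {
        printf("VMM not supported: fall back to a plain allocation pool\n");
    }
    return 0;
}
```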

vosen commented 7 months ago

It's very possible it does not work with recent llama.cpp. I last tried llama.cpp a year ago, specifically commit 2322ec223a21625dfe9bd73ee677444a98a24ac9. I'm not staying on top of every application; even for Blender, that Phoronix benchmark was the first time ZLUDA was used with Blender 4.0.

Here's how I built it:

```
cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF -DCMAKE_C_FLAGS='-march=native -pthread' -DCMAKE_CXX_FLAGS='-march=native -pthread' -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc -DCMAKE_CUDA_FLAGS="-ccbin=g++-8 -keep" -GNinja -DCMAKE_BUILD_TYPE=Debug -DGGML_DEBUG=10
```

And I ran it with:

LD_LIBRARY_PATH="/home/vosen/dev/zluda_private/target/debug:/opt/rocm/lib" bin/./main -m /media/vosen/8A98339298337C2F/Users/vosen/Downloads/LLaMA/7B/ggml-model-q4_0.bin  -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 99999 -t 1 -s 1686943578

Please wait a few days; I'll have an article explaining how to produce useful debugging information.

userbox020 commented 7 months ago

I just found the repo a few days ago and haven't tried it yet, but I'm very excited to make time to test it out. I also have AMD cards. I think just compiling the latest llama.cpp with `make LLAMA_CUBLAS=1` will do, then overriding the environment variables for your specific GPU and following the instructions to use ZLUDA. That's as far as I understand how it could work. Going to try this weekend.

userbox020 commented 7 months ago

We should make a Discord group for ZLUDA and llama.cpp.

jaredmontoya commented 7 months ago

Why use a locked-down platform when Matrix exists?

vosen commented 7 months ago

I took a look at llama.cpp and it requires minor additions to the compiler to compile the first GPU module. I'm out most of this week, but I'll continue next week.

vosen commented 7 months ago

Or maybe not. Please try this pull request: #102. I'm getting a different text output than on an NVIDIA card. Is it ok?

I ran it like this:

```
LD_LIBRARY_PATH=/home/vosen/dev/zluda_private/target/release  ./main -m /home/vosen/Downloads/llama-2-7b.Q4_K_M.gguf  -p "Building a website can be done in 10 simple steps:" -n 512 -ngl 99999 -t 1 -s 1686943578
```

and my output:

```
./main: /home/vosen/dev/zluda_private/target/release/libcublas.so.11: no version information available (required by ./main)
Log start
main: build = 2150 (594fca3f)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1686943578
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: AMD Radeon RX 6800 XT [ZLUDA], compute capability 8.8, VMM: no
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /home/vosen/Downloads/llama-2-7b.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0,000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0,000000, 0,000000, 0,000000, 0,0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V2
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0,0e+00
llm_load_print_meta: f_norm_rms_eps = 1,0e-05
llm_load_print_meta: f_clamp_kqv = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 6,74 B
llm_load_print_meta: model size = 3,80 GiB (4,84 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0,22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 70,31 MiB
llm_load_tensors: CUDA0 buffer size = 3820,94 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 256,00 MiB
llama_new_context_with_model: KV self size = 256,00 MiB, K (f16): 128,00 MiB, V (f16): 128,00 MiB
llama_new_context_with_model: CUDA_Host input buffer size = 10,01 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 70,50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8,00 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 1 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1,100, frequency_penalty = 0,000, presence_penalty = 0,000 top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800 mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0

Building a website can be done in 10 simple steps: Step 1. Creating the foundation of your site - Domain name, hosting and domain registration The first step to building an online presence is choosing a domain name for your web address. You will need to register the name with the appropriate registrar (in our case, it’s GoDaddy) as well as buy the necessary domain name and hosting to launch your site. Registering a domain name is a very important part of building your website. It’s the first thing people see when they visit you online, so make sure that it’s professional, memorable and easy for your customers to find you. You can register your domain through a registrar like GoDaddy or Bluehost who will help you create an account, purchase hosting services (if necessary), set up DNS records for the site (if necessary) as well as give advice on how best use their system. If you’re setting up a new website but want more control over your hosting needs than traditional shared hosting provides then consider using VPS Hosting instead which gives users access to dedicated resources that aren't shared among other customers like regular plans do! Once everything has been set up correctly on both ends it should take less than 10 minutes total before visitors start seeing your site live online. Step 2. Choose a website builder and pick your design One of the first steps to building your website is choosing an appropriate website builder for you. There are many options, but if you're looking for something that offers great features without breaking the bank then consider using Wix or Weebly as they offer both free and paid plans with unlimited storage space and bandwidth (so there isn't any limit on how much content can be stored). Once you have chosen your website builder, it’s time to decide what design layout you want for your site. You can start by choosing between one of the many templates that are available or create a custom design from scratch if you prefer! If this is not possible then consider using WordPress which allows users to choose any theme they like and install plugins as well as add widgets onto their page layout so it looks exactly how they want it too. Step 3. Choose your domain name (if applicable) and set up hosting for it Now that you have chosen a web hosting plan, it’s time to select a domain name if one is needed. In some cases, this step will be skipped or taken care of by the website builder itself

llama_print_timings: load time = 9844,08 ms
llama_print_timings: sample time = 47,81 ms / 512 runs ( 0,09 ms per token, 10709,06 tokens per second)
llama_print_timings: prompt eval time = 43,23 ms / 14 tokens ( 3,09 ms per token, 323,83 tokens per second)
llama_print_timings: eval time = 7921,42 ms / 511 runs ( 15,50 ms per token, 64,51 tokens per second)
llama_print_timings: total time = 8095,22 ms / 525 tokens
Log end
```
Artefact2 commented 7 months ago

I almost got it working on Arch. I didn't expect it to actually work, since Arch packages ROCm 6.0 and CUDA 12.3, but it built fine.

```
% make LLAMA_CUBLAS=1 NVCC=/opt/cuda/bin/nvcc llama-bench
% LD_LIBRARY_PATH="/home/romain/Downloads/ZLUDA/target/release:$LD_LIBRARY_PATH" ./llama-bench -m ../models/LLaMA2-13B-Estopia-Q4_K_S.gguf
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: AMD Radeon RX 6750 XT [ZLUDA], compute capability 8.8, VMM: no
| model                          |       size |     params | backend    | ngl | test       |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------- | ---------------: |
:0:rocdevice.cpp            :2726: 68612894190 us: [pid:171393 tid:0x7f50d238f000] Callback: Queue 0x7f4fd0300000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
```
JohannesGaessler commented 7 months ago

> I'm getting a different text output than on an NVIDIA card. Is it ok?

There is a binary called perplexity which, as the name implies, can be used to calculate the perplexity over a text corpus. If the difference between HIP and ZLUDA is within rounding error, the results should be the same. Note: nowadays a better metric to check is the Kullback-Leibler divergence, but that needs a little more setup.
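For reference, roughly what the two metrics compute (the notation here is mine, not from the llama.cpp docs):

```latex
% Perplexity of the model over a corpus of N tokens x_1, ..., x_N:
\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)

% Kullback-Leibler divergence of the test backend's per-token distribution q
% from the reference backend's distribution p (over the vocabulary V),
% which is then averaged over positions:
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in V} p(x) \log \frac{p(x)}{q(x)}
```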

vosen commented 6 months ago

Wrt. performance: if compute capability is not enough information, ZLUDA could add a CUDA extension to surface whatever llama.cpp needs, the simplest bit being the underlying HIP device arch name (e.g. "gfx1030" for the Radeon 6800 XT). We wouldn't even need dlopen/dlsym, just a single header using already existing CUDA extension capabilities. Would that help? Theoretically, of course, if ZLUDA stays alive.

Does this discussion even make sense, given that llama.cpp already has an optimised HIP backend?

JohannesGaessler commented 6 months ago

> Wrt. performance: if compute capability is not enough information, ZLUDA could add a CUDA extension to surface whatever llama.cpp needs, the simplest bit being the underlying HIP device arch name (e.g. "gfx1030" for the Radeon 6800 XT). We wouldn't even need dlopen/dlsym, just a single header using already existing CUDA extension capabilities. Would that help?

Sorry, I don't understand what you mean. In any case, setting CC 7.0 for RDNA3 and CC 6.1 otherwise should get you ~95% of the optimal performance. Notably, the tile sizes for mul_mat_q need to already be known at compile time; it is not possible to change them at runtime.
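To illustrate why the tile shape is a compile-time property, here is a simplified sketch (not the actual mul_mat_q code; the kernel name and tile values are placeholders):

```cpp
// Simplified sketch: the tile shape is chosen with the preprocessor from
// __CUDA_ARCH__, so it is baked into each compiled variant of the kernel and
// cannot change at run time. The specific sizes below are made up.
__global__ void mul_mat_q_sketch(const float * x, const float * y, float * dst) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 700
    constexpr int tile_x = 128;  // hypothetical tile size for CC >= 7.0
    constexpr int tile_y = 64;
#else
    constexpr int tile_x = 64;   // hypothetical tile size for older architectures
    constexpr int tile_y = 32;
#endif
    // Shared memory is sized with these compile-time constants, which is what
    // ties the tile shape to the compiled binary rather than to the device
    // that happens to be present at run time.
    __shared__ float tile[tile_x][tile_y];
    (void) tile; (void) x; (void) y; (void) dst;
}
```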

> Does this discussion even make sense, given that llama.cpp already has an optimised HIP backend?

Depends on how ZLUDA performs vs. HIP I'd say.

vosen commented 6 months ago

Hmmm, I assumed this: "tile sizes are fixed for a given architecture; llama.cpp compiles several variants for whatever architectures were chosen at compile time, and then at run time the llama.cpp code chooses the appropriate kernel (mul_mat_q_cc61, mul_mat_q_cc70)". But I had a quick look at the code and that's not really the case. llama.cpp makes the major decisions (BLAS or its own kernels), and the particular kernel is chosen by the driver from the fatbin (and the fatbin choice must match the CC expected by the runtime?).

Because if it were the former, then ZLUDA could expose to llama.cpp a function `CUresult zludaDeviceGetHIPName(char* name, int len, CUdevice dev)` (just a strawman) and llama.cpp could modify its selection to take into account the case of running on CUDA-but-not-really-it-is-ZLUDA.
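To make the strawman concrete, the llama.cpp side might look roughly like this; `zludaDeviceGetHIPName` does not exist, it is purely the hypothetical extension above, and the selection logic is only an illustration:

```cpp
// Strawman only: zludaDeviceGetHIPName is a hypothetical ZLUDA extension, not
// an existing API. llama.cpp would use it when present and fall back to plain
// compute capability otherwise.
#include <cuda.h>
#include <cstdio>
#include <cstring>

typedef CUresult (*zludaDeviceGetHIPName_t)(char * name, int len, CUdevice dev);

static void select_kernel_variant(CUdevice dev, zludaDeviceGetHIPName_t zludaDeviceGetHIPName) {
    char arch[64] = {0};
    if (zludaDeviceGetHIPName != nullptr &&
        zludaDeviceGetHIPName(arch, sizeof(arch), dev) == CUDA_SUCCESS) {
        // Running on ZLUDA: pick kernels/tile sizes based on the real HIP arch.
        if (strncmp(arch, "gfx103", 6) == 0) {
            printf("RDNA2 (%s): use an RDNA2-tuned variant\n", arch);
        } else {
            printf("unknown arch (%s): use a conservative variant\n", arch);
        }
    } else {
        // Plain CUDA (or no extension): fall back to compute capability.
        int major = 0, minor = 0;
        cuDeviceGetAttribute(&major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
        cuDeviceGetAttribute(&minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
        printf("CC %d.%d: use the existing CC-based selection\n", major, minor);
    }
}

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    // Without the (hypothetical) extension this just exercises the CC fallback path.
    select_kernel_variant(dev, nullptr);
    return 0;
}
```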

JohannesGaessler commented 6 months ago

> tile sizes are fixed for a given architecture; llama.cpp compiles several variants for whatever architectures were chosen at compile time, and then at run time the llama.cpp code chooses the appropriate kernel (mul_mat_q_cc61, mul_mat_q_cc70)

This is how I implemented it at first, but the issue is that it makes compile time and binary size increase quadratically with the number of architectures: for each architecture you would be compiling not only the kernel version that will actually be used, but also the kernel versions for all other architectures (e.g. with 5 target architectures you end up building 25 kernel variants instead of 5).

> and the fatbin choice must match the CC expected by the runtime?

Yes, otherwise the results would be incorrect.

vosen commented 6 months ago

That's inconvenient, because I was thinking that ZLUDA could pick the appropriate optimal-CC module from the fatbin (setting aside the mechanism for it). For example, I'm looking at `ggml_mul_mat_q4_0_q8_1_cuda`: I can see that llama.cpp uses the result of `cudaGetDeviceProperties(...)` to get the CC. Theoretically, would it be a big problem to instead use `cudaFuncGetAttributes(...)` and its `binaryVersion` field?
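A minimal sketch of the difference between the two approaches, with a placeholder kernel standing in for the real mul_mat_q kernels:

```cpp
// Sketch of the two approaches (the kernel is a placeholder). With
// cudaGetDeviceProperties the CC describes the physical device; with
// cudaFuncGetAttributes, binaryVersion describes the architecture of the
// binary actually selected from the fatbin for this particular kernel.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void mul_mat_q_kernel() {}  // placeholder kernel

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("device CC: %d.%d\n", prop.major, prop.minor);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, mul_mat_q_kernel);
    // binaryVersion is encoded as major * 10 + minor, e.g. 61 for CC 6.1.
    printf("binary CC of the selected kernel image: %d.%d\n",
           attr.binaryVersion / 10, attr.binaryVersion % 10);
    return 0;
}
```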

JohannesGaessler commented 6 months ago

That should work for the CUDA code (and probably better than the current code). The question is what to do for HIP: there does seem to be an equivalent `hipFuncGetAttributes`, but the documentation doesn't explicitly say that it does the same thing.

userbox020 commented 6 months ago

Sup guys, has anybody managed to run llama.cpp with ZLUDA yet? Can you share the steps please? I would like to test it.

JohannesGaessler commented 6 months ago

I created a PR with some changes for q4_0: https://github.com/ggerganov/llama.cpp/pull/5554. Is this how you imagined it?

vosen commented 6 months ago

Yes, exactly this. Now I'll see what can be done on the ZLUDA side.

userbox020 commented 6 months ago

One of the contributors to rocBLAS told me to try this in a Docker container. He claims it makes it possible to use all the AMD GPUs in your PC: old ones, new ones, etc. I haven't tested it yet, but here is the gist:

https://gist.github.com/cgmb/be113c04cd740425f637aa33c3e4ea33#file-build-llama-cpp-sh-L3

0xTong commented 1 month ago

Is there any Discord group for llama.cpp with ZLUDA? @userbox020