Closed · Raphy42 closed this 6 months ago
https://github.com/utilityai/llama-cpp-rs/commit/756646ccffaca8049198d81ab184d56083f9162d
Did this commit fix the issue?
Does it fail to compile or does it crash?
Forgive me, Mac and metal are completely foreign to me.
Sadly it doesn't. I'm currently messing around with my own public fork, which uses the cmake crate to build everything, and that works!
I don't remember exactly what I did anymore, but I successfully changed the build.rs to build ggml-metal-embed.metal, which is then copied into the package workspace; I then added linking arguments until it worked.
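For context, a minimal sketch of the kind of linking arguments a build.rs would need to emit on macOS so the Metal backend links. The framework list here is an assumption based on llama.cpp's own Metal build, not the fork's actual build script:

```rust
// Hypothetical helper: the cargo link directives a build.rs might print
// for the Metal backend on macOS. These framework names are assumptions
// mirroring llama.cpp's CMake setup, not llama-cpp-rs's actual script.
fn metal_link_directives() -> Vec<String> {
    [
        "cargo:rustc-link-lib=framework=Foundation",
        "cargo:rustc-link-lib=framework=Metal",
        "cargo:rustc-link-lib=framework=MetalKit",
        "cargo:rustc-link-lib=framework=Accelerate",
    ]
    .iter()
    .map(|s| s.to_string())
    .collect()
}

fn main() {
    // In a real build.rs these would be printed only when targeting macOS,
    // e.g. guarded by checking CARGO_CFG_TARGET_OS == "macos".
    for directive in metal_link_directives() {
        println!("{directive}");
    }
}
```

In a real build script these lines are written to stdout, where cargo picks them up and forwards them to the linker.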
llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from /Users/<>/.cache/huggingface/hub/models--andrewcanis--c4ai-command-r-v01-GGUF/snapshots/7629a21caf04be51c9010f3ece50e4f8178e0ef1/c4ai-command-r-v01-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = command-r
llama_model_loader: - kv 1: general.name str = 9fe64d67d13873f218cb05083b6fc2faab2d034a
llama_model_loader: - kv 2: command-r.block_count u32 = 40
llama_model_loader: - kv 3: command-r.context_length u32 = 131072
llama_model_loader: - kv 4: command-r.embedding_length u32 = 8192
llama_model_loader: - kv 5: command-r.feed_forward_length u32 = 22528
llama_model_loader: - kv 6: command-r.attention.head_count u32 = 64
llama_model_loader: - kv 7: command-r.attention.head_count_kv u32 = 64
llama_model_loader: - kv 8: command-r.rope.freq_base f32 = 8000000.000000
llama_model_loader: - kv 9: command-r.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 15
llama_model_loader: - kv 11: command-r.logit_scale f32 = 0.062500
llama_model_loader: - kv 12: command-r.rope.scaling.type str = none
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,256000] = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,253333] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 5
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 255001
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 41 tensors
llama_model_loader: - type q4_K: 240 tensors
llama_model_loader: - type q6_K: 41 tensors
.......
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 1640.62 MiB
llm_load_tensors: Metal buffer size = 20519.41 MiB
........................................................................................
2024-03-20T20:57:00.994794Z DEBUG load_from_file: llama_cpp_2::model: Loaded model path="/Users/<>/.cache/huggingface/hub/models--andrewcanis--c4ai-command-r-v01-GGUF/snapshots/7629a21caf04be51c9010f3ece50e4f8178e0ef1/c4ai-command-r-v01-Q4_K_M.gguf"
llama_new_context_with_model: n_ctx = 131072
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Metal KV buffer size = 163840.00 MiB
llama_new_context_with_model: KV self size = 163840.00 MiB, K (f16): 81920.00 MiB, V (f16): 81920.00 MiB
llama_new_context_with_model: CPU output buffer size = 2000.00 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 17515415552
llama_new_context_with_model: failed to allocate compute buffers
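The 163840 MiB KV cache in the log above is expected for a full 131072-token context, not a bug in itself. A back-of-the-envelope check, assuming the standard f16 KV cache layout and the metadata dumped earlier (40 layers, 64 KV heads, head_dim = 8192 / 64 = 128):

```rust
// Rough KV cache size estimate for an f16 cache: K and V each store
// n_ctx * n_head_kv * head_dim elements per layer, 2 bytes per element.
// This is a sanity check against the log, not llama.cpp's exact accounting.
fn kv_cache_mib(n_ctx: u64, n_layer: u64, n_head_kv: u64, head_dim: u64) -> u64 {
    let bytes_per_elem = 2; // f16
    let bytes = 2 * n_ctx * n_layer * n_head_kv * head_dim * bytes_per_elem;
    bytes / (1024 * 1024)
}

fn main() {
    // Values from the command-r metadata dump above.
    println!("{} MiB", kv_cache_mib(131072, 40, 64, 128));
}
```

This matches the `KV self size = 163840.00 MiB` line, which is why the subsequent compute-buffer allocation of ~17.5 GB pushes past what Metal will grant; requesting a smaller n_ctx avoids it.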
But my changes break CUDA support and platforms other than macOS. I don't have the bandwidth to stabilise my fork right now; I will be sure to open a PR once I have reproducible builds on both platforms, but I don't know when. If anyone wants to give my fork a try, feel free to do so!
> Sadly it doesn't, I'm currently messing around with my own public fork which uses the cmake crate to build everything, which works!
I've been meaning to give this a try - I'd be happy to have a PR to this effect once it's ready. If you open a draft I can edit I can try to work on linux + cuda.
Sure! The current edits are a bit hacky, but this is my current impl: https://github.com/Raphy42/llama-cpp-rs/commit/44b6da48bb8d3522c51aa86eba2ded4096672777. I need to dig deeper into the llama.cpp build system in order to make the build.rs nicer and more parametric.
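To illustrate the cmake-crate approach being discussed, here is a small sketch of how a build.rs might select llama.cpp CMake options per platform. The option names (LLAMA_METAL, LLAMA_METAL_EMBED_LIBRARY) mirror llama.cpp's cache variables of that era; the helper itself is hypothetical:

```rust
// Hypothetical sketch: choose llama.cpp CMake defines by target OS.
// In a real build.rs each pair would feed cmake::Config::define(...) and
// the resulting lib dir would be passed back via cargo:rustc-link-search.
fn llama_cmake_defines(target_os: &str) -> Vec<(&'static str, &'static str)> {
    let mut defines = vec![("BUILD_SHARED_LIBS", "OFF")];
    if target_os == "macos" {
        // Embedding the Metal library avoids shipping a separate
        // ggml-metal.metal file next to the final binary.
        defines.push(("LLAMA_METAL", "ON"));
        defines.push(("LLAMA_METAL_EMBED_LIBRARY", "ON"));
    }
    defines
}

fn main() {
    for (k, v) in llama_cmake_defines("macos") {
        println!("-D{k}={v}");
    }
}
```

Keeping the platform logic in one function like this is what would make the build.rs "parametric": adding CUDA would be another branch emitting its own defines rather than a second code path.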
@Raphy42 does #221 fix it? (no llama_hack, unsure if needed if using cmake)
May be fixed on latest. Unable to test myself, but #224 apparently fixes this; please let me know.
> @Raphy42 does #221 fix it? (no llama_hack, unsure if needed if using cmake)
Yeah, "0.1.45" works out of the box for me, tested on gemma-7b and multiple llama-2 variants with the same inference speed as before! Amazing job!
I just merged a PR. All credit to derrickpersson!
Glad it works!
I updated to the latest version of the library, as I needed the `command-r` architecture support, but the current crates.io and `main` versions crash on macOS due to `metal_hack` breaking with the latest version of llama.cpp.
The culprit is `ggml-common.h`, which is not available to the bundled shader. I have tried replacing the `.h` include with its actual content prior to putting it inside the `.m` loader, but it's not that simple and is not going to be maintainable at all. I saw on the llama.cpp issues that this could be fixed by having the `default.metallib` built by the CMake project, but this would imply modifying the current `build.rs` heavily, and I have no CUDA-compatible machine.
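The "replace the .h with its actual content" idea above can be done mechanically at build time rather than by hand, which is roughly what llama.cpp's own embed step does. A minimal sketch, with hypothetical function and file contents, assuming the shader pulls in the header via a plain `#include`:

```rust
// Hypothetical build-time step: inline the contents of ggml-common.h into
// the Metal shader source so the bundled shader no longer needs the header
// on disk. The replacement target matches a literal #include directive.
fn embed_common_header(shader_src: &str, common_h: &str) -> String {
    shader_src.replace("#include \"ggml-common.h\"", common_h)
}

fn main() {
    // Stand-in contents; a real build.rs would read both files from the
    // vendored llama.cpp sources with std::fs::read_to_string.
    let shader = "#include \"ggml-common.h\"\nkernel void noop() {}";
    let common = "// inlined contents of ggml-common.h";
    println!("{}", embed_common_header(shader, common));
}
```

Doing this in build.rs keeps it maintainable: the inlining re-runs on every update of the vendored sources instead of being a one-off manual edit.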