utilityai / llama-cpp-rs


ggml_metal_init: ggml-common.h not found #211

Closed. Raphy42 closed this issue 6 months ago.

Raphy42 commented 6 months ago

I updated to the latest version of the library, as I needed command-r architecture support, but the current crates.io and main versions crash on macOS because metal_hack breaks with the latest version of llama.cpp. The culprit is ggml-common.h, which is not available to the bundled shader. I have tried replacing the #include with the header's actual contents before putting it inside the .m loader, but it is not that simple and would not be maintainable at all.

ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "program_source:3:10: fatal error: 'ggml-common.h' file not found

I saw in the llama.cpp issues that this could be fixed by having default.metallib built by the CMake project, but this would imply modifying the current build.rs heavily, and I have no CUDA-compatible machine.
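Roughly, the replacement I attempted amounts to splicing the header into the shader source before it gets embedded; a minimal sketch (the helper name and paths are illustrative, not the crate's actual build.rs code):

```rust
// Hypothetical build.rs helper: splice the contents of ggml-common.h into
// ggml-metal.metal in place of the #include, so the embedded shader source
// no longer needs to resolve the header at runtime.
use std::fs;
use std::io;
use std::path::Path;

fn inline_ggml_common(llama_cpp_dir: &Path) -> io::Result<String> {
    let shader = fs::read_to_string(llama_cpp_dir.join("ggml-metal.metal"))?;
    let common = fs::read_to_string(llama_cpp_dir.join("ggml-common.h"))?;
    // Replace the include directive with the header's contents verbatim.
    Ok(shader.replace("#include \"ggml-common.h\"", &common))
}
```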

MarcusDunn commented 6 months ago

https://github.com/utilityai/llama-cpp-rs/commit/756646ccffaca8049198d81ab184d56083f9162d

Did this commit fix the issue?

Does it fail to compile or does it crash?

Forgive me, Mac and Metal are completely foreign to me.

Raphy42 commented 6 months ago

Sadly it doesn't. I'm currently experimenting with my own public fork, which uses the cmake crate to build everything, and that works! I don't remember exactly what I did anymore, but I successfully changed build.rs to build ggml-metal-embed.metal, which is then copied into the package workspace, and then added linking arguments until it worked:

llama_model_loader: loaded meta data with 23 key-value pairs and 322 tensors from /Users/<>/.cache/huggingface/hub/models--andrewcanis--c4ai-command-r-v01-GGUF/snapshots/7629a21caf04be51c9010f3ece50e4f8178e0ef1/c4ai-command-r-v01-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = 9fe64d67d13873f218cb05083b6fc2faab2d034a
llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   41 tensors
llama_model_loader: - type q4_K:  240 tensors
llama_model_loader: - type q6_K:   41 tensors
.......
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:        CPU buffer size =  1640.62 MiB
llm_load_tensors:      Metal buffer size = 20519.41 MiB
........................................................................................
2024-03-20T20:57:00.994794Z DEBUG load_from_file: llama_cpp_2::model: Loaded model path="/Users/<>/.cache/huggingface/hub/models--andrewcanis--c4ai-command-r-v01-GGUF/snapshots/7629a21caf04be51c9010f3ece50e4f8178e0ef1/c4ai-command-r-v01-Q4_K_M.gguf"
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      Metal KV buffer size = 163840.00 MiB
llama_new_context_with_model: KV self size  = 163840.00 MiB, K (f16): 81920.00 MiB, V (f16): 81920.00 MiB
llama_new_context_with_model:        CPU  output buffer size =  2000.00 MiB
ggml_gallocr_reserve_n: failed to allocate Metal buffer of size 17515415552
llama_new_context_with_model: failed to allocate compute buffers

But my changes break CUDA support and platforms other than macOS. I don't have the bandwidth to stabilise my fork right now, but I will be sure to open a PR once I have reproducible builds on both platforms; I don't know when that will be. If anyone wants to give my fork a try, feel free to do so!
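For reference, the approach amounts to something like the following build.rs sketch (assuming the cmake crate as a build dependency; the CMake options such as LLAMA_METAL / LLAMA_METAL_EMBED_LIBRARY, the produced library names, and the frameworks depend on the vendored llama.cpp revision and may differ from what my fork actually does):

```rust
// Sketch of a cmake-crate based build.rs for the Metal backend. Option and
// library names depend on the llama.cpp revision being vendored.
fn main() {
    let dst = cmake::Config::new("llama.cpp")
        .define("LLAMA_METAL", "ON")
        .define("LLAMA_METAL_EMBED_LIBRARY", "ON") // bake the shader into the library
        .define("BUILD_SHARED_LIBS", "OFF")
        .build();

    // Link the static library produced by the CMake install step.
    println!("cargo:rustc-link-search=native={}/lib", dst.display());
    println!("cargo:rustc-link-lib=static=llama");

    // macOS frameworks the Metal backend pulls in; these are the "linking
    // arguments" added until it worked.
    println!("cargo:rustc-link-lib=framework=Metal");
    println!("cargo:rustc-link-lib=framework=MetalKit");
    println!("cargo:rustc-link-lib=framework=Foundation");
    println!("cargo:rustc-link-lib=framework=Accelerate");
}
```

The cmake crate runs the configure, build, and install steps and returns the install prefix, which is what the rustc-link-search line points at.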

MarcusDunn commented 6 months ago

Sadly it doesn't. I'm currently experimenting with my own public fork, which uses the cmake crate to build everything, and that works!

I've been meaning to give this a try. I'd be happy to have a PR to this effect once it's ready. If you open a draft that I can edit, I can try to get it working on Linux + CUDA.

Raphy42 commented 6 months ago

Sure! The current edits are a bit hacky, but this is my current implementation: https://github.com/Raphy42/llama-cpp-rs/commit/44b6da48bb8d3522c51aa86eba2ded4096672777. I need to dig deeper into the llama.cpp build system in order to make the build.rs nicer and more parametric.
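"Parametric" here means driving the CMake defines from cargo features and the build target instead of hard-coding the macOS path; a hypothetical sketch (feature names and defines are illustrative, not the crate's actual ones):

```rust
// Hypothetical sketch: select llama.cpp CMake options from cargo features and the
// build target, so one build.rs can serve Metal on macOS and CUDA elsewhere.
use std::env;

fn main() {
    // Cargo exposes the target OS and enabled features to build scripts as
    // CARGO_CFG_TARGET_OS and CARGO_FEATURE_* environment variables.
    let target_os = env::var("CARGO_CFG_TARGET_OS").unwrap_or_default();
    let mut cfg = cmake::Config::new("llama.cpp");

    if target_os == "macos" && env::var("CARGO_FEATURE_METAL").is_ok() {
        cfg.define("LLAMA_METAL", "ON");
    }
    if env::var("CARGO_FEATURE_CUDA").is_ok() {
        cfg.define("LLAMA_CUBLAS", "ON"); // the CUDA switch at the time; renamed later
    }

    let dst = cfg.build();
    println!("cargo:rustc-link-search=native={}/lib", dst.display());
    println!("cargo:rustc-link-lib=static=llama");
}
```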

MarcusDunn commented 6 months ago

@Raphy42 does #221 fix it? (no llama_hack; unsure if it's needed when using cmake)

MarcusDunn commented 6 months ago

May be fixed on latest. I'm unable to test myself, but #224 apparently fixes this; please let me know.

Raphy42 commented 6 months ago

@Raphy42 does #221 fix it? (no llama_hack; unsure if it's needed when using cmake)

Yeah, "0.1.45" works out of the box for me, tested on gemma-7b and multiple llama-2 variants with the same inference speed from before ! Amazing job !

MarcusDunn commented 6 months ago

I just merged a PR. All credit to derrickpersson!

Glad it works!