utilityai / llama-cpp-rs

171 stars 50 forks source link

Text Generation on Mac outputs `<|reserved_special_token_247|>` #591

Open mojadem opened 1 day ago

mojadem commented 1 day ago

When trying to run any Llama model (such as https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF), text generation only outputs <|reserved_special_token_247|>.

Here is the output when running cargo run --release --bin simple -- --prompt "The way to kill a linux process is" local Llama-3.2-1B-Instruct-Q4_K_M.gguf

llama_model_loader: loaded meta data with 35 key-value pairs and 147 tensors from /Users/mojadem/.local/share/gguf/Llama-3.2-1B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   5:                         general.size_label str              = 1B
llama_model_loader: - kv   6:                            general.license str              = llama3.2
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 16
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  19:                          general.file_type u32              = 15
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  29:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  30:               general.quantization_version u32              = 2
llama_model_loader: - kv  31:                      quantize.imatrix.file str              = /models_out/Llama-3.2-1B-Instruct-GGU...
llama_model_loader: - kv  32:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  33:             quantize.imatrix.entries_count i32              = 112
llama_model_loader: - kv  34:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   34 tensors
llama_model_loader: - type q4_K:   96 tensors
llama_model_loader: - type q6_K:   17 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 16
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 1.24 B
llm_load_print_meta: model size       = 762.81 MiB (5.18 BPW)
llm_load_print_meta: general.name     = Llama 3.2 1B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =   762.81 MiB, (  762.88 / 10922.67)
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 17/17 layers to GPU
llm_load_tensors:        CPU buffer size =   205.49 MiB
llm_load_tensors:      Metal buffer size =   762.81 MiB
.......................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M2
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init:      Metal KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:      Metal compute buffer size =   254.50 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 518
llama_new_context_with_model: graph splits = 2
n_len = 32, n_ctx = 2048, k_kv_req = 32

The way to kill a linux process is<|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|><|reserved_special_token_247|>

decoded 24 tokens in 0.32 s, speed 74.74 t/s

load time = 604.40 ms
prompt eval time = 0.00 ms / 9 tokens (0.00 ms per token, inf tokens per second)
eval time = 0.00 ms / 23 runs (0.00 ms per token, inf tokens per second)

ggml_metal_free: deallocating
babichjacob commented 1 day ago

That is an instruct model. It's the Llama 3.2 1B base model but fine-tuned for chat usage (i.e. between a "user" and an "assistant"), so its ability to autocomplete text like this has been impacted.

You must either switch the model with a base model GGUF (which BTW bartowski doesn't provide any of - only instruct / chat) e.g. from here, or reformulate your prompt to fit how the instruct model works. The prompt template is found in the official documentation here, but bartowski is also great at providing it in the README.

So, to get the instruct model to address the query at hand, pass this as an argument instead. You can refine the text to what you find gives you the best results as long as you adhere to the required template:

--prompt "<|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 2 Dec 2024\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow do I kill a linux process?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThe way to kill a Linux process is"
babichjacob commented 1 day ago

That being said, I am also getting broken outputs out of the example on Windows. I tested with Llama 3.2 1B instruct like you, with or without the proper template, and with Gemma 2 2B instruct, with or without the proper template. I have to go now but I intend to investigate more later.

babichjacob commented 20 hours ago

It's just a bug upstream in llama.cpp that got fixed somewhere after the commit that is currently being submodule'd in to this repository. Didn't track down where or why, but there's only one line of code that needs to be changed here to be able to upgrade to the newest commit, thanks to all the work done in #580

Pull request coming shortly Actually, #590 already handles it! Additionally, #589 relates.

MarcusDunn commented 6 hours ago

should be resolved by #590

MarcusDunn commented 6 hours ago

nevermind - publish seems to have failed. I will look into it this weekend if no one else is able to make a PR

babichjacob commented 5 hours ago

I'll try my hand at it. It should just be updating the include files like last time