ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://ngxson.github.io/wllama/examples/basic/
MIT License

missing pre-tokenizer type #41

Closed: flatsiedatsie closed this issue 1 month ago

flatsiedatsie commented 1 month ago

I sometimes see this warning. Is it something to be worried about?

[Screenshot 2024-05-16 at 10 07 37]

Is this perhaps related to all .gguf files needing to be re-made after the llama.cpp project ran into a bug with Llama 3?

In this case the model crashed, but I suspect that has more to do with me not properly unloading the previous model (Mistral 7B) before switching to this one (Phi 2).

[Screenshot 2024-05-16 at 10 06 50]

ngxson commented 1 month ago

It seems like the problem comes from llama.cpp. Can you try loading the model with llama.cpp natively?

flatsiedatsie commented 1 month ago

The model crashes after a refresh of the page too, so the cause wasn't having Mistral loaded first. Hmm.

can you try loading the model with llama.cpp natively?

It's been a while since I last did that, but I'll try. The model loaded OK with Llama_cpp_wasm.

Here's the .gguf: https://huggingface.co/afrideva/phi-2-meditron-GGUF/resolve/main/phi-2-meditron.q4_k_m.gguf

flatsiedatsie commented 1 month ago

It runs with llama.cpp. It's not chunked, mind you (in fact, I was testing whether models below 2 GB still work without chunking, and it crashed on the first one :-D).

full log:

```
./main -m ./meditron.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
Log start
main: build = 2901 (3b3963c5)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: seed = 1715848707
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from ./meditron.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = phi2
llama_model_loader: - kv   1: general.name str = Phi2
llama_model_loader: - kv   2: phi2.context_length u32 = 2048
llama_model_loader: - kv   3: phi2.embedding_length u32 = 2560
llama_model_loader: - kv   4: phi2.feed_forward_length u32 = 10240
llama_model_loader: - kv   5: phi2.block_count u32 = 32
llama_model_loader: - kv   6: phi2.attention.head_count u32 = 32
llama_model_loader: - kv   7: phi2.attention.head_count_kv u32 = 32
llama_model_loader: - kv   8: phi2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv   9: phi2.rope.dimension_count u32 = 32
llama_model_loader: - kv  10: general.file_type u32 = 15
llama_model_loader: - kv  11: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv  12: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  13: tokenizer.ggml.tokens arr[str,51200] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14: tokenizer.ggml.token_type arr[i32,51200] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv  17: tokenizer.ggml.eos_token_id u32 = 50256
llama_model_loader: - kv  18: tokenizer.ggml.unknown_token_id u32 = 50256
llama_model_loader: - kv  19: general.quantization_version u32 = 2
llama_model_loader: - type  f32: 195 tensors
llama_model_loader: - type q4_K:  81 tensors
llama_model_loader: - type q5_K:  32 tensors
llama_model_loader: - type q6_K:  17 tensors
llm_load_vocab: missing pre-tokenizer type, using: 'default'
llm_load_vocab:
llm_load_vocab: ************************************
llm_load_vocab: GENERATION QUALITY WILL BE DEGRADED!
llm_load_vocab: CONSIDER REGENERATING THE MODEL
llm_load_vocab: ************************************
llm_load_vocab:
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = phi2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 51200
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2560
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 32
llm_load_print_meta: n_embd_head_k = 80
llm_load_print_meta: n_embd_head_v = 80
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2560
llm_load_print_meta: n_embd_v_gqa = 2560
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 10240
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 3B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 2.78 B
llm_load_print_meta: model size = 1.66 GiB (5.14 BPW)
llm_load_print_meta: general.name = Phi2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_tensors: ggml ctx size = 0.32 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size = 1634.33 MiB, ( 1634.39 / 10922.67)
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: Metal buffer size = 1634.32 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/teefrapsidasie/Downloads/llama_cpp_test/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
llama_kv_cache_init: Metal KV buffer size = 160.00 MiB
llama_new_context_with_model: KV self size = 160.00 MiB, K (f16): 80.00 MiB, V (f16): 80.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.20 MiB
llama_new_context_with_model: Metal compute buffer size = 105.00 MiB
llama_new_context_with_model: CPU compute buffer size = 6.01 MiB
llama_new_context_with_model: graph nodes = 1225
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 6 / 8 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampling:
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 0

Building a website can be done in 10 simple steps:
Step 1: Choose a site builder
There are many different websites that can be used as a site builder to create a website. Most of these websites can be used for free. Some of these websites include Weebly, Wix, and Weebly.
Step 2: Create your domain name
The domain name of your website will be the name that everyone can type into their browser to visit your website. You may choose to use your personal name or business name as the domain name of your website.
Step 3: Create your homepage
The homepage of your website will be the first page that visitors see when they visit your website. On your homepage you will want to include a welcome message, your name, and your website's name.
Step 4: Add your content
You will need to add your content to your website. This can include text, images, and videos. You can add your content to your website by creating different pages for each topic.
Step 5: Add your social media links
You will want to add your social media links to your website. This will allow visitors to easily share your website on social media.
Step 6: Add your contact information
You will want to add your contact information to your website. This can include your email address and phone number.
Step 7: Add your website's logo
You will want to add your website's logo to your website. This will help visitors to identify your website.
Step 8: Add your website's menu
You will want to add your website's menu to your website. This will allow visitors to navigate through your website.
Step 9: Test your website
You will want to test your website to make sure that everything is working correctly. You can test your website by visiting it on different browsers.
Step 10: Publish your website
You will want to publish your website so that it is available on the internet. You can publish your website by creating an account on the website that you

llama_print_timings: load time = 7502.60 ms
llama_print_timings: sample time = 11.12 ms / 400 runs ( 0.03 ms per token, 35964.75 tokens per second)
llama_print_timings: prompt eval time = 125.61 ms / 17 tokens ( 7.39 ms per token, 135.34 tokens per second)
llama_print_timings: eval time = 9550.21 ms / 399 runs ( 23.94 ms per token, 41.78 tokens per second)
llama_print_timings: total time = 9760.36 ms / 416 tokens
ggml_metal_free: deallocating
Log end
```
ngxson commented 1 month ago

I believe the problem comes from the model itself or from llama.cpp; it shouldn't be a problem with wllama.

In any case, I'll update to the latest upstream source code today to see if it's fixed.

flatsiedatsie commented 1 month ago

I continued testing, and it's happening with all the .gguf models now.

When I uncomment `model_settings['cache_type_k'] = 'q4_0';`, everything works again.

E.g.
https://huggingface.co/TheBloke/rocket-3B-GGUF/resolve/main/rocket-3b.Q5_K_M.gguf
https://huggingface.co/afrideva/Nous-Capybara-3B-V1.9-GGUF/resolve/main/nous-capybara-3b-v1.9.q5_k_m.gguf
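
For reference, this is roughly how the setting is wired up on my side. It's a simplified sketch, not my actual code: the wasm paths are placeholders, and I'm assuming wllama's loadModelFromUrl() accepts cache_type_k in its load config (check the wllama docs for the exact option name and shape).

```js
import { Wllama } from '@wllama/wllama';

// Placeholder wasm paths; the real paths depend on how the package is served.
const wllama = new Wllama({
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm':  './esm/multi-thread/wllama.wasm',
});

const model_settings = { n_ctx: 2048 };

// Toggling this line on/off is what changes whether a given model loads for me.
model_settings['cache_type_k'] = 'q4_0';

await wllama.loadModelFromUrl(
  'https://huggingface.co/TheBloke/rocket-3B-GGUF/resolve/main/rocket-3b.Q5_K_M.gguf',
  model_settings
);
```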

flatsiedatsie commented 1 month ago

I was wondering if it had to do with the .gguf models not being quantized in Q4 themselves.

With cache_type_k set to q4_0, the minuscule Qwen does load. https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_0.gguf

Hmm, NeuralReyna, which is chunked, also loads. It's also natively Q4.

flatsiedatsie commented 1 month ago

Did another test. Chunked a Q4 version of Phi 3 128K (2.18 GB) into 9 chunks, and it loads and runs OK with the Q4 cache type enabled.

ngxson commented 1 month ago

Yeah, I heard a while ago on llama.cpp that cache type Q4 does not play very nicely with Phi models. Since the quantized cache is an experimental thing in llama.cpp, we can expect it to fail on some models.

flatsiedatsie commented 1 month ago

Well, here it's the opposite :-D The Phi models do work.

Since the quantized cache is an experimental thing in llama.cpp, we can expect it to fail on some models.

Aha! I didn't know that. Thank you.

I'm going to test a bit more, and see if the issue can be solved by chunking those Q5 models. If not, then I'll create a setting to simply not use cache_type_k on those deviants.

I'll also try to confirm whether it has to do with models being quantized to Q3 / Q5 instead of Q4.

flatsiedatsie commented 1 month ago

Tested a Q4 quant of Rocket 3B, but that also crashed. So it doesn't seem related to whether the models themselves are Q4.

flatsiedatsie commented 1 month ago

I can only conclude that, as you say, for some models it just randomly doesn't work.

I'll create a variable in my code that enables or disables cache_type_k as needed.
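
Something along these lines. This is only a rough sketch of the idea; the model list, URLs, and flag name are made up for illustration.

```js
// Keep a per-model flag and only set cache_type_k when the model is known
// to tolerate the quantized K cache. Entries and URLs are illustrative.
const MODELS = {
  'phi-3-mini-q4':    { url: 'https://example.com/phi-3-mini.q4_k_m.gguf', allow_q4_k_cache: true  },
  'rocket-3b-q5_k_m': { url: 'https://example.com/rocket-3b.q5_k_m.gguf',  allow_q4_k_cache: false },
};

function buildModelSettings(modelId) {
  const model_settings = { n_ctx: 2048 };
  if (MODELS[modelId].allow_q4_k_cache) {
    model_settings['cache_type_k'] = 'q4_0';
  }
  return model_settings;
}

// Usage (with the same wllama instance as in the earlier sketch):
// await wllama.loadModelFromUrl(MODELS['phi-3-mini-q4'].url, buildModelSettings('phi-3-mini-q4'));
```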