guinmoon / LLMFarm

Run llama and other large language models offline on iOS and macOS using the GGML library.
MIT License

The development build will either crash or produce incorrect output content. #52

Closed: luionTW closed this issue 2 months ago

luionTW commented 3 months ago

Hi Guinmoon,

I am trying to build the LLMFarm project with Xcode, but the app crashes when I load most models, and the few models that do load produce incorrect output. Could you please take a look?

Here is my environment:

Device: iPhone 15 Pro, iOS 17.2
Models:
- tinyllama-1.1b (crashes immediately)
- orca-mini-3b (produces incorrect output)
- phi-2 (crashes immediately)

I didn't modify any code; I just triggered the build and ran the project. I've confirmed that the entitlements for memory and VM are already added. I also tried several versions, like 0.9 and the latest version, but I still get the same results.

Thank you.

luionTW commented 3 months ago

And I also confirmed that I've used these models with the correct settings template.

guinmoon commented 3 months ago

Well, if you run LLMFarm from Xcode, you should be able to see where the error occurs. I would be very grateful if you could send me a more detailed description of the errors.

luionTW commented 3 months ago

Some of the crashes don't show any error, but I captured the log from one of them. Example: tinyllama-1.1b, with Metal on / MLock off / Mmap on.

AI init
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from /var/mobile/Containers/Data/Application/03B8B998-F4AA-4BF3-9E0E-A82E061A1CC1/Documents/models/tinyllama-1.1b-chat-v1.0.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: str = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.block_count u32 = 22
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 11: general.file_type u32 = 7
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<...
llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 16: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type q8_0: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 1.09 GiB (8.50 BPW)
llm_load_print_meta: = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token = 1 ''
llm_load_print_meta: EOS token = 2 ''
llm_load_print_meta: UNK token = 0 ''
llm_load_print_meta: PAD token = 2 ''
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.15 MiB
ggml_backend_metal_buffer_from_ptr: allocated buffer, size = 1114.92 MiB, ( 1115.00 / 5461.34)
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors: CPU buffer size = 66.41 MiB
llm_load_tensors: Metal buffer size = 1114.92 MiB
llama_new_context_with_model: n_ctx = 1024
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A17 Pro GPU
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/var/containers/Bundle/Application/13501DF2-09F1-454E-B12B-62CE7A418F5D/'
ggml_metal_init: GPU name: Apple A17 Pro GPU
ggml_metal_init: GPU family: MTLGPUFamilyApple9 (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5726.63 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 22.00 MiB, ( 1139.88 / 5461.34)
llama_kv_cache_init: Metal KV buffer size = 22.00 MiB
llama_new_context_with_model: KV self size = 22.00 MiB, K (f16): 11.00 MiB, V (f16): 11.00 MiB
llama_new_context_with_model: CPU input buffer size = 6.01 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 82.02 MiB, ( 1221.89 / 5461.34)
llama_new_context_with_model: Metal compute buffer size = 82.00 MiB
llama_new_context_with_model: CPU compute buffer size = 4.00 MiB
llama_new_context_with_model: graph splits (measure): 3
%s: seed = %d 0
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
Logits inited.
ModelSampleParams(n_batch: 512, temp: 0.9, top_k: 40, top_p: 0.95, tfs_z: 1.0, typical_p: 1.0, repeat_penalty: 1.1, repeat_last_n: 64, frequence_penalty: 0.0, presence_penalty: 0.0, mirostat: 0, mirostat_tau: 5.0, mirostat_eta: 5.0, penalize_nl: true)
ModelAndContextParams(model_inference: llmfarm_core.ModelInference.LLama_gguf, context: 1024, parts: -1, seed: 4294967295, n_threads: 6, lora_adapters: [], promptFormat: llmfarm_core.ModelPromptStyle.Custom, custom_prompt_format: "<|user|>{prompt}\n<|assistant|>", system_prompt: "You are a story writing assistant.", f16Kv: true, logitsAll: false, vocabOnly: false, useMlock: false, useMMap: true, embedding: false, processorsConunt: 6, use_metal: true, grammar_path: nil, add_bos_token: false, add_eos_token: false, parse_special_tokens: false, warm_prompt: "\n\n\n", reverse_prompt: [], clip_model: nil)
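
Editor's note: the memory figures in this log can be cross-checked with a little arithmetic. The sketch below is not part of LLMFarm or llama.cpp; the function names are made up for illustration, and every number is copied from the log above. It assumes the KV cache stores one f16 value (2 bytes) per layer, per context position, per K or V embedding element, which matches the "K (f16): 11.00 MiB" line. It also assumes the second number in pairs such as "( 1221.89 / 5461.34)" is recommendedMaxWorkingSetSize expressed in MiB.

```python
# Sanity-check the memory figures reported in the tinyllama log.
# All constants are copied from the log; the helper names are hypothetical.

MIB = 1024 * 1024

def kv_buffer_mib(n_layer: int, n_ctx: int, n_embd_gqa: int, bytes_per_elem: int = 2) -> float:
    """Size of one KV buffer (K or V) in MiB, assuming f16 storage by default."""
    return n_layer * n_ctx * n_embd_gqa * bytes_per_elem / MIB

def mb_to_mib(megabytes: float) -> float:
    """Convert decimal megabytes (10^6 bytes) to binary mebibytes (2^20 bytes)."""
    return megabytes * 1_000_000 / MIB

# n_layer = 22, n_ctx = 1024, n_embd_k_gqa = n_embd_v_gqa = 256
k_mib = kv_buffer_mib(22, 1024, 256)
v_mib = kv_buffer_mib(22, 1024, 256)
kv_total_mib = k_mib + v_mib

# recommendedMaxWorkingSetSize = 5726.63 MB
budget_mib = mb_to_mib(5726.63)

# Last cumulative Metal allocation reported in the log: 1221.89 MiB
used_fraction = 1221.89 / budget_mib

print(f"K = {k_mib:.2f} MiB, V = {v_mib:.2f} MiB, total = {kv_total_mib:.2f} MiB")
print(f"Metal budget = {budget_mib:.2f} MiB, used = {used_fraction:.1%}")
```

Under these assumptions the computed KV size agrees with the 22.00 MiB the log reports, and the reported Metal allocations stay well below the working-set limit. That suggests, but does not prove, that this particular crash is not a simple out-of-memory condition on the GPU side.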

luionTW commented 3 months ago

And here is the incorrect content example: Orca-mini-3b on iPhone 15 Pro Max Simulator

Simulator Screenshot - iPhone 15 Pro Max - 2024-03-21 at 11 46 47

guinmoon commented 3 months ago

Is this incorrect output in version 1.0.1 or earlier?

guinmoon commented 3 months ago

Output from versions before 1.0.1 may be very different due to changes in llama.cpp.

guinmoon commented 3 months ago

Metal does not work in the simulator, and it has been disabled there since version 1.0.0.

luionTW commented 3 months ago

The incorrect output is from earlier versions (0.9.0 and 1.0.0). I will try it out with 1.0.1. Thanks for the information.

luionTW commented 3 months ago

Hi @guinmoon,

I've updated to version 1.0.1, but I'm still getting incorrect responses. Could you please help me understand why, or suggest any settings I need to adjust? Thank you.

iPhone 15 Pro Max Simulator:

Phi2: Simulator Screenshot - iPhone 15 Pro Max - 2024-03-21 at 15 32 44

Orca-mini-3b: Simulator Screenshot - iPhone 15 Pro Max - 2024-03-21 at 15 33 47

guinmoon commented 3 months ago

I use this template:

[System](You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.)
Instruct: {prompt}
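
Editor's note: both prompt formats in this thread (the `custom_prompt_format` from the log and the template above) contain a `{prompt}` placeholder that stands for the user's message. The sketch below shows only that substitution step, to make it easier to compare what each format sends to the model. `render_prompt` is a hypothetical helper, not an LLMFarm function, and how LLMFarm itself interprets the `[System](...)` prefix is not shown in this thread, so the sketch leaves that part of the text untouched.

```python
# Illustrative only: expand the {prompt} placeholder in a prompt template.
# render_prompt is a hypothetical helper, not part of LLMFarm.

def render_prompt(template: str, user_prompt: str) -> str:
    """Replace every {prompt} placeholder with the user's message.

    str.replace is used instead of str.format so that any other braces
    in a template are left alone rather than raising an error.
    """
    return template.replace("{prompt}", user_prompt)

# custom_prompt_format taken from the ModelAndContextParams dump in the log
TINYLLAMA_FORMAT = "<|user|>{prompt}\n<|assistant|>"

# Template posted by guinmoon
POSTED_TEMPLATE = (
    "[System](You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully as possible.)\n"
    "Instruct: {prompt}"
)

question = "Tell me a short story."
print(render_prompt(TINYLLAMA_FORMAT, question))
print("---")
print(render_prompt(POSTED_TEMPLATE, question))
```

Printing the rendered text for each model makes format mismatches easier to spot. For example, a model tuned on an "Instruct:" style prompt may behave differently when it receives `<|user|>` markers instead.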


luionTW commented 3 months ago

Thanks for the template. I will try it out.