brknkfr closed this issue 2 days ago.
Maybe related to https://github.com/mudler/LocalAI/issues/4170
Debug log:
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf ContextSize:8192 Seed:762348231 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:30 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/build/models/dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 LoadFormat: MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false ModelPath:/build/models LoraAdapters:[] LoraScales:[]}
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr ggml_cuda_init: found 2 CUDA devices:
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr Device 1: Tesla P40, compute capability 6.1, VMM: yes
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 20812 MiB free
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_load_model_from_file: using device CUDA1 (Tesla P40) - 24290 MiB free
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: loaded meta data with 24 key-value pairs and 995 tensors from /build/models/dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf (version GGUF V3 (latest))
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 0: general.architecture str = llama
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 1: general.name str = cognitivecomputations_dolphin-2.7-mix...
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 2: llama.context_length u32 = 32768
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 4: llama.block_count u32 = 32
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 9: llama.expert_count u32 = 8
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 10: llama.expert_used_count u32 = 2
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 11: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 12: llama.rope.freq_base f32 = 1000000.000000
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 13: general.file_type u32 = 17
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,32002] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,32002] = [0.000000, 0.000000, 0.000000, 0.0000...
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,32002] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 1
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 32000
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 22: tokenizer.chat_template str = {% if not add_generation_prompt is de...
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - kv 23: general.quantization_version u32 = 2
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - type f32: 65 tensors
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - type f16: 32 tensors
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - type q8_0: 64 tensors
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - type q5_K: 833 tensors
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_loader: - type q6_K: 1 tensors
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_vocab: control token: 2 '</s>' is not marked as EOG
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_vocab: control token: 1 '<s>' is not marked as EOG
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_vocab: special tokens cache size = 5
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_vocab: token to piece cache size = 0.1637 MB
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: format = GGUF V3 (latest)
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: arch = llama
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: vocab type = SPM
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_vocab = 32002
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_merges = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: vocab_only = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_ctx_train = 32768
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_embd = 4096
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_layer = 32
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_head = 32
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_head_kv = 8
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_rot = 128
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_swa = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_embd_head_k = 128
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_embd_head_v = 128
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_gqa = 4
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_embd_k_gqa = 1024
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_embd_v_gqa = 1024
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: f_norm_eps = 0.0e+00
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: f_logit_scale = 0.0e+00
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_ff = 14336
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_expert = 8
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_expert_used = 2
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: causal attn = 1
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: pooling type = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: rope type = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: rope scaling = linear
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: freq_base_train = 1000000.0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: freq_scale_train = 1
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: n_ctx_orig_yarn = 32768
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: rope_finetuned = unknown
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: ssm_d_conv = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: ssm_d_inner = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: ssm_d_state = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: ssm_dt_rank = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: ssm_dt_b_c_rms = 0
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: model type = 8x7B
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: model ftype = Q5_K - Medium
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: model params = 46.70 B
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: model size = 30.02 GiB (5.52 BPW)
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: general.name = cognitivecomputations_dolphin-2.7-mixtral-8x7b
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: BOS token = 1 '<s>'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: EOS token = 32000 '<|im_end|>'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: EOT token = 32000 '<|im_end|>'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: UNK token = 0 '<unk>'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: LF token = 13 '<0x0A>'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: EOG token = 32000 '<|im_end|>'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llm_load_print_meta: max token length = 48
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_model_load: error loading model: missing tensor 'blk.0.ffn_down_exps.weight'
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr llama_load_model_from_file: failed to load model
Nov 26 20:52:05 server local-ai[800377]: 7:52PM DBG GRPC(server-127.0.0.1:36099): stderr common_init_from_params: failed to load model '/build/models/dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf'
What does "llama_model_load: error loading model: missing tensor 'blk.0.ffn_down_exps.weight'" mean?
Ah, most probably related to https://github.com/ggerganov/llama.cpp/issues/10244
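In practice this error usually means the GGUF file predates a llama.cpp format change for MoE models: newer llama.cpp expects all experts stacked into a single tensor per layer (e.g. blk.0.ffn_down_exps.weight), while older Mixtral conversions store one tensor per expert (blk.0.ffn_down.0.weight, blk.0.ffn_down.1.weight, ...). Support for loading the old per-expert layout was eventually removed, so older files fail with "missing tensor". A quick way to check which layout a file uses, as a sketch with the gguf package from llama.cpp's gguf-py (pip install gguf):

```python
# Sketch: print the layer-0 FFN tensor names of a GGUF file to see whether
# it uses the merged MoE layout (blk.0.ffn_*_exps.weight) or the old
# per-expert layout (blk.0.ffn_*.N.weight). Requires: pip install gguf
import sys

from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])  # path to the .gguf file
for tensor in reader.tensors:
    if tensor.name.startswith("blk.0.ffn"):
        print(tensor.name)
```

If this prints blk.0.ffn_down.0.weight and friends instead of blk.0.ffn_down_exps.weight, the file needs to be re-converted with a current llama.cpp or replaced with a newer quantization.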
... fixed it by using a newer gguf from https://huggingface.co/mradermacher/dolphin-2.7-mixtral-8x7b-GGUF
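For reference, a minimal sketch of fetching the replacement file with huggingface_hub; the exact filename in that repository is an assumption and should be checked against its file list:

```python
# Sketch: download a newer GGUF quantization into the LocalAI models
# directory. The filename below is an assumed example.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="mradermacher/dolphin-2.7-mixtral-8x7b-GGUF",
    filename="dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf",  # assumed filename
    local_dir="/build/models",
)
print(path)
```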
LocalAI version: 2.23.0, podman installation, tried with latest-gpu-nvidia-cuda-12 and latest-aio-gpu-nvidia-cuda-12

Environment, CPU architecture, OS, and Version: Standard Debian 12 (96 GB memory) with two Nvidia Tesla P40 GPUs (24 GB memory each)

Describe the bug: LocalAI fails to load custom .gguf files, in this case dolphin-2.5-mixtral-8x7b.Q5_K_M.gguf. The following error message appears with all backends:

ERR [llama-cpp] Failed loading model, trying with fallback 'llama-cpp-fallback', error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =

It loads and works without issue on version 2.22.1.

To Reproduce: Update the LocalAI images to version 2.23.0 and load the "custom" model dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf.

Expected behavior: Loading should work.

Logs: Multiple lines of the following message, for all backends:

ERR [llama-cpp] Failed loading model, trying with fallback 'llama-cpp-fallback', error: failed to load model with internal loader: could not load model: rpc error: code = Canceled desc =
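For completeness, a minimal request that triggers the model load described under "To Reproduce", assuming LocalAI's OpenAI-compatible API on the default port 8080 and the model addressed by its file name:

```python
# Sketch: ask LocalAI to load and run the affected model; the host, port,
# and model name are assumptions about this particular setup.
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "dolphin-2.7-mixtral-8x7b.Q5_K_M.gguf",
        "messages": [{"role": "user", "content": "Hello"}],
    },
)
print(resp.status_code, resp.text)
```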