oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Llama 3.1 GGUF incompatibility using latest release of llama.cpp and text-generation-webui. #6301

Open watchfoxie opened 1 month ago

watchfoxie commented 1 month ago

Describe the bug

I downloaded the Hugging Face "meta-llama/Meta-Llama-3.1-8B-Instruct" model to quantize it to Q8_0 with the latest llama.cpp, in order to stay up to date, improve efficiency, and avoid the shortcomings of older quantizations. However, the newly quantized model fails to load.

In case I had made a mistake somewhere in the process, I also tried a ready-made quant from an experienced publisher (bartowski); loading that model in the Web UI gives the same error. Older GGUF quants of "meta-llama/Meta-Llama-3.1-8B-Instruct" work fine.

Is there an existing issue for this?

Reproduction

  1. Clone a recent llama.cpp backend.
  2. Create a Conda env for quantization work and install the llama.cpp requirements.
  3. Download the latest release, currently B3499.
  4. Download the official Llama 3.1 Instruct model from the Meta organization using huggingface_hub.
  5. Convert HF to F16 GGUF: python convert_hf_to_gguf.py --outtype f16
  6. Choose Q8_0 quantization for optimal speed/quality token generation (my case): .\llama-quantize.exe --leave-output-tensor --output-tensor-type F16 --token-embedding-type F16 Q8_0 (a fully spelled-out command sequence is sketched after this list).
  7. Cut and paste the model into the following directory: text-generation-webui\models.
  8. Launch the Web UI with personalized command-line flags (my case): python server.py --flash-attn --tensorcores.
  9. Load the model.
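For reference, a fully spelled-out version of steps 4-6 might look like the following; the directory and file names here are illustrative, not the exact ones I used:

```powershell
# Step 4: download the official model (gated repo; requires huggingface-cli login)
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir .\Meta-Llama-3.1-8B-Instruct

# Step 5: convert the HF checkpoint to an F16 GGUF
python convert_hf_to_gguf.py .\Meta-Llama-3.1-8B-Instruct --outtype f16 --outfile .\Llama-3.1-8B-Instruct-F16.gguf

# Step 6: quantize to Q8_0, keeping the output and token-embedding tensors in F16
.\llama-quantize.exe --leave-output-tensor --output-tensor-type F16 --token-embedding-type F16 `
    .\Llama-3.1-8B-Instruct-F16.gguf .\Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0
```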

Screenshot


Logs

Command-line logs:
(ai) PS C:\Users\watchfoxie\text-generation-webui> python server.py --flash-attn --tensorcores
10:42:45-900395 INFO     Starting Text generation web UI

Running on local URL:  http://127.0.0.1:7860

10:43:58-310151 INFO     Loading "Llama-3.1-8B-Instr-B3490-broken.gguf"
10:43:58-740119 INFO     llama.cpp weights detected: "models\Llama-3.1-8B-Instr-B3490-broken.gguf"
llama_model_loader: loaded meta data with 28 key-value pairs and 292 tensors from models\Llama-3.1-8B-Instr-B3490-broken.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = 8c22764a7e3675c50d4c7c9a4edb474456022b16
llama_model_loader: - kv   3:                           general.finetune str              = 8c22764a7e3675c50d4c7c9a4edb474456022b16
llama_model_loader: - kv   4:                         general.size_label str              = 8.0B
llama_model_loader: - kv   5:                            general.license str              = llama3.1
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 32
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                          general.file_type u32              = 7
llama_model_loader: - kv  17:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  18:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q8_0:  225 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 8.41 GiB (8.99 BPW)
llm_load_print_meta: general.name     = 8c22764a7e3675c50d4c7c9a4edb474456022b16
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
10:43:59-878110 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "C:\Users\Marian\text-generation-webui\modules\ui_model_menu.py", line 231, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Marian\text-generation-webui\modules\models.py", line 93, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Marian\text-generation-webui\modules\models.py", line 274, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Marian\text-generation-webui\modules\llamacpp_model.py", line 85, in from_pretrained
    result.model = Llama(**params)
                   ^^^^^^^^^^^^^^^
  File "C:\Users\Marian\anaconda3\envs\ai\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 372, in __init__
    _LlamaModel(
  File "C:\Users\Marian\anaconda3\envs\ai\Lib\site-packages\llama_cpp_cuda_tensorcores\_internals.py", line 55, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models\Llama-3.1-8B-Instr-B3490-broken.gguf

Exception ignored in: <function Llama.__del__ at 0x000001E5BFC9A480>
Traceback (most recent call last):
  File "C:\Users\Marian\anaconda3\envs\ai\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 2089, in __del__
    if self._lora_adapter is not None:
       ^^^^^^^^^^^^^^^^^^
AttributeError: 'Llama' object has no attribute '_lora_adapter'
Exception ignored in: <function LlamaCppModel.__del__ at 0x000001E5BE899A80>
Traceback (most recent call last):
  File "C:\Users\Marian\text-generation-webui\modules\llamacpp_model.py", line 33, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

System Info

Operating system: Windows 11 Pro.
GPU: NVIDIA RTX 4070 mobile (8 GB VRAM)
PrometheusDante commented 1 month ago

Running into the same issue

Star-98 commented 1 month ago

I think your VRAM is too small. Llama 3.1 has a 128k (131072) context length, and at Q8_0 that is barely enough even with 24 GB of VRAM; try reducing the n_ctx value.
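For example, with the launch flags used above, that could look something like this (assuming the webui exposes n_ctx as a llama.cpp loader flag; the value is only an example):

```powershell
# reduce the prompt context from the model's 131072 default to something that fits in VRAM
python server.py --flash-attn --tensorcores --n_ctx 8192
```

The same value can also be lowered in the Model tab before loading.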

PrometheusDante commented 1 month ago

I have 16 GB of VRAM and ran into the same error with the context length all the way down to 512 and only one layer on the GPU; it made no difference.

Star-98 commented 1 month ago

Can you show the entire log? I'm no expert at this, but others here are, and some of them might be able to help you.

PrometheusDante commented 1 month ago

I already deleted the model and got a different one, but I believe it is related to this llama.cpp bug, https://github.com/ollama/ollama/issues/6048, which should be resolved by now.

nichjamesr commented 1 month ago

Were you able to get it to work using a different model? I've tried a few different GGUF versions and the result is the same.

PrometheusDante commented 1 month ago

@nichjamesr Sorry, I forgot to add that I got a nice abliterated model through the ollama page to use directly from the command line. It works nicely, but unfortunately it's no solution for this issue here. Come to think of it, I should have asked Llama about that.

PrometheusDante commented 1 month ago

Maybe the models made during the buggy llama.cpp version need to be patched themselves as well to be compatible again? Did you try looking for some very new ones, just for testing purposes? I'm not sure when exactly this was fixed in llama.cpp, but the newer the model, the more likely it is to work, if my guess holds any truth.

nichjamesr commented 1 month ago

@PrometheusDante I'm not sure why, but it seems like for me lowering the context actually did the trick. I'm on a 3060 Ti (8 GB). The same model that wouldn't load at 128k loads fine if I set it to 64k or below.

watchfoxie commented 1 month ago

I always set a standard context length of 8096, so that is not the cause. As for model settings and parameters, I always check them carefully before loading.

So, I found the source of the issue: it's the Python script "convert_hf_to_gguf.py". One of these commit updates broke compatibility: #8627 or #8676. A temporary solution is to use an older llama.cpp backend to create the FP16 model, or to take an already quantized one from HF (for example the GAIANET one). After that, you can use a recent B3xxx release to obtain the desired quant without load issues in the Web UI.
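In case it helps, here is a rough sketch of that two-stage workaround; the old checkout tag, directories, and file names are only examples, not the exact ones I used:

```powershell
# Stage 1: convert HF -> F16 GGUF with an older llama.cpp checkout
# (any build from before the breaking conversion-script changes; b3400 is only an example tag)
git clone https://github.com/ggerganov/llama.cpp llama.cpp-old
cd llama.cpp-old
git checkout b3400   # pick a tag that predates the commits mentioned above
python convert_hf_to_gguf.py ..\Meta-Llama-3.1-8B-Instruct --outtype f16 --outfile ..\Llama-3.1-8B-Instruct-F16.gguf

# Stage 2: quantize the F16 file with the current release binaries (B3499)
cd ..
.\llama-quantize.exe --leave-output-tensor --output-tensor-type F16 --token-embedding-type F16 `
    .\Llama-3.1-8B-Instruct-F16.gguf .\Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0

# Then move the Q8_0 file into text-generation-webui\models and load it as usual
```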

Magenta-Flutist commented 1 month ago

> I always set a standard context length of 8096, so that is not the cause. As for model settings and parameters, I always check them carefully before loading.
>
> So, I found the source of the issue: it's the Python script "convert_hf_to_gguf.py". One of these commit updates broke compatibility: #8627 or #8676. A temporary solution is to use an older llama.cpp backend to create the FP16 model, or to take an already quantized one from HF (for example the GAIANET one). After that, you can use a recent B3xxx release to obtain the desired quant without load issues in the Web UI.

Hi, may I ask you to explain how to do this in detail?