oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Llama 3.1 GGUF incompatibility using latest release of llama.cpp and text-generation-webui. #6301

Open watchfoxie opened 1 month ago

watchfoxie commented 1 month ago

Describe the bug

I downloaded the Hugging Face "meta-llama/Meta-Llama-3.1-8B-Instruct" model to quantize it to Q8_0 with the latest llama.cpp, in order to stay up to date, improve efficiency, and avoid the shortcomings of older quantizations. However, the newly quantized model fails to load.

In case I had made a mistake somewhere in the process, I also tried a ready-made quant from an experienced publisher (bartowski); loading that model in the Web UI gives the same error. Older GGUF quants of "meta-llama/Meta-Llama-3.1-8B-Instruct" work fine.

Is there an existing issue for this?

Reproduction

  1. Clone a recent llama.cpp backend.
  2. Create a Conda env for quantization work and install the llama.cpp requirements.
  3. Download the latest release, currently B3499.
  4. Download the official Llama 3.1 Instruct model from the Meta organization using huggingface_hub.
  5. Convert HF to F16 GGUF: python convert_hf_to_gguf.py --outtype f16
  6. Choose Q8_0 quantization for optimal speed/quality token generation (my case): .\llama-quantize.exe --leave-output-tensor --output-tensor-type F16 --token-embedding-type F16 Q8_0 (a fully spelled-out command sequence is sketched after this list).
  7. Cut and paste the model into the following directory: text-generation-webui\models.
  8. Launch the Web UI with personalized command-line flags (my case): python server.py --flash-attn --tensorcores.
  9. Load the model.
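For reference, a fully spelled-out version of steps 4-6 might look like the following; the directory and file names here are illustrative, not the exact ones I used:

```powershell
# Step 4: download the official model (gated repo; requires huggingface-cli login)
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir .\Meta-Llama-3.1-8B-Instruct

# Step 5: convert the HF checkpoint to an F16 GGUF
python convert_hf_to_gguf.py .\Meta-Llama-3.1-8B-Instruct --outtype f16 --outfile .\Llama-3.1-8B-Instruct-F16.gguf

# Step 6: quantize to Q8_0, keeping the output and token-embedding tensors in F16
.\llama-quantize.exe --leave-output-tensor --output-tensor-type F16 --token-embedding-type F16 `
    .\Llama-3.1-8B-Instruct-F16.gguf .\Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0
```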

Screenshot


Logs

Command-line logs:
(ai) PS C:\Users\watchfoxie\text-generation-webui> python server.py --flash-attn --tensorcores
10:42:45-900395 INFO     Starting Text generation web UI

Running on local URL:  http://127.0.0.1:7860

10:43:58-310151 INFO     Loading "Llama-3.1-8B-Instr-B3490-broken.gguf"
10:43:58-740119 INFO     llama.cpp weights detected: "models\Llama-3.1-8B-Instr-B3490-broken.gguf"
llama_model_loader: loaded meta data with 28 key-value pairs and 292 tensors from models\Llama-3.1-8B-Instr-B3490-broken.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = 8c22764a7e3675c50d4c7c9a4edb474456022b16
llama_model_loader: - kv   3:                           general.finetune str              = 8c22764a7e3675c50d4c7c9a4edb474456022b16
llama_model_loader: - kv   4:                         general.size_label str              = 8.0B
llama_model_loader: - kv   5:                            general.license str              = llama3.1
llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   8:                          llama.block_count u32              = 32
llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
llama_model_loader: - kv  10:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  16:                          general.file_type u32              = 7
llama_model_loader: - kv  17:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  18:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  24:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  25:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  27:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q8_0:  225 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 8.41 GiB (8.99 BPW)
llm_load_print_meta: general.name     = 8c22764a7e3675c50d4c7c9a4edb474456022b16
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4070 Laptop GPU, compute capability 8.9, VMM: yes
llm_load_tensors: ggml ctx size =    0.27 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
10:43:59-878110 ERROR    Failed to load the model.
Traceback (most recent call last):
  File "C:\Users\Marian\text-generation-webui\modules\ui_model_menu.py", line 231, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(selected_model, loader)
                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Marian\text-generation-webui\modules\models.py", line 93, in load_model
    output = load_func_map[loader](model_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Marian\text-generation-webui\modules\models.py", line 274, in llamacpp_loader
    model, tokenizer = LlamaCppModel.from_pretrained(model_file)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Marian\text-generation-webui\modules\llamacpp_model.py", line 85, in from_pretrained
    result.model = Llama(**params)
                   ^^^^^^^^^^^^^^^
  File "C:\Users\Marian\anaconda3\envs\ai\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 372, in __init__
    _LlamaModel(
  File "C:\Users\Marian\anaconda3\envs\ai\Lib\site-packages\llama_cpp_cuda_tensorcores\_internals.py", line 55, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models\Llama-3.1-8B-Instr-B3490-broken.gguf

Exception ignored in: <function Llama.__del__ at 0x000001E5BFC9A480>
Traceback (most recent call last):
  File "C:\Users\Marian\anaconda3\envs\ai\Lib\site-packages\llama_cpp_cuda_tensorcores\llama.py", line 2089, in __del__
    if self._lora_adapter is not None:
       ^^^^^^^^^^^^^^^^^^
AttributeError: 'Llama' object has no attribute '_lora_adapter'
Exception ignored in: <function LlamaCppModel.__del__ at 0x000001E5BE899A80>
Traceback (most recent call last):
  File "C:\Users\Marian\text-generation-webui\modules\llamacpp_model.py", line 33, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

System Info

Operating system: Windows 11 Pro.
GPU: NVIDIA RTX 4070 mobile (8 GB VRAM)
PrometheusDante commented 1 month ago

Running into the same issue

Star-98 commented 1 month ago

I think your VRAM is too small. Llama 3.1 has a 128k (131072) context length, and at Q8_0 that is barely enough even with 24 GB of VRAM; try reducing the n_ctx value.
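For example, with the launch flags used above, that could look something like this (assuming the webui exposes n_ctx as a llama.cpp loader flag; the value is only an example):

```powershell
# reduce the prompt context from the model's 131072 default to something that fits in VRAM
python server.py --flash-attn --tensorcores --n_ctx 8192
```

The same value can also be lowered in the Model tab before loading.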

PrometheusDante commented 1 month ago

I have 16 GB of VRAM and ran into the same error with the context length all the way down to 512 and only one layer on the GPU; it made no difference.

Star-98 commented 1 month ago

Can you show the entire log? I'm no expert at this, but others here are, and some of them might be able to help you.

PrometheusDante commented 1 month ago

I already deleted the model and got a different one, but I believe it is related to this llama.cpp bug, https://github.com/ollama/ollama/issues/6048, which should be resolved by now.

nichjamesr commented 1 month ago

Were you able to get it to work using a different model? I've tried a few different GGUF versions and the result is the same.

PrometheusDante commented 1 month ago

@nichjamesr Sorry, I forgot to add that I got a nice abliterated model through the ollama page to use directly from the command line. It works nicely, but unfortunately it's no solution for this issue here. Come to think of it, I should have asked Llama about that.

PrometheusDante commented 1 month ago

Maybe the models made during the buggy llama.cpp version need to be patched themselves as well to be compatible again? Did you try looking for some very new ones, just for testing purposes? I'm not sure when exactly this was fixed in llama.cpp, but the newer the model, the more likely it is to work, if my guess holds any truth.

nichjamesr commented 1 month ago

@PrometheusDante I'm not sure why, but it seems like for me lowering the context actually did the trick. I'm on a 3060 Ti (8 GB). The same model that wouldn't load at 128k loads fine if I set it to 64k or below.

watchfoxie commented 1 month ago

I always set a standard context length of 8096, so that is not the cause. As for model settings and parameters, I always check them carefully before loading.

So, I found the source of the issue: it's the Python script "convert_hf_to_gguf.py". One of these commit updates broke compatibility: #8627 or #8676. A temporary solution is to use an older llama.cpp backend to create the FP16 model, or to take an already quantized one from HF (for example the GAIANET one). After that, you can use a recent B3xxx release to obtain the desired quant without load issues in the Web UI.
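In case it helps, here is a rough sketch of that two-stage workaround; the old checkout tag, directories, and file names are only examples, not the exact ones I used:

```powershell
# Stage 1: convert HF -> F16 GGUF with an older llama.cpp checkout
# (any build from before the breaking conversion-script changes; b3400 is only an example tag)
git clone https://github.com/ggerganov/llama.cpp llama.cpp-old
cd llama.cpp-old
git checkout b3400   # pick a tag that predates the commits mentioned above
python convert_hf_to_gguf.py ..\Meta-Llama-3.1-8B-Instruct --outtype f16 --outfile ..\Llama-3.1-8B-Instruct-F16.gguf

# Stage 2: quantize the F16 file with the current release binaries (B3499)
cd ..
.\llama-quantize.exe --leave-output-tensor --output-tensor-type F16 --token-embedding-type F16 `
    .\Llama-3.1-8B-Instruct-F16.gguf .\Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0

# Then move the Q8_0 file into text-generation-webui\models and load it as usual
```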

Magenta-Flutist commented 1 month ago

> I always set a standard context length of 8096, so that is not the cause. As for model settings and parameters, I always check them carefully before loading.
>
> So, I found the source of the issue: it's the Python script "convert_hf_to_gguf.py". One of these commit updates broke compatibility: #8627 or #8676. A temporary solution is to use an older llama.cpp backend to create the FP16 model, or to take an already quantized one from HF (for example the GAIANET one). After that, you can use a recent B3xxx release to obtain the desired quant without load issues in the Web UI.

Hi, may I ask you to explain how to do this in detail?