oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

RuntimeWarning: Detected duplicate leading "<|begin_of_text|>" in prompt #6225

Open Kaszebe opened 2 months ago

Kaszebe commented 2 months ago

Describe the bug

Whenever I load certain GGUFs, I get the warning above in the terminal. I have seen it happen with Bartowski's Q8 quant of Llama 3 70B Instruct (3-part file) and with llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf.

Is there an existing issue for this?

Reproduction

I cannot recall the URL of the quant page on Hugging Face, but the files are llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf and llama-3-70B-Instruct-abliterated-Q6_K-00002-of-00002.gguf.

Load the model in Oobabooga and send the LLM a message. The following warning appears in the terminal:

/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda_tensorcores/llama.py:1054: RuntimeWarning: Detected duplicate leading "<|begin_of_text|>" in prompt, this will likely reduce response quality, consider removing it...
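
For what it's worth, here is a minimal sketch of what I think is going on (my assumption: the llama-3 chat template already prepends bos_token as text, and llama-cpp-python's tokenizer adds another BOS on top because add_bos defaults to True). The model path and test prompt are just placeholders:

```python
# Minimal sketch, not the webui code path. Assumption: the rendered llama-3
# prompt already starts with a textual <|begin_of_text|> (the chat template
# prepends bos_token), and Llama.tokenize() adds BOS again by default, so the
# token stream begins with two 128000 tokens and the warning fires.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf",  # placeholder path
    vocab_only=True,   # only the tokenizer is needed for this check
    verbose=False,
)

# Prompt as rendered by the chat template (BOS already present as text):
prompt = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"

tokens = llm.tokenize(prompt.encode("utf-8"), add_bos=True, special=True)
print(tokens[:3])  # expect [128000, 128000, ...] -> the duplicate BOS the warning complains about
```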

Screenshot

No response

Logs

20:30:36-728949 INFO     Starting Text generation web UI                        

Running on local URL:  http://127.0.0.1:7860

20:31:03-885646 INFO     Loading "llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf"
20:31:04-126349 INFO     llama.cpp weights detected: "models/llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf"
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 29 key-value pairs and 723 tensors from models/llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-3-70B-Instruct-abliterated
llama_model_loader: - kv   2:                          llama.block_count u32              = 80
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/llama-3-70B-Instruct-ablitera...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/calibration_data.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 189
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 2
llama_model_loader: - kv  28:                        split.tensors.count i32              = 723
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q6_K:  562 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 53.91 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = llama-3-70B-Instruct-abliterated
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
  Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    1.69 MiB
llm_load_tensors: offloading 75 repeating layers to GPU
llm_load_tensors: offloaded 75/81 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  4991.12 MiB
llm_load_tensors:      CUDA0 buffer size = 12719.31 MiB
llm_load_tensors:      CUDA1 buffer size = 12719.31 MiB
llm_load_tensors:      CUDA2 buffer size = 12719.31 MiB
llm_load_tensors:      CUDA3 buffer size = 12049.88 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 3840
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    75.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   285.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   285.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   285.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   270.00 MiB
llama_new_context_with_model: KV self size  = 1200.00 MiB, K (f16):  600.00 MiB, V (f16):  600.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1088.45 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   147.75 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   147.75 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   163.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    23.51 MiB
llama_new_context_with_model: graph nodes  = 2247
llama_new_context_with_model: graph splits = 62
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
Model metadata: {'split.count': '2', 'split.no': '0', 'quantize.imatrix.entries_count': '560', 'quantize.imatrix.dataset': '/training_data/calibration_data.txt', 'quantize.imatrix.chunks_count': '189', 'quantize.imatrix.file': '/models/llama-3-70B-Instruct-abliterated-GGUF/llama-3-70B-Instruct-abliterated.imatrix', 'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128001', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'split.tensors.count': '723', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'llama-3-70B-Instruct-abliterated', 'llama.embedding_length': '8192', 'llama.feed_forward_length': '28672', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '64', 'llama.block_count': '80', 'llama.attention.head_count_kv': '8', 'general.file_type': '18', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Available chat formats from metadata: chat_template.default
Guessed chat format: llama-3
20:32:02-637567 INFO     Loaded "llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf" in 58.75 seconds.                                       
20:32:02-638457 INFO     LOADER: "llama.cpp"                                                                                                        
20:32:02-638996 INFO     TRUNCATION LENGTH: 3840                                                                                                    
20:32:02-639512 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                                                              
20:33:10-196296 INFO     "My Preset" preset:                                                                                                        
{   'temperature': 0.11,
    'min_p': 0.05,
    'repetition_penalty': 1.05,
    'frequency_penalty': 0.1}
20:33:16-795472 INFO     Saved "/home/kot/text-generation-webui/presets/My Preset.yaml".                                                          
20:33:36-874454 INFO     Deleted "logs/chat/Copywriter/20240711-19-49-28.json".                                                                     
20:33:41-280313 ERROR    Failed to build the chat prompt. The input is too long for the available context length.                                   

                         Truncation length: 3840                                                                                                    
                         max_new_tokens: 2950 (is it too high?)                                                                                     
                         Available context length: 890                                                                                              

Traceback (most recent call last):
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/queueing.py", line 566, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/route_utils.py", line 261, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1786, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/blocks.py", line 1350, in call_function
    prediction = await utils.async_iteration(iterator)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 583, in async_iteration
    return await iterator.__anext__()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 576, in __anext__
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 2177, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 859, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 559, in run_sync_iterator_async
    return next(iterator)
           ^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/gradio/utils.py", line 742, in gen_wrapper
    response = next(iterator)
               ^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/modules/chat.py", line 424, in generate_chat_reply_wrapper
    for i, history in enumerate(generate_chat_reply(text, state, regenerate, _continue, loading_message=True, for_ui=True)):
  File "/home/kot/text-generation-webui/modules/chat.py", line 392, in generate_chat_reply
    for history in chatbot_wrapper(text, state, regenerate=regenerate, _continue=_continue, loading_message=loading_message, for_ui=for_ui):
  File "/home/kot/text-generation-webui/modules/chat.py", line 336, in chatbot_wrapper
    prompt = generate_chat_prompt(text, state, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/kot/text-generation-webui/modules/chat.py", line 224, in generate_chat_prompt
    raise ValueError
ValueError
20:35:40-523627 INFO     Loading "llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf"                                                        
20:35:40-756443 INFO     llama.cpp weights detected: "models/llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf"                             
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 29 key-value pairs and 723 tensors from models/llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-3-70B-Instruct-abliterated
llama_model_loader: - kv   2:                          llama.block_count u32              = 80
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 64
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128001
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = /models/llama-3-70B-Instruct-ablitera...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = /training_data/calibration_data.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 560
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 189
llama_model_loader: - kv  26:                                   split.no u16              = 0
llama_model_loader: - kv  27:                                split.count u16              = 2
llama_model_loader: - kv  28:                        split.tensors.count i32              = 723
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q6_K:  562 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 70.55 B
llm_load_print_meta: model size       = 53.91 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = llama-3-70B-Instruct-abliterated
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    1.69 MiB
llm_load_tensors: offloading 75 repeating layers to GPU
llm_load_tensors: offloaded 75/81 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =  4991.12 MiB
llm_load_tensors:      CUDA0 buffer size = 12049.88 MiB
llm_load_tensors:      CUDA1 buffer size = 12719.31 MiB
llm_load_tensors:      CUDA2 buffer size = 12719.31 MiB
llm_load_tensors:      CUDA3 buffer size = 12719.31 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 3840
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    75.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   270.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   285.00 MiB
llama_kv_cache_init:      CUDA2 KV buffer size =   285.00 MiB
llama_kv_cache_init:      CUDA3 KV buffer size =   285.00 MiB
llama_new_context_with_model: KV self size  = 1200.00 MiB, K (f16):  600.00 MiB, V (f16):  600.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1088.45 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   147.75 MiB
llama_new_context_with_model:      CUDA2 compute buffer size =   147.75 MiB
llama_new_context_with_model:      CUDA3 compute buffer size =   163.75 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    23.51 MiB
llama_new_context_with_model: graph nodes  = 2247
llama_new_context_with_model: graph splits = 62
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | 
Model metadata: {'split.count': '2', 'split.no': '0', 'quantize.imatrix.entries_count': '560', 'quantize.imatrix.dataset': '/training_data/calibration_data.txt', 'quantize.imatrix.chunks_count': '189', 'quantize.imatrix.file': '/models/llama-3-70B-Instruct-abliterated-GGUF/llama-3-70B-Instruct-abliterated.imatrix', 'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128001', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'split.tensors.count': '723', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'llama-3-70B-Instruct-abliterated', 'llama.embedding_length': '8192', 'llama.feed_forward_length': '28672', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '64', 'llama.block_count': '80', 'llama.attention.head_count_kv': '8', 'general.file_type': '18', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Available chat formats from metadata: chat_template.default
Guessed chat format: llama-3
20:36:33-849869 INFO     Loaded "llama-3-70B-Instruct-abliterated-Q6_K-00001-of-00002.gguf" in 53.32 seconds.
20:36:33-850664 INFO     LOADER: "llama.cpp"                                                               
20:36:33-851151 INFO     TRUNCATION LENGTH: 3840                                                           
20:36:33-851628 INFO     INSTRUCTION TEMPLATE: "Custom (obtained from model metadata)"                     
20:37:03-840480 INFO     Saved "/home/kot/text-generation-webui/presets/My Preset.yaml".                 
20:37:18-945337 INFO     Deleted "logs/chat/Copywriter/20240711-20-33-36.json".                            
/home/kot/text-generation-webui/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda_tensorcores/llama.py:1054: RuntimeWarning: Detected duplicate leading "<|begin_of_text|>" in prompt, this will likely reduce response quality, consider removing it...
  warnings.warn(
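
In case it helps narrow things down: since llama-cpp-python re-adds <|begin_of_text|> at tokenization time anyway, the duplicate seems avoidable by stripping the textual BOS from the rendered prompt before it is tokenized. This is only a hypothetical helper outside the webui, not a proper fix:

```python
# Hypothetical workaround sketch, not part of text-generation-webui: remove a
# textual <|begin_of_text|> from the start of the rendered prompt, because
# llama-cpp-python's tokenizer prepends the BOS token itself by default.
BOS = "<|begin_of_text|>"

def strip_leading_bos(prompt: str) -> str:
    # Drop one leading BOS marker if present; leave everything else untouched.
    return prompt[len(BOS):] if prompt.startswith(BOS) else prompt

rendered = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|>"
print(strip_leading_bos(rendered).startswith(BOS))  # False -> only the tokenizer's BOS remains
```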

System Info

EPYC 7F52
ASRock Rack ROMED8-2T
4090
4080
3090
3090
32gb RDIMM 3200
SlapDrone commented 1 month ago

+1, but I'm pretty sure this is a llama.cpp issue; I get the same warning with certain GGUFs (Llama 3 8B, Gemma 2 9B).
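
A quick way to check whether a given GGUF is affected (assuming llama-cpp-python, which exposes the loaded GGUF key/value metadata as a dict, as in the "Model metadata:" line in the logs above): the warning seems to hit models whose embedded chat template already inserts bos_token. The path below is a placeholder:

```python
# Rough check, assuming llama-cpp-python: inspect the GGUF's embedded chat
# template. Models whose template already inserts bos_token get a second BOS
# from the tokenizer, which triggers the duplicate-BOS warning.
from llama_cpp import Llama

llm = Llama(model_path="models/some-model.gguf", vocab_only=True, verbose=False)  # placeholder path
template = llm.metadata.get("tokenizer.chat_template", "")
print("template inserts bos_token itself:", "bos_token" in template)
```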

imancrsrk commented 1 week ago

+1 while using llama.cpp with llama-2-7b-chat