Closed (mnuppnau closed this 2 weeks ago)
I had the same problem and I figured out how to fix it. The issue is that, for some odd reason, LangChain has hardcoded default values of rope_freq_scale=1.0 and rope_freq_base=10000 and does not allow llama.cpp to automatically set the appropriate rope values based on the model metadata. Simply set rope_freq_base=500000 and Llama 3 will shine again. Now I am trying to figure out how to prevent LangChain from altering these settings at all.
I've tried to update the script above with various settings including:
model_kwargs = {'rope_freq_base': 50000}

llm = LlamaCpp(
    model_path="./models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=n_ctx,
    model_kwargs=model_kwargs,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
And the output is changed slightly but is still nonsensical. Am I updating the incorrect parameter?
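One way to rule out the LangChain wrapper entirely is to load the same GGUF with llama-cpp-python directly and pass rope_freq_base there. The sketch below assumes llama-cpp-python is installed; the prompt is a placeholder:

# Sketch: load the model with llama-cpp-python directly, bypassing LangChain,
# to check whether rope_freq_base=500000 gives coherent output on long prompts.
from llama_cpp import Llama

raw_llm = Llama(
    model_path="./models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,          # offload all layers to the GPUs
    n_ctx=8192,               # Llama 3 training context length
    rope_freq_base=500000.0,  # the value stored in the GGUF metadata
)

out = raw_llm("Summarize the following text: ...", max_tokens=128)  # placeholder prompt
print(out["choices"][0]["text"])

If this produces sensible text while the LangChain call does not, the problem is in how the wrapper forwards the parameter rather than in the model or llama.cpp itself.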
Oops, sorry, it should be 500,000, not 50,000 as I incorrectly said before. But that's for the standard ctx size; for a different ctx_size we probably need to recalculate the rope parameters.
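As a rough rule of thumb (an assumption, not something stated in this thread): with llama.cpp's linear rope scaling, the scale factor is usually the training context divided by the target context, while the base frequency stays at the value from the model metadata. A small sketch:

# Rough rule of thumb for linear RoPE scaling (assumption, not from this thread):
# compress positions so the extended context maps back onto the trained range.
n_ctx_train = 8192    # Llama 3 training context (from the model metadata)
n_ctx_target = 16384  # desired runtime context

rope_freq_scale = n_ctx_train / n_ctx_target  # 0.5 for a 2x extension
rope_freq_base = 500000                       # keep the value from the GGUF metadata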
It appears that the rope_freq_base was already set to 500,000. If you look at my output above, it shows 'llama.rope.freq_base': '500000.000000'. Here is additional output information:
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = hub
llama_model_loader: - kv 2: llama.vocab_size u32 = 128256
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 8192
llama_model_loader: - kv 5: llama.block_count u32 = 80
llama_model_loader: - kv 6: llama.feed_forward_length u32 = 28672
llama_model_loader: - kv 7: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 8: llama.attention.head_count u32 = 64
llama_model_loader: - kv 9: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 10: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 11: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 12: general.file_type u32 = 15
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,128256] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128001
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_K: 441 tensors
llama_model_loader: - type q5_K: 40 tensors
llama_model_loader: - type q6_K: 81 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 70.55 B
llm_load_print_meta: model size = 39.59 GiB (4.82 BPW)
llm_load_print_meta: general.name = hub
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
llm_load_print_meta: LF token = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 1.10 MiB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 81/81 layers to GPU
llm_load_tensors: CPU buffer size = 563.62 MiB
llm_load_tensors: CUDA0 buffer size = 20038.81 MiB
llm_load_tensors: CUDA1 buffer size = 19940.67 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 1024
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
It seems that no matter how I adjust rope-freq-base, rope_freq_base, or freq_base, these printed values do not change.
I also see the correct value in the output, but when I set rope_freq_base in the LlamaCpp constructor (directly, not via kwargs), the behavior of the model changes, while in the console I still see the same value of 500000. The documentation doesn't label this parameter as optional, and it mentions a default value (which is not suitable for Llama 3): https://api.python.langchain.com/en/latest/llms/langchain_community.llms.llamacpp.LlamaCpp.html#langchain_community.llms.llamacpp.LlamaCpp.rope_freq_base
The following update works now:
rope_freq_base = 500000
max_tokens = 1024

llm = LlamaCpp(
    model_path="./models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=n_ctx,
    rope_freq_base=rope_freq_base,
    max_tokens=max_tokens,
    callback_manager=callback_manager,
    verbose=True,  # Verbose is required to pass to the callback manager
)
Thanks!
I just realized that when I set rope-freq-base and rope_freq_base to zero, it lets llama.cpp automatically set it based on the model metadata :smile:
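A minimal sketch of that approach, assuming a LangChain and llama.cpp version where 0.0 is interpreted as "read from the model metadata":

from langchain_community.llms import LlamaCpp

# Sketch: passing 0.0 asks llama.cpp to derive rope_freq_base from the GGUF
# metadata instead of using LangChain's hardcoded default of 10000.
llm = LlamaCpp(
    model_path="./models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
    n_ctx=8192,
    rope_freq_base=0.0,
    verbose=True,
)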
Thanks for this discussion! Setting rope_freq_base significantly helped me. The model is more compliant with the system prompt than when it is not set.
Hi, @mnuppnau. I'm helping the LangChain team manage their backlog and am marking this issue as stale.
The issue you raised regarding the Llama 3 model producing nonsensical outputs with long context lengths has been addressed. User strnad identified that the problem was linked to hardcoded default values for rope_freq_scale and rope_freq_base, and suggested setting rope_freq_base to 500,000, which has been confirmed by you and others to improve model performance.
Could you please let the LangChain team know if this issue is still relevant to the latest version of the repository? If it is, feel free to comment here. Otherwise, you can close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Checked other resources
Example Code
The following code demonstrates the issue:
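A minimal sketch of the kind of script that reproduces the issue, based on the snippets shown elsewhere in the thread; the callback setup and the long prompt are assumptions:

from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import CallbackManager, StreamingStdOutCallbackHandler

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

llm = LlamaCpp(
    model_path="./models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,  # the log shows all 81 layers offloaded across two RTX 3090s
    n_batch=1024,
    n_ctx=8192,
    callback_manager=callback_manager,
    verbose=True,
    # Note: rope_freq_base is not set here, so LangChain's default of 10000 is used,
    # which is what triggers the nonsensical output on long prompts.
)

long_prompt = "..."  # roughly 8k tokens of input text (placeholder)
print(llm.invoke(long_prompt))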
Error Message and Stack Trace (if applicable)
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}", 'tokenizer.ggml.eos_token_id': '128001', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'llama.context_length': '8192', 'general.name': 'hub', 'llama.vocab_size': '128256', 'general.file_type': '15', 'llama.embedding_length': '8192', 'llama.feed_forward_length': '28672', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '64', 'llama.block_count': '80', 'llama.attention.head_count_kv': '8'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}
Using chat eos_token: <|end_of_text|>
Using chat bos_token: <|begin_of_text|>
the a for the " example: this, an or not . ( the more every which the what, example the the other that. each that about the the the a
The to _ more so an in the this to for and an this all a any a a the in the and the to a a such the, to and a all that
llama_print_timings: load time = 1044.99 ms
llama_print_timings: sample time = 26.47 ms / 70 runs ( 0.38 ms per token, 2644.50 tokens per second)
llama_print_timings: prompt eval time = 21789.86 ms / 8122 tokens ( 2.68 ms per token, 372.74 tokens per second)
llama_print_timings: eval time = 4749.77 ms / 69 runs ( 68.84 ms per token, 14.53 tokens per second)
llama_print_timings: total time = 26954.58 ms / 8191 tokens
Description
I'm trying to use LangChain, LlamaCpp, and LLMChain to generate output from Meta's new Llama 3 models. I've tried various types of models, all with the same issue. The models perform well on text of around 3k tokens or less. When the token length is increased, the output becomes nonsensical. I am able to successfully run the llama.cpp main command in interactive mode and get meaningful output when pasting 8k tokens into the terminal.
System Info
I've tried this on various systems; here is one:
System Information
Package Information