I would like to change the settings of llama-cpp in detail, but which file should I change?

Taikono-Himazin commented 5 months ago

I'm using docker's latest-aio-gpu-nvidia-cuda-12. I'm using multiple GPUs.

I am modifying aio/gpu-8g/text-to-text.yaml, but the configuration items described here https://localai.io/advanced/#full-config-model-file-reference are sufficient. Is not ...

Specifically, I'm running into the issue https://github.com/ggerganov/llama.cpp/issues/6603. I want to set --ctx-size in this solution, but I don't know if context_size in yaml has the same meaning. There are also other settings that you may want to change.

localai-bot commented 5 months ago

To change the settings of llama-cpp, you will need to modify the text-to-text.yaml file under the aio/gpu-8g/ directory. Here's how you can do it:

Open the text-to-text.yaml file in a text editor.
Look for the model section, which contains the various parameters of the llama-cpp model.
Find the specific setting you would like to change. If you want to set the --ctx-size flag, look for the context_size parameter. Note that in the YAML file, this parameter might be present under model.llama.ctx_size or model.llama_cpp.ctx_size.
Modify the value of the parameter to your desired setting.
Save the changes and close the file.
Restart your Docker container or re-run your script for the changes to take effect.

If you want to change other settings as well, look for the relevant parameters in the model section and modify them accordingly. Make sure you understand the effects of each parameter before modifying it.

Regarding the context_size parameter, you can check the llama-cpp documentation or the GitHub issue you mentioned (https://github.com/ggerganov/llama.cpp/issues/6603) to see if context_size in the YAML file corresponds to the --ctx-size flag. If there's any difference, follow the instructions provided in the issue for setting the appropriate parameter.

Remember to test your model after making any changes to ensure that everything is working as expected.

Taikono-Himazin commented 4 months ago

@localai-bot Where did the information for model.llama_cpp.ctx_size come from? Is it mentioned in the documentation?

localai-bot commented 4 months ago

The model.llama_cpp.ctx_size parameter likely came from the documentation or configuration examples provided by the llama-cpp developers. However, it might also be derived from best practices shared by users within the community, based on their own experimentation. To be sure of its origin or to find the most up-to-date information, I suggest checking the official llama-cpp documentation or the GitHub repository (https://github.com/ggerganov/llama.cpp). It's also a good idea to look into any specific configuration examples provided by LocalAI for the text-to-text.yaml file you are using, as they may have documented this parameter's source.

If you can't find the information you need, feel free to ask directly in the llama-cpp community or on the LocalAI forum for further assistance.

Taikono-Himazin commented 4 months ago

context_size in text-to-text.yaml doesn't seem to be working.

Below is the log when context_size: 4096.

``` stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb (version GGUF V3 (latest)) stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. stderr llama_model_loader: - kv 0: general.architecture str = llama stderr llama_model_loader: - kv 1: general.name str = jeffq stderr llama_model_loader: - kv 2: llama.context_length u32 = 32768 stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 stderr llama_model_loader: - kv 4: llama.block_count u32 = 32 stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 stderr llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 stderr llama_model_loader: - kv 11: general.file_type u32 = 18 stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = llama stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32032] = ["", "~~", "~~", "<0x00>", "<... stderr llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32032] = [0.000000, 0.000000, 0.000000, 0.0000... stderr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32032] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000 stderr llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true stderr llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false stderr llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{{'<|im_... stderr llama_model_loader: - kv 21: general.quantization_version u32 = 2 stderr llama_model_loader: - type f32: 65 tensors stderr llama_model_loader: - type q6_K: 226 tensors stderr llm_load_vocab: special tokens definition check successful ( 291/32032 ). stderr llm_load_print_meta: format = GGUF V3 (latest) stderr llm_load_print_meta: arch = llama stderr llm_load_print_meta: vocab type = SPM stderr llm_load_print_meta: n_vocab = 32032 stderr llm_load_print_meta: n_merges = 0 stderr llm_load_print_meta: n_ctx_train = 32768 stderr llm_load_print_meta: n_embd = 4096 stderr llm_load_print_meta: n_head = 32 stderr llm_load_print_meta: n_head_kv = 8 stderr llm_load_print_meta: n_layer = 32 stderr llm_load_print_meta: n_rot = 128 stderr llm_load_print_meta: n_embd_head_k = 128 stderr llm_load_print_meta: n_embd_head_v = 128 stderr llm_load_print_meta: n_gqa = 4 stderr llm_load_print_meta: n_embd_k_gqa = 1024 stderr llm_load_print_meta: n_embd_v_gqa = 1024 stderr llm_load_print_meta: f_norm_eps = 0.0e+00 stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00 stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00 stderr llm_load_print_meta: f_logit_scale = 0.0e+00 stderr llm_load_print_meta: n_ff = 14336 stderr llm_load_print_meta: n_expert = 0 stderr llm_load_print_meta: n_expert_used = 0 stderr llm_load_print_meta: causal attn = 1 stderr llm_load_print_meta: pooling type = 0 stderr llm_load_print_meta: rope type = 0 stderr llm_load_print_meta: rope scaling = linear stderr llm_load_print_meta: freq_base_train = 10000.0 stderr llm_load_print_meta: freq_scale_train = 1 stderr llm_load_print_meta: n_yarn_orig_ctx = 32768 stderr llm_load_print_meta: rope_finetuned = unknown stderr llm_load_print_meta: ssm_d_conv = 0 stderr llm_load_print_meta: ssm_d_inner = 0 stderr llm_load_print_meta: ssm_d_state = 0 stderr llm_load_print_meta: ssm_dt_rank = 0 stderr llm_load_print_meta: model type = 7B stderr llm_load_print_meta: model ftype = Q6_K stderr llm_load_print_meta: model params = 7.24 B stderr llm_load_print_meta: model size = 5.53 GiB (6.56 BPW) stderr llm_load_print_meta: general.name = jeffq stderr llm_load_print_meta: BOS token = 1 '' stderr llm_load_print_meta: EOS token = 32000 '<|im_end|>' stderr llm_load_print_meta: UNK token = 0 '' stderr llm_load_print_meta: LF token = 13 '<0x0A>' stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes stderr ggml_cuda_init: found 9 CUDA devices: stderr Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 6: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 7: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 8: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr llm_load_tensors: ggml ctx size = 1.11 MiB stderr llm_load_tensors: offloading 32 repeating layers to GPU stderr llm_load_tensors: offloading non-repeating layers to GPU stderr llm_load_tensors: offloaded 33/33 layers to GPU stderr llm_load_tensors: CPU buffer size = 102.64 MiB stderr llm_load_tensors: CUDA0 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA1 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA2 buffer size = 511.97 MiB stderr llm_load_tensors: CUDA3 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA4 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA5 buffer size = 511.97 MiB stderr llm_load_tensors: CUDA6 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA7 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA8 buffer size = 443.97 MiB stderr .................................................................................................... stderr llama_new_context_with_model: n_ctx = 4096 stderr llama_new_context_with_model: n_batch = 512 stderr llama_new_context_with_model: n_ubatch = 512 stderr llama_new_context_with_model: freq_base = 10000.0 stderr llama_new_context_with_model: freq_scale = 1 stderr llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA1 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA2 KV buffer size = 48.00 MiB stderr llama_kv_cache_init: CUDA3 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA4 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA5 KV buffer size = 48.00 MiB stderr llama_kv_cache_init: CUDA6 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA7 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA8 KV buffer size = 32.00 MiB stderr llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB stderr llama_new_context_with_model: CUDA_Host output buffer size = 0.73 MiB stderr llama_new_context_with_model: pipeline parallelism enabled (n_copies=4) stderr llama_new_context_with_model: CUDA0 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA1 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA2 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA3 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA4 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA5 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA6 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA7 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA8 compute buffer size = 352.02 MiB stderr llama_new_context_with_model: CUDA_Host compute buffer size = 40.02 MiB stderr llama_new_context_with_model: graph nodes = 1030 stderr llama_new_context_with_model: graph splits = 10 stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":6} stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":682} stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":1,"n_ctx_slot":682} stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":2,"n_ctx_slot":682} stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":3,"n_ctx_slot":682} stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":4,"n_ctx_slot":682} stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":5,"n_ctx_slot":682} ```

Below is the log when context_size: 512.
``` stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb (version GGUF V3 (latest)) stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. stderr llama_model_loader: - kv 0: general.architecture str = llama stderr llama_model_loader: - kv 1: general.name str = jeffq stderr llama_model_loader: - kv 2: llama.context_length u32 = 32768 stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 stderr llama_model_loader: - kv 4: llama.block_count u32 = 32 stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 stderr llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000 stderr llama_model_loader: - kv 11: general.file_type u32 = 18 stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = llama stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32032] = ["", "~~", "~~", "<0x00>", "<... stderr llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32032] = [0.000000, 0.000000, 0.000000, 0.0000... stderr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32032] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000 stderr llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true stderr llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false stderr llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{{'<|im_... stderr llama_model_loader: - kv 21: general.quantization_version u32 = 2 stderr llama_model_loader: - type f32: 65 tensors stderr llama_model_loader: - type q6_K: 226 tensors stderr llm_load_vocab: special tokens definition check successful ( 291/32032 ). stderr llm_load_print_meta: format = GGUF V3 (latest) stderr llm_load_print_meta: arch = llama stderr llm_load_print_meta: vocab type = SPM stderr llm_load_print_meta: n_vocab = 32032 stderr llm_load_print_meta: n_merges = 0 stderr llm_load_print_meta: n_ctx_train = 32768 stderr llm_load_print_meta: n_embd = 4096 stderr llm_load_print_meta: n_head = 32 stderr llm_load_print_meta: n_head_kv = 8 stderr llm_load_print_meta: n_layer = 32 stderr llm_load_print_meta: n_rot = 128 stderr llm_load_print_meta: n_embd_head_k = 128 stderr llm_load_print_meta: n_embd_head_v = 128 stderr llm_load_print_meta: n_gqa = 4 stderr llm_load_print_meta: n_embd_k_gqa = 1024 stderr llm_load_print_meta: n_embd_v_gqa = 1024 stderr llm_load_print_meta: f_norm_eps = 0.0e+00 stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00 stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00 stderr llm_load_print_meta: f_logit_scale = 0.0e+00 stderr llm_load_print_meta: n_ff = 14336 stderr llm_load_print_meta: n_expert = 0 stderr llm_load_print_meta: n_expert_used = 0 stderr llm_load_print_meta: causal attn = 1 stderr llm_load_print_meta: pooling type = 0 stderr llm_load_print_meta: rope type = 0 stderr llm_load_print_meta: rope scaling = linear stderr llm_load_print_meta: freq_base_train = 10000.0 stderr llm_load_print_meta: freq_scale_train = 1 stderr llm_load_print_meta: n_yarn_orig_ctx = 32768 stderr llm_load_print_meta: rope_finetuned = unknown stderr llm_load_print_meta: ssm_d_conv = 0 stderr llm_load_print_meta: ssm_d_inner = 0 stderr llm_load_print_meta: ssm_d_state = 0 stderr llm_load_print_meta: ssm_dt_rank = 0 stderr llm_load_print_meta: model type = 7B stderr llm_load_print_meta: model ftype = Q6_K stderr llm_load_print_meta: model params = 7.24 B stderr llm_load_print_meta: model size = 5.53 GiB (6.56 BPW) stderr llm_load_print_meta: general.name = jeffq stderr llm_load_print_meta: BOS token = 1 '' stderr llm_load_print_meta: EOS token = 32000 '<|im_end|>' stderr llm_load_print_meta: UNK token = 0 '' stderr llm_load_print_meta: LF token = 13 '<0x0A>' stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes stderr ggml_cuda_init: found 9 CUDA devices: stderr Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 6: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 7: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr Device 8: NVIDIA RTX A6000, compute capability 8.6, VMM: yes stderr llm_load_tensors: ggml ctx size = 1.11 MiB stderr llm_load_tensors: offloading 32 repeating layers to GPU stderr llm_load_tensors: offloading non-repeating layers to GPU stderr llm_load_tensors: offloaded 33/33 layers to GPU stderr llm_load_tensors: CPU buffer size = 102.64 MiB stderr llm_load_tensors: CUDA0 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA1 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA2 buffer size = 511.97 MiB stderr llm_load_tensors: CUDA3 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA4 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA5 buffer size = 511.97 MiB stderr llm_load_tensors: CUDA6 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA7 buffer size = 682.62 MiB stderr llm_load_tensors: CUDA8 buffer size = 443.97 MiB stderr .................................................................................................... stderr llama_new_context_with_model: n_ctx = 4096 stderr llama_new_context_with_model: n_batch = 512 stderr llama_new_context_with_model: n_ubatch = 512 stderr llama_new_context_with_model: freq_base = 10000.0 stderr llama_new_context_with_model: freq_scale = 1 stderr llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA1 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA2 KV buffer size = 48.00 MiB stderr llama_kv_cache_init: CUDA3 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA4 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA5 KV buffer size = 48.00 MiB stderr llama_kv_cache_init: CUDA6 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA7 KV buffer size = 64.00 MiB stderr llama_kv_cache_init: CUDA8 KV buffer size = 32.00 MiB stderr llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB stderr llama_new_context_with_model: CUDA_Host output buffer size = 0.73 MiB stderr llama_new_context_with_model: pipeline parallelism enabled (n_copies=4) stderr llama_new_context_with_model: CUDA0 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA1 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA2 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA3 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA4 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA5 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA6 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA7 compute buffer size = 352.01 MiB stderr llama_new_context_with_model: CUDA8 compute buffer size = 352.02 MiB stderr llama_new_context_with_model: CUDA_Host compute buffer size = 40.02 MiB stderr llama_new_context_with_model: graph nodes = 1030 stderr llama_new_context_with_model: graph splits = 10 stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":6} stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":682} stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":1,"n_ctx_slot":682} stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":2,"n_ctx_slot":682} stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":3,"n_ctx_slot":682} stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":4,"n_ctx_slot":682} stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":5,"n_ctx_slot":682} ```

~~Both look the same to me.~~

mudler / LocalAI

I would like to change the settings of llama-cpp in detail, but which file should I change? #2015