mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

I would like to change the settings of llama-cpp in detail, but which file should I change? #2015

Open Taikono-Himazin opened 5 months ago

Taikono-Himazin commented 5 months ago

I'm using the latest-aio-gpu-nvidia-cuda-12 Docker image with multiple GPUs.

I would like to change the settings of llama-cpp in detail, but which file should I change?

I am modifying aio/gpu-8g/text-to-text.yaml, but I'm not sure whether the configuration items described in https://localai.io/advanced/#full-config-model-file-reference are sufficient.

Specifically, I'm running into https://github.com/ggerganov/llama.cpp/issues/6603. I want to set --ctx-size as suggested in that issue, but I don't know whether context_size in the YAML has the same meaning. There are also other settings I may want to change.

localai-bot commented 5 months ago

To change the settings of llama-cpp, you will need to modify the text-to-text.yaml file under the aio/gpu-8g/ directory. Here's how you can do it:

  1. Open the text-to-text.yaml file in a text editor.
  2. Look for the model section, which contains the various parameters of the llama-cpp model.
  3. Find the specific setting you would like to change. If you want to set the --ctx-size flag, look for the context_size parameter. Note that in the YAML file, this parameter might be present under model.llama.ctx_size or model.llama_cpp.ctx_size.
  4. Modify the value of the parameter to your desired setting.
  5. Save the changes and close the file.
  6. Restart your Docker container or re-run your script for the changes to take effect.

If you want to change other settings as well, look for the relevant parameters in the model section and modify them accordingly. Make sure you understand the effects of each parameter before modifying it.

Regarding the context_size parameter, you can check the llama-cpp documentation or the GitHub issue you mentioned (https://github.com/ggerganov/llama.cpp/issues/6603) to see if context_size in the YAML file corresponds to the --ctx-size flag. If there's any difference, follow the instructions provided in the issue for setting the appropriate parameter.
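
For illustration, the field in question sits at the top level of the model YAML; here is a minimal sketch, assuming a Hermes-2-Pro GGUF like the one used by the AIO preset and keeping only the fields relevant to the context window:

```yaml
# Minimal sketch of a LocalAI model file; not the complete AIO preset.
name: gpt-4
parameters:
  model: huggingface://NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q8_0.gguf
context_size: 4096   # handed to the llama.cpp backend as its context window (llama.cpp's --ctx-size / n_ctx)
f16: true
```

After editing, restart the container and check the `llama_new_context_with_model: n_ctx = ...` line in the startup log to confirm the value was actually picked up.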

Remember to test your model after making any changes to ensure that everything is working as expected.

Taikono-Himazin commented 4 months ago

@localai-bot Where did the information for model.llama_cpp.ctx_size come from? Is it mentioned in the documentation?

localai-bot commented 4 months ago

The model.llama_cpp.ctx_size parameter likely came from the documentation or configuration examples provided by the llama-cpp developers. However, it might also be derived from best practices shared by users within the community, based on their own experimentation. To be sure of its origin or to find the most up-to-date information, I suggest checking the official llama-cpp documentation or the GitHub repository (https://github.com/ggerganov/llama.cpp). It's also a good idea to look into any specific configuration examples provided by LocalAI for the text-to-text.yaml file you are using, as they may have documented this parameter's source.

If you can't find the information you need, feel free to ask directly in the llama-cpp community or on the LocalAI forum for further assistance.

Taikono-Himazin commented 4 months ago

context_size in text-to-text.yaml doesn't seem to be working.

Below is the log when `context_size: 4096`.

```
stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb (version GGUF V3 (latest))
stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
stderr llama_model_loader: - kv 0: general.architecture str = llama
stderr llama_model_loader: - kv 1: general.name str = jeffq
stderr llama_model_loader: - kv 2: llama.context_length u32 = 32768
stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
stderr llama_model_loader: - kv 4: llama.block_count u32 = 32
stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
stderr llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
stderr llama_model_loader: - kv 11: general.file_type u32 = 18
stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32032] = ["", "", "", "<0x00>", "<...
stderr llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32032] = [0.000000, 0.000000, 0.000000, 0.0000...
stderr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32032] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
stderr llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
stderr llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
stderr llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
stderr llama_model_loader: - kv 21: general.quantization_version u32 = 2
stderr llama_model_loader: - type f32: 65 tensors
stderr llama_model_loader: - type q6_K: 226 tensors
stderr llm_load_vocab: special tokens definition check successful ( 291/32032 ).
stderr llm_load_print_meta: format = GGUF V3 (latest)
stderr llm_load_print_meta: arch = llama
stderr llm_load_print_meta: vocab type = SPM
stderr llm_load_print_meta: n_vocab = 32032
stderr llm_load_print_meta: n_merges = 0
stderr llm_load_print_meta: n_ctx_train = 32768
stderr llm_load_print_meta: n_embd = 4096
stderr llm_load_print_meta: n_head = 32
stderr llm_load_print_meta: n_head_kv = 8
stderr llm_load_print_meta: n_layer = 32
stderr llm_load_print_meta: n_rot = 128
stderr llm_load_print_meta: n_embd_head_k = 128
stderr llm_load_print_meta: n_embd_head_v = 128
stderr llm_load_print_meta: n_gqa = 4
stderr llm_load_print_meta: n_embd_k_gqa = 1024
stderr llm_load_print_meta: n_embd_v_gqa = 1024
stderr llm_load_print_meta: f_norm_eps = 0.0e+00
stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00
stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
stderr llm_load_print_meta: f_logit_scale = 0.0e+00
stderr llm_load_print_meta: n_ff = 14336
stderr llm_load_print_meta: n_expert = 0
stderr llm_load_print_meta: n_expert_used = 0
stderr llm_load_print_meta: causal attn = 1
stderr llm_load_print_meta: pooling type = 0
stderr llm_load_print_meta: rope type = 0
stderr llm_load_print_meta: rope scaling = linear
stderr llm_load_print_meta: freq_base_train = 10000.0
stderr llm_load_print_meta: freq_scale_train = 1
stderr llm_load_print_meta: n_yarn_orig_ctx = 32768
stderr llm_load_print_meta: rope_finetuned = unknown
stderr llm_load_print_meta: ssm_d_conv = 0
stderr llm_load_print_meta: ssm_d_inner = 0
stderr llm_load_print_meta: ssm_d_state = 0
stderr llm_load_print_meta: ssm_dt_rank = 0
stderr llm_load_print_meta: model type = 7B
stderr llm_load_print_meta: model ftype = Q6_K
stderr llm_load_print_meta: model params = 7.24 B
stderr llm_load_print_meta: model size = 5.53 GiB (6.56 BPW)
stderr llm_load_print_meta: general.name = jeffq
stderr llm_load_print_meta: BOS token = 1 ''
stderr llm_load_print_meta: EOS token = 32000 '<|im_end|>'
stderr llm_load_print_meta: UNK token = 0 ''
stderr llm_load_print_meta: LF token = 13 '<0x0A>'
stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
stderr ggml_cuda_init: found 9 CUDA devices:
stderr Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 6: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 7: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 8: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr llm_load_tensors: ggml ctx size = 1.11 MiB
stderr llm_load_tensors: offloading 32 repeating layers to GPU
stderr llm_load_tensors: offloading non-repeating layers to GPU
stderr llm_load_tensors: offloaded 33/33 layers to GPU
stderr llm_load_tensors: CPU buffer size = 102.64 MiB
stderr llm_load_tensors: CUDA0 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA1 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA2 buffer size = 511.97 MiB
stderr llm_load_tensors: CUDA3 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA4 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA5 buffer size = 511.97 MiB
stderr llm_load_tensors: CUDA6 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA7 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA8 buffer size = 443.97 MiB
stderr ....................................................................................................
stderr llama_new_context_with_model: n_ctx = 4096
stderr llama_new_context_with_model: n_batch = 512
stderr llama_new_context_with_model: n_ubatch = 512
stderr llama_new_context_with_model: freq_base = 10000.0
stderr llama_new_context_with_model: freq_scale = 1
stderr llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA1 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA2 KV buffer size = 48.00 MiB
stderr llama_kv_cache_init: CUDA3 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA4 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA5 KV buffer size = 48.00 MiB
stderr llama_kv_cache_init: CUDA6 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA7 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA8 KV buffer size = 32.00 MiB
stderr llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
stderr llama_new_context_with_model: CUDA_Host output buffer size = 0.73 MiB
stderr llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
stderr llama_new_context_with_model: CUDA0 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA1 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA2 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA3 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA4 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA5 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA6 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA7 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA8 compute buffer size = 352.02 MiB
stderr llama_new_context_with_model: CUDA_Host compute buffer size = 40.02 MiB
stderr llama_new_context_with_model: graph nodes = 1030
stderr llama_new_context_with_model: graph splits = 10
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":6}
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":682}
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":1,"n_ctx_slot":682}
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":2,"n_ctx_slot":682}
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":3,"n_ctx_slot":682}
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":4,"n_ctx_slot":682}
stdout {"timestamp":1713602481,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":5,"n_ctx_slot":682}
```

Below is the log when `context_size: 512`.

```
stderr llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /build/models/5c7cd056ecf9a4bb5b527410b97f48cb (version GGUF V3 (latest))
stderr llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
stderr llama_model_loader: - kv 0: general.architecture str = llama
stderr llama_model_loader: - kv 1: general.name str = jeffq
stderr llama_model_loader: - kv 2: llama.context_length u32 = 32768
stderr llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
stderr llama_model_loader: - kv 4: llama.block_count u32 = 32
stderr llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
stderr llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8
stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
stderr llama_model_loader: - kv 10: llama.rope.freq_base f32 = 10000.000000
stderr llama_model_loader: - kv 11: general.file_type u32 = 18
stderr llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
stderr llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32032] = ["", "", "", "<0x00>", "<...
stderr llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32032] = [0.000000, 0.000000, 0.000000, 0.0000...
stderr llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32032] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
stderr llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
stderr llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 32000
stderr llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
stderr llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
stderr llama_model_loader: - kv 20: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
stderr llama_model_loader: - kv 21: general.quantization_version u32 = 2
stderr llama_model_loader: - type f32: 65 tensors
stderr llama_model_loader: - type q6_K: 226 tensors
stderr llm_load_vocab: special tokens definition check successful ( 291/32032 ).
stderr llm_load_print_meta: format = GGUF V3 (latest)
stderr llm_load_print_meta: arch = llama
stderr llm_load_print_meta: vocab type = SPM
stderr llm_load_print_meta: n_vocab = 32032
stderr llm_load_print_meta: n_merges = 0
stderr llm_load_print_meta: n_ctx_train = 32768
stderr llm_load_print_meta: n_embd = 4096
stderr llm_load_print_meta: n_head = 32
stderr llm_load_print_meta: n_head_kv = 8
stderr llm_load_print_meta: n_layer = 32
stderr llm_load_print_meta: n_rot = 128
stderr llm_load_print_meta: n_embd_head_k = 128
stderr llm_load_print_meta: n_embd_head_v = 128
stderr llm_load_print_meta: n_gqa = 4
stderr llm_load_print_meta: n_embd_k_gqa = 1024
stderr llm_load_print_meta: n_embd_v_gqa = 1024
stderr llm_load_print_meta: f_norm_eps = 0.0e+00
stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
stderr llm_load_print_meta: f_clamp_kqv = 0.0e+00
stderr llm_load_print_meta: f_max_alibi_bias = 0.0e+00
stderr llm_load_print_meta: f_logit_scale = 0.0e+00
stderr llm_load_print_meta: n_ff = 14336
stderr llm_load_print_meta: n_expert = 0
stderr llm_load_print_meta: n_expert_used = 0
stderr llm_load_print_meta: causal attn = 1
stderr llm_load_print_meta: pooling type = 0
stderr llm_load_print_meta: rope type = 0
stderr llm_load_print_meta: rope scaling = linear
stderr llm_load_print_meta: freq_base_train = 10000.0
stderr llm_load_print_meta: freq_scale_train = 1
stderr llm_load_print_meta: n_yarn_orig_ctx = 32768
stderr llm_load_print_meta: rope_finetuned = unknown
stderr llm_load_print_meta: ssm_d_conv = 0
stderr llm_load_print_meta: ssm_d_inner = 0
stderr llm_load_print_meta: ssm_d_state = 0
stderr llm_load_print_meta: ssm_dt_rank = 0
stderr llm_load_print_meta: model type = 7B
stderr llm_load_print_meta: model ftype = Q6_K
stderr llm_load_print_meta: model params = 7.24 B
stderr llm_load_print_meta: model size = 5.53 GiB (6.56 BPW)
stderr llm_load_print_meta: general.name = jeffq
stderr llm_load_print_meta: BOS token = 1 ''
stderr llm_load_print_meta: EOS token = 32000 '<|im_end|>'
stderr llm_load_print_meta: UNK token = 0 ''
stderr llm_load_print_meta: LF token = 13 '<0x0A>'
stderr ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
stderr ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
stderr ggml_cuda_init: found 9 CUDA devices:
stderr Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 2: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 3: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 4: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 5: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 6: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 7: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr Device 8: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
stderr llm_load_tensors: ggml ctx size = 1.11 MiB
stderr llm_load_tensors: offloading 32 repeating layers to GPU
stderr llm_load_tensors: offloading non-repeating layers to GPU
stderr llm_load_tensors: offloaded 33/33 layers to GPU
stderr llm_load_tensors: CPU buffer size = 102.64 MiB
stderr llm_load_tensors: CUDA0 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA1 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA2 buffer size = 511.97 MiB
stderr llm_load_tensors: CUDA3 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA4 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA5 buffer size = 511.97 MiB
stderr llm_load_tensors: CUDA6 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA7 buffer size = 682.62 MiB
stderr llm_load_tensors: CUDA8 buffer size = 443.97 MiB
stderr ....................................................................................................
stderr llama_new_context_with_model: n_ctx = 4096
stderr llama_new_context_with_model: n_batch = 512
stderr llama_new_context_with_model: n_ubatch = 512
stderr llama_new_context_with_model: freq_base = 10000.0
stderr llama_new_context_with_model: freq_scale = 1
stderr llama_kv_cache_init: CUDA0 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA1 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA2 KV buffer size = 48.00 MiB
stderr llama_kv_cache_init: CUDA3 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA4 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA5 KV buffer size = 48.00 MiB
stderr llama_kv_cache_init: CUDA6 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA7 KV buffer size = 64.00 MiB
stderr llama_kv_cache_init: CUDA8 KV buffer size = 32.00 MiB
stderr llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
stderr llama_new_context_with_model: CUDA_Host output buffer size = 0.73 MiB
stderr llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
stderr llama_new_context_with_model: CUDA0 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA1 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA2 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA3 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA4 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA5 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA6 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA7 compute buffer size = 352.01 MiB
stderr llama_new_context_with_model: CUDA8 compute buffer size = 352.02 MiB
stderr llama_new_context_with_model: CUDA_Host compute buffer size = 40.02 MiB
stderr llama_new_context_with_model: graph nodes = 1030
stderr llama_new_context_with_model: graph splits = 10
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":502,"message":"initializing slots","n_slots":6}
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":0,"n_ctx_slot":682}
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":1,"n_ctx_slot":682}
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":2,"n_ctx_slot":682}
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":3,"n_ctx_slot":682}
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":4,"n_ctx_slot":682}
stdout {"timestamp":1713602668,"level":"INFO","function":"initialize","line":511,"message":"new slot","slot_id":5,"n_ctx_slot":682}
```

Both look the same to me.

Taikono-Himazin commented 4 months ago

The text-to-text.yaml settings don't seem to be applied correctly. The version is v2.12.4.

My text-to-text.yaml is this:

```yaml
name: gpt-4
mmap: true
parameters:
  model: huggingface://NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q8_0.gguf
template:
  chat_message: |
    <|im_start|>{{if eq .RoleName "assistant"}}assistant{{else if eq .RoleName "system"}}system{{else if eq .RoleName "tool"}}tool{{else if eq .RoleName "user"}}user{{end}}
    {{- if .FunctionCall }}<tool_call>{{end}}
    {{- if eq .RoleName "tool" }}<tool_response>{{end }}
    {{- if .Content}}
    {{.Content}}
    {{- end }}
    {{- if .FunctionCall}}{{toJson .FunctionCall}}{{end }}
    {{- if .FunctionCall }}</tool_call>{{end }}
    {{- if eq .RoleName "tool" }}</tool_response>{{end }}
    <|im_end|>
  # https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF#prompt-format-for-function-calling
  function: |
    <|im_start|>system
    You are a function calling AI model. You are provided with function signatures within <tools></tools> XML tags. You may call one or more functions to assist with the user query. Don't make assumptions about what values to plug into functions. Here are the available tools:
    <tools>
    {{range .Functions}}
    {'type': 'function', 'function': {'name': '{{.Name}}', 'description': '{{.Description}}', 'parameters': {{toJson .Parameters}} }}
    {{end}}
    </tools>
    Use the following pydantic model json schema for each tool call you will make: {'title': 'FunctionCall', 'type': 'object', 'properties': {'arguments': {'title': 'Arguments', 'type': 'object'}, 'name': {'title': 'Name', 'type': 'string'}}, 'required': ['arguments', 'name']}
    For each function call return a json object with function name and arguments within <tool_call></tool_call> XML tags as follows:
    <tool_call>
    {'arguments': <args-dict>, 'name': <function-name>}
    </tool_call><|im_end|>
    {{.Input -}}
    <|im_start|>assistant
  parallel_calls: true
  chat: |
    {{.Input -}}
    <|im_start|>assistant
  completion: |
    {{.Input}}
context_size: 512
f16: true
feature_flags:
  usage: true
stopwords:
- <|im_end|>
- <dummy32000>
- "\n</tool_call>"
- "\n\n\n"
usage: |
  curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "How are you doing?", "temperature": 0.1}]
  }'
```

Taikono-Himazin commented 4 months ago

Very sorry. The configuration file I was editing was located elsewhere; when I changed the file in the correct location, the change was applied correctly.

However, I would still like to know how to change settings other than context_size. It would be nice to be able to specify the arguments passed when starting the llama-cpp backend.
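
As far as I can tell there is no way to hand raw command-line arguments to the llama-cpp backend; the launch options are driven by fields in the model YAML (the full config model file reference linked above lists them). Below is a rough sketch with approximate llama.cpp flag equivalents in the comments; field availability can vary by LocalAI version, so treat the mapping as a best-effort guide rather than an authoritative list:

```yaml
# Sketch of llama.cpp-related fields in a LocalAI model YAML.
# The flags in the comments are approximate equivalents; verify each field
# against the full config model file reference for your LocalAI version.
name: gpt-4
parameters:
  model: huggingface://NousResearch/Hermes-2-Pro-Mistral-7B-GGUF/Hermes-2-Pro-Mistral-7B.Q8_0.gguf
context_size: 4096    # --ctx-size
threads: 8            # --threads
f16: true             # 16-bit KV cache / compute where supported
gpu_layers: 33        # --n-gpu-layers (-ngl)
tensor_split: "4,4"   # --tensor-split (-ts), per-GPU split proportions
main_gpu: "0"         # --main-gpu (-mg)
mmap: true            # memory-map the model file
```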

neowisard commented 2 months ago

I would like to pass parameters such as `-ts 4,4 -sm row -ns 2 -ctk q8_0 -ctv q8_0 -fa` to llama.cpp. Is this possible?
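
Some of those flags have counterparts in the model YAML, others may not. A hedged sketch: tensor_split is documented, while the flash-attention and KV-cache-type field names below are assumptions based on newer LocalAI releases and should be verified against the config reference for your version (I'm not aware of YAML equivalents for -sm row or -ns):

```yaml
tensor_split: "4,4"     # -ts 4,4
flash_attention: true   # -fa (assumed field name, newer releases only)
cache_type_k: "q8_0"    # -ctk q8_0 (assumed field name)
cache_type_v: "q8_0"    # -ctv q8_0 (assumed field name)
```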