pchalasani opened this issue 1 year ago
Note that I am not using any `.yaml` file, or at least not explicitly, i.e., in the `/models/` dir there is no YAML file at all. So it is unclear what options/config the code is assuming in the absence of a YAML file.
Sounds like it is not finding your template file. Maybe try using a `model.yaml` for your model with a `chat_message` template configured like https://github.com/go-skynet/model-gallery/blob/main/llama2-chat.yaml#L19? I'm not a contributor here, just ran into some similar issues myself.
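Something along these lines, for example (a minimal sketch modeled on that gallery file; the names are placeholders for whatever sits in your `/models/` dir, and your setup may need different fields):

```yaml
# models/llama-2-13b-chat.yaml (hypothetical)
name: llama-2-13b-chat
context_size: 4096
parameters:
  model: llama-2-13b-chat.Q4_K_M.gguf  # the GGUF file in /models/
template:
  chat_message: llama-2-13b-chat       # resolves to models/llama-2-13b-chat.tmpl
```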
Actually I know it finds the `.tmpl` file, because if I remove it, the error msg simply says it could not find a template file. But I will try with that YAML.
I have another question: in the context of llama2 models, suppose I write my own code to format a chat history using `[INST]`, `<<SYS>>`, etc., and then use the `/completion` endpoint rather than `/chat/completion`. Can I expect it to work? Or, to ask another way: can I assume that the `completion` endpoint is "raw" and sends inputs straight to the model, i.e. that no further processing will be done by LocalAI?
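Concretely, I mean hand-assembling the Llama-2 format and posting it to the completion route, e.g. (a sketch only: I'm assuming the OpenAI-style `/v1/completions` path, omitting the BOS token on the assumption the tokenizer adds it, and assuming no `completion` template is configured for the model, since as far as I can tell one would still be applied if it were set):

```sh
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "llama-2-13b-chat",
  "prompt": "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nHow are you? [/INST]",
  "temperature": 0.7
}'
```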
I have two problems: the first is that I can't open the `.tmpl` file, and the second is that the model gets stuck in a loop, generating responses indefinitely; the call to the backend never resolves.
How were you able to fix it, @pchalasani ?
```
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 349: blk.38.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 350: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 351: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 352: blk.39.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 353: blk.39.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 354: blk.39.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 355: blk.39.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 356: blk.39.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 357: blk.39.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 358: blk.39.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 362: output.weight q8_0 [ 5120, 32000, 1, 1 ]
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 0: general.architecture str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 1: general.name str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 2: llama.context_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 3: llama.embedding_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 4: llama.block_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 7: llama.attention.head_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 10: general.file_type u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 18: general.quantization_version u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - type f32: 81 tensors
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - type q8_0: 282 tensors
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: format = GGUF V2 (latest)
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: arch = llama
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: vocab type = SPM
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_vocab = 32000
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_merges = 0
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ctx_train = 4096
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ctx = 1100
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_embd = 5120
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_head = 40
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_head_kv = 40
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_layer = 40
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_rot = 128
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_gqa = 1
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: f_norm_eps = 1.0e-05
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ff = 13824
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: freq_base = 10000.0
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: freq_scale = 1
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model type = 13B
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model ftype = mostly Q8_0
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model size = 13.02 B
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: general.name = LLaMA v2
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: BOS token = 1 '<s>'
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: EOS token = 2 '</s>'
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: UNK token = 0 '<unk>'
12:55AM DBG Prompt (before templating): [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
12:55AM DBG Template failed loading: template: prompt:1:8: executing "prompt" at <.RoleName>: can't evaluate field RoleName in type model.PromptTemplateData
12:55AM DBG Prompt (after templating): [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
12:55AM DBG Loading model llama from coquito
12:55AM DBG Model already loaded in memory: coquito
[127.0.0.1]:46608 200 - GET /readyz
```
@maxiannunziata I did not fix it. I gave up, as I did not find good documentation on using the YAMLs and templates, and instead went with another local LLM serving library: https://github.com/oobabooga/text-generation-webui
@pchalasani any chance you can share your model YAML? Not the template file (posted in OP), but the model YAML.
In https://github.com/mudler/LocalAI/issues/1316, adjusting the `template` field solved it:

```diff
 template:
-  chat_message: luna-chat-message
+  chat: luna-chat-message
```
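That kind of mismatch would also explain the error in the log above ("can't evaluate field RoleName in type model.PromptTemplateData"): `.RoleName` is only supplied to `chat_message` templates, so a per-message template wired into the wrong slot fails exactly like this. A sketch of the two slots side by side (template names are placeholders, and the field lists reflect my understanding rather than documented guarantees):

```yaml
template:
  chat_message: llama2-chat-message  # rendered once per message; sees .RoleName, .Content
  chat: llama2-chat                  # rendered once per request; sees the assembled .Input
```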
@mudler I'm facing the same issue on a Mac M2. Any solution?
LocalAI version: v1.25.0-40-g5661740 (56617409903bde702699a736530053eb4146aec8)
Environment, CPU architecture, OS, and Version: macOS, Apple M1 Max (MacBook Pro)
Describe the bug
`llama-2-chat-message.tmpl` not working

I roughly followed the instructions in the "Build on Mac" section: https://localai.io/basics/build/#build-on-mac

I copied `llama-2-13b-chat.Q4_K_M.gguf` from Hugging Face, placed it under `/models/`, and renamed it to `llama-2-13b-chat`. Then I copied `prompt-templates/llama-chat-message.tmpl` to `models/llama-2-13b-chat.tmpl`.
To Reproduce

Launch this in one terminal window:
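(The exact command wasn't captured here; reconstructed from the linked build docs, it would be roughly the following, though flags may differ by version:)

```sh
./local-ai --models-path ./models/ --debug=true
```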
Run this curl in another terminal:
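(Likewise not captured; a typical chat request against this model would look something like:)

```sh
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama-2-13b-chat",
  "messages": [{"role": "user", "content": "How are you?"}],
  "temperature": 0.9
}'
```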
The model is being correctly loaded, but I get this error saying `Template failed loading` (see the end of the log above).

Expected behavior

The template file seems fine; it should work.