mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P inference
https://localai.io
MIT License

llama2-chat-message.tmpl not working? #1044

Open pchalasani opened 1 year ago

pchalasani commented 1 year ago

LocalAI version: v1.25.0-40-g5661740 (56617409903bde702699a736530053eb4146aec8)

Environment, CPU architecture, OS, and Version: MacOS M1 Max Pro

Describe the bug: llama-2-chat-message.tmpl not working

I roughly followed the instructions in the "Build on Mac" section: https://localai.io/basics/build/#build-on-mac

I copied llama-2-13b-chat.Q4_K_M.gguf from Hugging Face, placed it under /models/, and renamed it to llama-2-13b-chat. Then I copied prompt-templates/llama-chat-message.tmpl to models/llama-2-13b-chat.tmpl.

To Reproduce: Launch this in one terminal window:

./local-ai --models-path ./models/  --debug

Run this curl in another terminal:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-13b-chat",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'

The model is loaded correctly, but I get an error saying Template failed loading (see the end of the log below):

7:35PM DBG Request received:
7:35PM DBG Configuration read: &{PredictionOptions:{Model:llama-2-13b-chat Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
7:35PM DBG Parameters: &{PredictionOptions:{Model:llama-2-13b-chat Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
7:35PM DBG Prompt (before templating): How are you?
7:35PM DBG Template failed loading: template: prompt:1:8: executing "prompt" at <.RoleName>: can't evaluate field RoleName in type model.PromptTemplateData
7:35PM DBG Prompt (after templating): How are you?

Expected behavior: The template file seems fine; it should work.
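For context on the error itself: the chat-message templates are rendered with per-message fields such as .RoleName and .Content, while a template picked up by model name alone (models/<model>.tmpl) is executed as the generic prompt template against model.PromptTemplateData, which has no RoleName field. A fragment like the one below would therefore fail exactly as in the log above; this is an illustrative sketch, not necessarily the contents of the shipped llama2-chat-message.tmpl:

{{/* hypothetical per-message template fragment, for illustration only */}}
{{if eq .RoleName "user"}}[INST] {{.Content}} [/INST]{{else}}{{.Content}}{{end}}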

pchalasani commented 1 year ago

Note that I am not using any .yaml file, or at least not explicitly, i.e. in the /models/ dir there is no yaml file at all. So it is unclear what options/config the code assumes in the absence of a yaml file.

mtharrison commented 1 year ago

Sounds like it is not finding your template file. Maybe try using a model.yaml for your model with a chat_message template configured, like https://github.com/go-skynet/model-gallery/blob/main/llama2-chat.yaml#L19? I'm not a contributor here, just ran into some similar issues myself.
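For reference, a model YAML along those lines might look roughly like the sketch below. Field names follow the linked gallery config; the model and template file names are taken from this thread and would need to match the files actually present in models/ (the chat_message value refers to models/llama2-chat-message.tmpl, without the extension):

# models/llama-2-13b-chat.yaml -- sketch; adjust names and context size to your setup
name: llama-2-13b-chat
backend: llama
context_size: 4096
parameters:
  model: llama-2-13b-chat.Q4_K_M.gguf
template:
  chat_message: llama2-chat-message

With this in place, requests to /v1/chat/completions for model "llama-2-13b-chat" should render each message through models/llama2-chat-message.tmpl instead of treating it as the generic prompt template.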

pchalasani commented 1 year ago

Actually, I know it finds the tmpl file, because if I remove it the error message simply says it could not find a template file. But I will try with that yaml.

pchalasani commented 1 year ago

I have another question: in the context of llama2 models, suppose I write my own code to format a chat history using [INST], <<SYS>>, etc., and then use the /completion endpoint rather than /chat/completion; can I expect that to work? Or, to ask another way, can I assume that the completion endpoint is "raw" and sends inputs straight to the model, i.e. no further processing will be done by LocalAI?
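Judging from the debug output above, when no completion template applies the prompt passes through unchanged ("Prompt (after templating)" equals the input), so a hand-formatted llama2 prompt sent to the completions endpoint should reach the model as-is. A sketch of such a request, assuming the OpenAI-style /v1/completions route and the model name from this thread (note that if a models/<model>.tmpl completion template is present, it would still be applied to the prompt):

# illustrative request; the prompt string is just an example
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-13b-chat",
     "prompt": "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nHow are you? [/INST]",
     "temperature": 0.9
   }'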

maxiannunziata commented 1 year ago

I have two problems: the first is that I can't open the tmpl file, and the second is that the model gets stuck in a loop, producing responses indefinitely; the call to the backend never resolves.

How were you able to fix it, @pchalasani ?

` 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 349: blk.38.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 350: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 351: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 352: blk.39.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 353: blk.39.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 354: blk.39.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 355: blk.39.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 356: blk.39.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 357: blk.39.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 358: blk.39.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 362: output.weight q8_0 [ 5120, 32000, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 0: general.architecture str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 1: general.name str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 2: llama.context_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 3: llama.embedding_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 4: llama.block_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 7: llama.attention.head_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 10: general.file_type u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 18: general.quantization_version u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - type f32: 81 tensors 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - type q8_0: 282 tensors 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: format = GGUF V2 (latest) 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: arch = llama 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: vocab type = SPM 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_vocab = 32000 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_merges = 0 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ctx_train = 4096 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ctx = 1100 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_embd = 5120 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_head = 40 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_head_kv = 40 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_layer = 40 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_rot = 128 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_gqa = 1 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: f_norm_eps = 1.0e-05 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ff = 13824 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: freq_base = 10000.0 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: freq_scale = 1 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model type = 13B 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model ftype = mostly Q8_0 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model size = 13.02 B 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: general.name = LLaMA v2 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: BOS token = 1 '' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: EOS token = 2 '' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: UNK token = 0 '' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: LF token = 13 '<0x0A>' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_tensors: ggml ctx size = 13189.98 MB 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_tensors: mem required = 13189.98 MB (+ 1718.75 MB per state) 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr ................................................................................................... 
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_new_context_with_model: kv self size = 1718.75 MB 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_new_context_with_model: compute buffer total size = 117.41 MB [127.0.0.1]:53208 200 - GET /readyz [127.0.0.1]:41946 200 - GET /readyz 12:55AM DBG Request received: 12:55AM DBG Configuration read: &{PredictionOptions:{Model:coquito Language: N:0 TopP:0.5 TopK:80 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:coco F16:false Threads:14 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-chat-message Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1100 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}} 12:55AM DBG Parameters: &{PredictionOptions:{Model:coquito Language: N:0 TopP:0.5 TopK:80 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:coco F16:false Threads:14 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-chat-message Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. 
TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1100 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}} 12:55AM DBG templated message for chat: [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

 12:55AM DBG Prompt (before templating): [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

 12:55AM DBG Template failed loading: template: prompt:1:8: executing "prompt" at <.RoleName>: can't evaluate field RoleName in type model.PromptTemplateData 12:55AM DBG Prompt (after templating): [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

 12:55AM DBG Loading model llama from coquito 12:55AM DBG Model already loaded in memory: coquito [127.0.0.1]:46608 200 - GET /readyz `

pchalasani commented 1 year ago

@maxiannunziata I did not fix it. I gave up, as I did not find good documentation on using the yamls and templates, and instead went with another local LLM serving library: https://github.com/oobabooga/text-generation-webui

jamesbraza commented 11 months ago

@pchalasani any chance you can share your model YAML? Not the template file (posted in OP), but the model YAML.

In https://github.com/mudler/LocalAI/issues/1316, adjusting the template field solved it:

template:
-  chat_message: luna-chat-message
+  chat: luna-chat-message

ShuLaPy commented 11 months ago

@mudler I'm facing the same issue on a Mac M2. Any solution?

ShuLaPy commented 11 months ago

Oh my bad, after following this how-to I was able to solve the problem.

Each model needs at least 4 files; without these files, the model will run raw, which means you cannot change the model's settings.

This is the part I was missing.
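For anyone landing here later: the how-to referenced above pairs each model with a small bundle of files rather than a bare GGUF. A rough sketch of that layout, loosely using the filenames from this thread (the exact set of files and template contents depend on which how-to you follow):

models/
  llama-2-13b-chat.Q4_K_M.gguf     # model weights
  llama-2-13b-chat.yaml            # model config: name, backend, parameters, template names
  llama2-chat.tmpl                 # completion prompt template (name is illustrative)
  llama2-chat-message.tmpl         # per-message chat template referenced by template.chat_message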