mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed, P2P inference
https://localai.io
MIT License

llama2-chat-message.tmpl not working? #1044

Open pchalasani opened 1 year ago

pchalasani commented 1 year ago

LocalAI version: v1.25.0-40-g5661740 (56617409903bde702699a736530053eb4146aec8)

Environment, CPU architecture, OS, and Version: MacOS M1 Max Pro

Describe the bug: llama-2-chat-message.tmpl not working

I roughly followed the instructions in the "Build on Mac" section: https://localai.io/basics/build/#build-on-mac

I copied llama-2-13b-chat.Q4_K_M.gguf from Hugging Face, placed it under /models/, and renamed it to llama-2-13b-chat. Then I copied prompt-templates/llama-chat-message.tmpl to models/llama-2-13b-chat.tmpl.

To Reproduce: Launch this in one terminal window:

./local-ai --models-path ./models/  --debug

Run this curl in another terminal:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-13b-chat",
     "messages": [{"role": "user", "content": "How are you?"}],
     "temperature": 0.9
   }'

The model is loaded correctly, but I get an error saying Template failed loading (see the end of the log below):

7:35PM DBG Request received:
7:35PM DBG Configuration read: &{PredictionOptions:{Model:llama-2-13b-chat Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
7:35PM DBG Parameters: &{PredictionOptions:{Model:llama-2-13b-chat Language: N:0 TopP:0.7 TopK:80 Temperature:0.9 Maxtokens:512 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name: F16:false Threads:4 Debug:true Roles:map[] Embeddings:false Backend: TemplateConfig:{Chat: ChatMessage: Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:512 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}}
7:35PM DBG Prompt (before templating): How are you?
7:35PM DBG Template failed loading: template: prompt:1:8: executing "prompt" at <.RoleName>: can't evaluate field RoleName in type model.PromptTemplateData
7:35PM DBG Prompt (after templating): How are you?

Expected behavior: The template file seems fine; it should work.
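For context on the error itself: the chat-message templates are rendered with per-message fields such as .RoleName and .Content, while a template picked up by model name alone (models/<model>.tmpl) is executed as the generic prompt template against model.PromptTemplateData, which has no RoleName field. A fragment like the one below would therefore fail exactly as in the log above; this is an illustrative sketch, not necessarily the contents of the shipped llama2-chat-message.tmpl:

{{/* hypothetical per-message template fragment, for illustration only */}}
{{if eq .RoleName "user"}}[INST] {{.Content}} [/INST]{{else}}{{.Content}}{{end}}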

pchalasani commented 1 year ago

Note that I am not using any .yaml file, or at least not explicitly, i.e. in the /models/ dir there is no yaml file at all. So it is unclear what options/config the code assumes in the absence of a yaml file.

mtharrison commented 1 year ago

Sounds like it is not finding your template file. Maybe try using a model.yaml for your model with a chat_message template configured, like https://github.com/go-skynet/model-gallery/blob/main/llama2-chat.yaml#L19? I'm not a contributor here, just ran into some similar issues myself.
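For reference, a model YAML along those lines might look roughly like the sketch below. Field names follow the linked gallery config; the model and template file names are taken from this thread and would need to match the files actually present in models/ (the chat_message value refers to models/llama2-chat-message.tmpl, without the extension):

# models/llama-2-13b-chat.yaml -- sketch; adjust names and context size to your setup
name: llama-2-13b-chat
backend: llama
context_size: 4096
parameters:
  model: llama-2-13b-chat.Q4_K_M.gguf
template:
  chat_message: llama2-chat-message

With this in place, requests to /v1/chat/completions for model "llama-2-13b-chat" should render each message through models/llama2-chat-message.tmpl instead of treating it as the generic prompt template.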

pchalasani commented 1 year ago

Actually, I know it finds the tmpl file, because if I remove it the error message simply says it could not find a template file. But I will try with that yaml.

pchalasani commented 1 year ago

I have another question: in the context of llama2 models, suppose I write my own code to format a chat history using [INST], <<SYS>>, etc., and then use the /completion endpoint rather than /chat/completion; can I expect that to work? Or, to ask another way, can I assume that the completion endpoint is "raw" and sends inputs straight to the model, i.e. no further processing will be done by LocalAI?
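Judging from the debug output above, when no completion template applies the prompt passes through unchanged ("Prompt (after templating)" equals the input), so a hand-formatted llama2 prompt sent to the completions endpoint should reach the model as-is. A sketch of such a request, assuming the OpenAI-style /v1/completions route and the model name from this thread (note that if a models/<model>.tmpl completion template is present, it would still be applied to the prompt):

# illustrative request; the prompt string is just an example
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "llama-2-13b-chat",
     "prompt": "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nHow are you? [/INST]",
     "temperature": 0.9
   }'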

maxiannunziata commented 1 year ago

I have two problems: the first is that I can't open the tmpl file, and the second is that the model gets stuck in a loop, producing responses indefinitely; the call to the backend never resolves.

How were you able to fix it, @pchalasani ?

` 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 349: blk.38.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 350: blk.38.attn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 351: blk.38.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 352: blk.39.attn_q.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 353: blk.39.attn_k.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 354: blk.39.attn_v.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 355: blk.39.attn_output.weight q8_0 [ 5120, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 356: blk.39.ffn_gate.weight q8_0 [ 5120, 13824, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 357: blk.39.ffn_up.weight q8_0 [ 5120, 13824, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 358: blk.39.ffn_down.weight q8_0 [ 13824, 5120, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 359: blk.39.attn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 360: blk.39.ffn_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 361: output_norm.weight f32 [ 5120, 1, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - tensor 362: output.weight q8_0 [ 5120, 32000, 1, 1 ] 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 0: general.architecture str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 1: general.name str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 2: llama.context_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 3: llama.embedding_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 4: llama.block_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 5: llama.feed_forward_length u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 6: llama.rope.dimension_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 7: llama.attention.head_count u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 8: llama.attention.head_count_kv u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 10: general.file_type u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 11: tokenizer.ggml.model str
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 13: tokenizer.ggml.scores arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - kv 18: general.quantization_version u32
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - type f32: 81 tensors 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_model_loader: - type q8_0: 282 tensors 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: format = GGUF V2 (latest) 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: arch = llama 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: vocab type = SPM 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_vocab = 32000 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_merges = 0 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ctx_train = 4096 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ctx = 1100 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_embd = 5120 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_head = 40 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_head_kv = 40 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_layer = 40 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_rot = 128 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_gqa = 1 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: f_norm_eps = 1.0e-05 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: f_norm_rms_eps = 1.0e-05 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: n_ff = 13824 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: freq_base = 10000.0 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: freq_scale = 1 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model type = 13B 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model ftype = mostly Q8_0 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: model size = 13.02 B 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: general.name = LLaMA v2 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: BOS token = 1 '' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: EOS token = 2 '' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: UNK token = 0 '' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_print_meta: LF token = 13 '<0x0A>' 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_tensors: ggml ctx size = 13189.98 MB 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llm_load_tensors: mem required = 13189.98 MB (+ 1718.75 MB per state) 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr ................................................................................................... 
12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_new_context_with_model: kv self size = 1718.75 MB 12:53AM DBG GRPC(coquito-127.0.0.1:36839): stderr llama_new_context_with_model: compute buffer total size = 117.41 MB [127.0.0.1]:53208 200 - GET /readyz [127.0.0.1]:41946 200 - GET /readyz 12:55AM DBG Request received: 12:55AM DBG Configuration read: &{PredictionOptions:{Model:coquito Language: N:0 TopP:0.5 TopK:80 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:coco F16:false Threads:14 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-chat-message Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1100 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}} 12:55AM DBG Parameters: &{PredictionOptions:{Model:coquito Language: N:0 TopP:0.5 TopK:80 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:coco F16:false Threads:14 Debug:true Roles:map[] Embeddings:false Backend:llama TemplateConfig:{Chat: ChatMessage:llama2-chat-message Completion: Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName:} FeatureFlag:map[] LLMConfig:{SystemPrompt:You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. 
TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:0 MMap:false MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] ContextSize:1100 NUMA:false LoraAdapter: LoraBase: NoMulMatQ:false DraftModel: NDraft:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{PipelineType: SchedulerType: CUDA:false EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:}} 12:55AM DBG templated message for chat: [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

 12:55AM DBG Prompt (before templating): [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

 12:55AM DBG Template failed loading: template: prompt:1:8: executing "prompt" at <.RoleName>: can't evaluate field RoleName in type model.PromptTemplateData 12:55AM DBG Prompt (after templating): [INST] You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

 12:55AM DBG Loading model llama from coquito 12:55AM DBG Model already loaded in memory: coquito [127.0.0.1]:46608 200 - GET /readyz `

pchalasani commented 1 year ago

@maxiannunziata I did not fix it. I gave up, as I did not find good documentation on using the yamls and templates, and instead went with another local LLM serving library: https://github.com/oobabooga/text-generation-webui

jamesbraza commented 11 months ago

@pchalasani any chance you can share your model YAML? Not the template file (posted in OP), but the model YAML.

In https://github.com/mudler/LocalAI/issues/1316, adjusting the template field solved it:

template:
-  chat_message: luna-chat-message
+  chat: luna-chat-message

ShuLaPy commented 11 months ago

@mudler I'm facing the same issue on a Mac M2. Any solution?

ShuLaPy commented 11 months ago

Oh my bad, after following this how-to I was able to solve the problem.

Each model needs at least 4 files; without these files, the model will run raw, which means you cannot change the model's settings.

This is the part I was missing.
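For anyone landing here later: the how-to referenced above pairs each model with a small bundle of files rather than a bare GGUF. A rough sketch of that layout, loosely using the filenames from this thread (the exact set of files and template contents depend on which how-to you follow):

models/
  llama-2-13b-chat.Q4_K_M.gguf     # model weights
  llama-2-13b-chat.yaml            # model config: name, backend, parameters, template names
  llama2-chat.tmpl                 # completion prompt template (name is illustrative)
  llama2-chat-message.tmpl         # per-message chat template referenced by template.chat_message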