Mistakes in template definitions on models available to download from https://ollama.ai

jukofyork commented 8 months ago

Hi,

Some of the mistakes in the TEMPLATE definitions for the models you can download from https://ollama.ai are hurting the models to varying degrees. I only found this by accident when experimenting with the API to use some of the code completion / code editing prompts used by the continue project (https://github.com/continuedev/continue/tree/main/core/llm/templates).

I've sourced all of these primarily by looking at the original tokenizer config and, failing that, looking through the official descriptions and/or their respective official GitHub discussions. I've concentrated on the original/official models (other than phind-codellama) as it's hard to find any concrete info on a lot of the "bootleg" fine-tuned models.

The ones which are particularly affected are:

codellama: the missing space before the response severely hurts performance when the model is presented with a large section of code. There are a lot of 'cargo cult' prompt templates for codellama going around, but this one can be confirmed from their official release page and the tokenizer config.
deepseek-llm: having the system message prepended to every message seems to increase the chance of it responding in Chinese Unicode characters (DeepSeek say specifically that it wasn't trained to use a system message).
deepseek-coder: it quickly fills its context when discussing large sections of code and will start to repeat the system message back at you before completely descending into gibberish (this happens very quickly if using a detailed / long custom system message).

llama2 doesn't seem too affected by the missing space before the response, but again this template can be confirmed from their official release page and the tokenizer config.

deepseek-llm, mixtral and mistral absolutely should NOT have a space or newline before the response or they will often respond with gibberish and/or Chinese Unicode characters.

The official mixtral huggingface page actually tells you a slightly wrong template format, but the original tokenizer config is the same as mistral.

The suggestion for adding "Response" to phind-codellama is from the huggingface discussion, so I can't confirm whether it is correct or not.
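Since several of the problems above come down to a single space or newline in front of the response, a quick check can help when comparing templates. The following is only a sketch: `gapBeforeResponse` is a hypothetical helper (not part of Ollama), and it assumes the response slot is spelled exactly `{{ .Response }}` as in the templates in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// Two of the templates posted in this issue, written as Go string literals.
const (
	codellamaTemplate = "<s>[INST] {{ if and .First .System }}<<SYS>>\n{{ .System }}\n<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] {{ .Response }}"
	mistralTemplate   = "{{ if .First }}<s>{{ end }}[INST] {{ if and .First .System }}{{ .System }} {{ end }}{{ .Prompt }} [/INST]{{ .Response }}"
)

// gapBeforeResponse returns the whitespace that sits directly in front of
// {{ .Response }}. The boolean is false when the template has no response
// slot with that exact spelling (an assumption of this sketch).
func gapBeforeResponse(tmpl string) (string, bool) {
	idx := strings.Index(tmpl, "{{ .Response }}")
	if idx < 0 {
		return "", false
	}
	prefix := tmpl[:idx]
	trimmed := strings.TrimRight(prefix, " \t\n")
	return prefix[len(trimmed):], true
}

func main() {
	templates := map[string]string{
		"codellama": codellamaTemplate,
		"mistral":   mistralTemplate,
	}
	for name, tmpl := range templates {
		gap, ok := gapBeforeResponse(tmpl)
		// %q makes spaces and newlines visible in the output.
		fmt.Printf("%s: found=%v gap=%q\n", name, ok, gap)
	}
}
```

Printing the gap with `%q` makes it easy to see whether a template ends in a space, a newline, or nothing before the model starts generating.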

codellama:34b-instruct:

TEMPLATE """<s>[INST] {{ if and .First .System }}<<SYS>>
{{ .System }}
<</SYS>>

{{ end }}{{ .Prompt }} [/INST] {{ .Response }}"""

deepseek-coder:33b-instruct:

TEMPLATE """{{ if and .First .System }}{{ .System }}
{{ end }}### Instruction:
{{ .Prompt }}
### Response:
{{ .Response }}"""

deepseek-llm:67b-chat:

TEMPLATE """User: {{ if and .First .System }}{{ .System }} {{ end }}{{ .Prompt }}

Assistant:{{ .Response }}"""

llama2:70b-chat:

TEMPLATE """<s>[INST] {{ if and .First .System }}<<SYS>>
{{ .System }}
<</SYS>>

{{ end }}{{ .Prompt }} [/INST] {{ .Response }}"""

mixtral:8x7b-instruct-v0.1 & mistral:7b-instruct-v0.2:

TEMPLATE """{{ if .First }}<s>{{ end }}[INST] {{ if and .First .System }}{{ .System }} {{ end }}{{ .Prompt }} [/INST]{{ .Response }}"""

phind-codellama:34b-v2:

TEMPLATE """{{ if and .First .System }}### System Prompt
{{ .System }}

{{ end }}### User Message
{{ .Prompt }}

### Assistant Response
{{ .Response }}"""

yi:34b-chat:

TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

These two aren't listed on https://ollama.ai but also use the same "ChatML" template as yi:

mpt:30B-chat:

TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

qwen:72b-chat:

TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

Are there any other "non-bootleg" models I should look at? I might as well do them too if there are any.

jukofyork commented 8 months ago

By the way if anybody else wants to learn more about the template syntax then this is the reference page:

https://pkg.go.dev/text/template

I was pretty confused to start with when I tried to grep the whole project and could find no reference to "if" or "and" anywhere!
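For anyone who wants to see what a template expands to, it can be rendered directly with Go's `text/template` package. This is a minimal sketch: the `turn` struct and `render` function are hypothetical stand-ins for whatever Ollama passes internally, and only the field names `.First`, `.System`, `.Prompt` and `.Response` are taken from the templates above.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// turn mirrors the variables used by the templates in this thread.
// The struct itself is hypothetical; it is not Ollama's internal type.
type turn struct {
	First    bool
	System   string
	Prompt   string
	Response string
}

// The llama2 template from the first comment, as a Go string literal.
const llama2Template = "<s>[INST] {{ if and .First .System }}<<SYS>>\n{{ .System }}\n<</SYS>>\n\n{{ end }}{{ .Prompt }} [/INST] {{ .Response }}"

// render executes a template string against one turn and returns the text.
func render(tmpl string, t turn) (string, error) {
	parsed, err := template.New("prompt").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := parsed.Execute(&sb, t); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	first := turn{First: true, System: "Be brief.", Prompt: "Hi"}
	later := turn{First: false, System: "Be brief.", Prompt: "Hi"}
	for _, t := range []turn{first, later} {
		out, err := render(llama2Template, t)
		if err != nil {
			panic(err)
		}
		// %q shows the exact spaces and newlines the model would see.
		fmt.Printf("%q\n", out)
	}
}
```

`and` returns its first empty argument (or the last one), so `{{ if and .First .System }}` only emits the system block when `.First` is true and `.System` is a non-empty string.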

scpedicini commented 8 months ago

I think being able to see the final transformed input -> template -> output chain in the logs would help catch these kinds of issues. Linking this enhancement request:

https://github.com/jmorganca/ollama/issues/1533

jukofyork commented 8 months ago

I think a lot of the other models, even if concrete template formats can't be sourced, should probably have their templates changed to use the {{ if and .First .System }}...{{ .System }}...{{ end }} statement.

As it is, the system message often gets added to every message. This might sometimes be a good idea if you don't want to lose the system message, but it shouldn't be the default, and particular care should be taken over where the system message is placed if it is intentionally included each time.

jmorganca commented 8 months ago

Thank you so much for the work to go through all of the templates @jukofyork (both in the models on ollama.ai but also in their respective repos on HF and GitHub). Will get this fixed

jukofyork commented 8 months ago

> Thank you so much for the work to go through all of the templates @jukofyork (both in the models on ollama.ai but also in their respective repos on HF and GitHub). Will get this fixed

No problem and if there are any other original/official models you know of then I can try to find the correct prompt for them too.

I don't think it's really possible to find the prompt format for a lot of the fine-tuned models though. Most seem to be trained on a mix of several different/merged datasets and I don't think even the creators know the correct format sometimes.

nathanpbell commented 8 months ago

I've noticed a couple other errors in the models available from the library:

1. `mistral` models have numCtx defaulting to 2048 instead of 4096 (actually 32768 is probably the correct value). I can't tell fully, but I think Ollama is truncating down to numCtx before loading the prompt into the model?
2. `mistrallite`'s tokenizer appears broken. Mistrallite is a long-context fine-tune of Mistral from the Amazon team; its prompt format is different from Mistral's and introduces 3 new tokens. When passing the prompt through api/generate, it doesn't appear that those new strings are being properly parsed into the new token values. Full disclosure: I'm new to this and I'm using Mistrallite through LangChain -> Ollama, so the bug may be somewhere in between. Forgive me if my hunch is wrong that this is a bug in the model uploaded to the Ollama library.

jukofyork commented 8 months ago

> I've noticed a couple other errors in the models available from the library:
>
> 1. `mistral` models have numCtx defaulting to 2048 instead of 4096 (actually 32768 is probably the correct value). I can't tell fully, but I think Ollama is truncating down to numCtx before loading the prompt into the model?

Yeah, I'm still none the wiser as to what the Mistral and Mixtral models' context actually is. The official pages say they were both trained on 8k context. But then other info says it's 32k. Then yet more info says Mistral uses a sliding window and is really just 8k (or even 4k), and that Mixtral was trained to use 32k straight off and the sliding window for it was a bug on release.

nathanpbell commented 8 months ago

I believe the right value is 32K. The sliding window is 4K, which affects performance of prompts that are outside that window, but as far as I can tell, we shouldn't be truncating anything less than 32K before passing it to the model. But that's my novice understanding.

Anecdotally, I've tested the model's ability to recall text in long contexts using the default settings in "ollama pull mistral" and it can't remember anything past 2K. When I modify the call to use an 8K context window it is able to recall tokens outside of the 2K window that seems to be the ollama default.

I think the fix is that the Modelfile for mistral and its variants should specify a num_ctx of 32K.

jukofyork commented 8 months ago

> I believe the right value is 32K. The sliding window is 4K, which affects performance of prompts that are outside that window, but as far as I can tell, we shouldn't be truncating anything less than 32K before passing it to the model. But that's my novice understanding.

Is this for Mistral or Mixtral? I only ask because a lot of people on the SillyTavern subreddit report that Mistral runs into problems around 8k context (or possibly even 6.5k IIRC?).

nathanpbell commented 8 months ago

The original Mistral (7B and its variants including instruct-v0.1, v0.2, etc.).

The way the sliding window works: its best performance is within the first 4K, and you'll see degradation beyond that window. Performance should trail off the longer the context gets (in increments of 4K) all the way to 32K, beyond which it will stop "remembering" anything.

My experience with Mistral in Ollama using the default Modelfile is that rather than the gradual performance degradation you'd expect after 4k, it actually is only sending 2K of tokens and has a steep cliff drop off in performance (it can't remember anything after 2k). Passing in a num_ctx > 2K at runtime fixes that.
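As a rough illustration of passing `num_ctx` at runtime, the sketch below builds a request body for Ollama's `/api/generate` endpoint with the value set under `options`. The `generateRequest` struct and `buildRequest` helper are hypothetical and cover only the fields needed here; the value 8192 is just an example, and actually sending the request to a running server is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest covers only the fields this sketch needs from the
// /api/generate request body; it is not Ollama's own type.
type generateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Options map[string]int `json:"options,omitempty"`
}

// buildRequest returns the JSON body with num_ctx overridden for this call.
func buildRequest(model, prompt string, numCtx int) ([]byte, error) {
	req := generateRequest{
		Model:   model,
		Prompt:  prompt,
		Options: map[string]int{"num_ctx": numCtx},
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildRequest("mistral", "Repeat the first sentence of the text above.", 8192)
	if err != nil {
		panic(err)
	}
	// This body would be POSTed to the server's /api/generate endpoint.
	fmt.Println(string(body))
}
```

The same setting can be baked into a custom model with a `PARAMETER num_ctx` line in the Modelfile, which avoids having to pass it on every call.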

I propose that should be the default in the Modelfile, but I don't think the Ollama model library is in a github repo anywhere that we can generate pull requests. Please correct me if I'm wrong.

jukofyork commented 8 months ago

Ah, thanks. I'm actually just running everything but the coding models at 4k context for now as the num_batch bug makes it too fiddly to find the right value.

nathanpbell commented 8 months ago

I should add one other thing, it sounds like Mistral's sliding window attention (SWA) is not actually implemented in llama.cpp (which Ollama uses) and so almost assuredly doesn't work the way described in their paper. But it does "work" in that it can generate coherent responses.

Llama.cpp discussion: https://github.com/ggerganov/llama.cpp/issues/3867#issuecomment-1787815958

cognitivetech commented 7 months ago

In fact, according to the Mistral paper, it's trained on 8k context:

| Parameter | Value |
| --- | --- |
| dim | 4096 |
| n_layers | 32 |
| head_dim | 128 |
| hidden_dim | 14336 |
| n_heads | 32 |
| n_kv_heads | 8 |
| window_size | 4096 |
| context_len | 8192 |
| vocab_size | 32000 |

The 32k context was a misinterpretation from the beginning. See more info in this discussion: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/discussions/43

jukofyork commented 7 months ago

I spent all afternoon running different experiments and am actually shocked at how much finding the proper prompt has improved all 3 models:

It's made Mistral about as good as the other 2 were before, and the other 2 are now MUCH better, with all the weirdness (i.e. where they claimed to make changes to code when they didn't, etc.) gone now.

I've marked the spaces with '■' so they stand out, but you will need to change them back to real spaces. Also remember that if you aren't using Ollama or llama.cpp you might need to add back the <s> prefix:

Mistral and Miqu:

TEMPLATE """{{ if and .First .System }}[INST]■{{ .System }}

Please await further instructions and simply respond with 'Understood'.■[/INST]
Understood</s>■
{{ end }}[INST]■{{ .Prompt }}■[/INST]
{{ .Response }}"""

This agrees with the example on the Mistral page:

text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

Mixtral:

TEMPLATE """{{ if and .First .System }}■[INST]■{{ .System }}

Please await further instructions and simply respond with 'Understood'.■[/INST]■
Understood</s>
{{ end }}■[INST]■{{ .Prompt }}■[/INST]■
{{ .Response }}"""

This sort of agrees with the example on the Mixtral page:

<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]

https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

But it seems using the newlines before the response like the Mistral example is essential.

jukofyork commented 7 months ago

I actually got both miqu and phind-codellama to give up their real training prompts. Explanation here:

https://huggingface.co/miqudev/miqu-1-70b/discussions/25

TEMPLATE """{{ if and .First .System }}{{ .System }}

{{ end }}[INST] {{ .Prompt }}
[/INST]{{ .Response }}"""

https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/discussions/31

TEMPLATE """{{ if and .First .System }}{{ .System }}

{{ end }}### Instruction:
{{ .Prompt }}

### Response:
{{ .Response }}"""

miqu is MUCH better with the correct prompt; like unbelievably better!!! :scream:

cognitivetech commented 6 months ago

May as well throw my two cents into the mix. I have tested a lot of things, but this works really well for mistral models:

TEMPLATE """
{{ if .First  }}<s>{{ if .System  }}[INST]{{ .System }}[/INST]{{ end }}</s>{{ end }}[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000

Unless you need a special personality, don't use a system prompt; it works better without one.

Even if you don't have few-shot prompt or chat history, still include the <s></s>

jukofyork commented 5 months ago

wizardlm2

{{ if .System }}{{ .System }} {{ end }}{{ if .Prompt }}USER: {{ .Prompt }} {{ end }}ASSISTANT: {{ .Response }}

Pretty sure this shouldn't have that extra space between ASSISTANT: and {{ .Response }}.

command-r-plus

{{ if .System }}<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ .System }}<|END_OF_TURN_TOKEN|>{{ end }}{{ if .Prompt }}<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ .Prompt }}<|END_OF_TURN_TOKEN|>{{ end }}<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ .Response }}<|END_OF_TURN_TOKEN|>

Shouldn't have the <|END_OF_TURN_TOKEN|> part after the response as it's in the GGUF file as an EOS token already.

command-r has the same problem in its template too (as was pointed out in another thread linked above).

ollama / ollama

Mistakes in template definitions on models available to download from https://ollama.ai #1977