Open jukofyork opened 8 months ago
By the way if anybody else wants to learn more about the template syntax then this is the reference page:
https://pkg.go.dev/text/template
I was pretty confused to start with when I tried to grep the whole project and could find no reference to "if" or "and" anywhere!
I think being able to see the final transformed input -> template -> output chain in the logs would help catch these kinds of issues - linking this enhancement feature:
I think a lot of the other models, even if concrete template formats can't be sourced, should probably have their templates changed to use the `{{ if and .First .System }}...{{ .System }}...{{ end }}` statement.
As it is, the system message is often getting added to every message. This might sometimes be a good idea if you don't want to lose the system message, but it shouldn't happen by default, and particular care should be taken over where the system message is placed if it is intentionally included each time.
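To make the difference concrete, here is a minimal sketch using Go's `text/template` directly. The `turn` struct and the two template strings are hypothetical stand-ins for illustration (they are not Ollama's internal types or any model's official template); only the `.First`, `.System` and `.Prompt` names come from the templates in this thread.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// turn holds the variables referenced by the templates in this thread.
// The struct itself is a hypothetical stand-in, not Ollama's internal type.
type turn struct {
	First  bool
	System string
	Prompt string
}

// render executes a template string against one turn and returns the result.
func render(tmplText string, t turn) string {
	tmpl := template.Must(template.New("prompt").Parse(tmplText))
	var sb strings.Builder
	if err := tmpl.Execute(&sb, t); err != nil {
		panic(err)
	}
	return sb.String()
}

const (
	// everyTurn emits the system message whenever one is set.
	everyTurn = "{{ if .System }}{{ .System }}\n{{ end }}[INST] {{ .Prompt }} [/INST]"
	// firstOnly emits the system message only on the first turn.
	firstOnly = "{{ if and .First .System }}{{ .System }}\n{{ end }}[INST] {{ .Prompt }} [/INST]"
)

func main() {
	turns := []turn{
		{First: true, System: "Be brief.", Prompt: "Hi"},
		{First: false, System: "Be brief.", Prompt: "And again?"},
	}
	for _, t := range turns {
		// %q makes newlines and spaces visible in the output.
		fmt.Printf("every turn: %q\n", render(everyTurn, t))
		fmt.Printf("first only: %q\n", render(firstOnly, t))
	}
}
```

Printing with `%q` is deliberate: most of the problems discussed here come down to a stray space or newline, which is hard to spot in plain output.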
Thank you so much for the work to go through all of the templates @jukofyork (both in the models on ollama.ai but also in their respective repos on HF and GitHub). Will get this fixed
No problem and if there are any other original/official models you know of then I can try to find the correct prompt for them too.
I don't think it's really possible to find the prompt format for a lot of the fine-tuned models though. Most seem to be trained on a mix of several different/merged datasets and I don't think even the creators know the correct format sometimes.
I've noticed a couple of other errors in the models available from the library:
1. `mistral` models have numCtx defaulting to 2048 instead of 4096 (actually 32768 is probably the correct value). I can't tell fully, but I think Ollama is truncating down to numCtx before loading the prompt into the model?
2. `mistrallite`'s tokenizer appears broken. Mistrallite is a long-context fine-tune of Mistral from the Amazon team, and its prompt format is different from Mistral's and introduces 3 new tokens. When passing the prompt through api/generate, it doesn't appear that those new strings are being properly parsed into the new token values. Full disclosure: I'm new to this and I'm using Mistrallite through LangChain -> Ollama, so the bug may be somewhere in between; forgive me if my hunch is wrong that this is a bug in the model uploaded to the Ollama library.
Yeah, I'm still none the wiser as to what the Mistral and Mixtral models' context actually is. The official pages say they were both trained on 8k context. But then other info says it's 32k. Then yet more info says Mistral uses a sliding window and is really just 8k (or even 4k), and Mixtral was trained to use 32k straight off and the sliding window for it was a bug on release.
I believe the right value is 32K. The sliding window is 4K, which affects performance on prompts that fall outside that window, but as far as I can tell, we shouldn't be truncating anything less than 32K before passing it to the model. But that's my novice understanding.
Anecdotally, I've tested the model's ability to recall text in long contexts using the default settings in "ollama pull mistral" and it can't remember anything past 2K. When I modify the call to use an 8K context window it is able to recall tokens outside of the 2K window that seems to be the ollama default.
I think the fix is that the Modelfile for mistral and its variants should specify a num_ctx of 32K
Is this for `Mistral` or `Mixtral`? I only ask because a lot of people on the SillyTavern reddit report that `Mistral` runs into problems around 8k context (or possibly even 6.5k IIRC?).
The original Mistral (7B and its variants including instruct-v0.1, v0.2, etc.).
The way the sliding window works - you'll see degradation after the 4K sliding window (so its best performance is within the first 4K), but that performance should trail off the longer the context gets (in increments of 4K) all the way to 32K, beyond which it will stop "remembering" anything.
My experience with Mistral in Ollama using the default Modelfile is that rather than the gradual performance degradation you'd expect after 4k, it actually is only sending 2K of tokens and has a steep cliff drop off in performance (it can't remember anything after 2k). Passing in a num_ctx > 2K at runtime fixes that.
I propose that should be the default in the Modelfile, but I don't think the Ollama model library is in a GitHub repo anywhere that we can open pull requests against. Please correct me if I'm wrong.
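For anyone wanting to try the runtime override, here is a rough sketch of building the JSON body for Ollama's `/api/generate` endpoint with `num_ctx` set under `options`. The `generateRequest` struct and `buildRequest` helper are hypothetical names covering only the fields used here, and the 8192 value is just the one from the recall test above, not a recommendation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest mirrors only the subset of the /api/generate JSON body used here.
// The type name is hypothetical; the JSON field names follow the Ollama API.
type generateRequest struct {
	Model   string         `json:"model"`
	Prompt  string         `json:"prompt"`
	Stream  bool           `json:"stream"`
	Options map[string]any `json:"options,omitempty"`
}

// buildRequest returns the JSON body for a generate call with an explicit context size.
func buildRequest(model, prompt string, numCtx int) ([]byte, error) {
	req := generateRequest{
		Model:   model,
		Prompt:  prompt,
		Options: map[string]any{"num_ctx": numCtx},
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildRequest("mistral", "Summarise the text above.", 8192)
	if err != nil {
		panic(err)
	}
	// This body would be POSTed to the local server,
	// assumed here to be at http://localhost:11434/api/generate.
	fmt.Println(string(body))
}
```

The sketch stops short of sending the request so it can be run without a server; the body can be passed to `http.Post` or `curl -d` as-is.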
Ah, thanks. I'm actually just running everything but the coding models at 4k context for now as the `num_batch` bug makes it too fiddly to find the right value.
I should add one other thing, it sounds like Mistral's sliding window attention (SWA) is not actually implemented in llama.cpp (which Ollama uses) and so almost assuredly doesn't work the way described in their paper. But it does "work" in that it can generate coherent responses.
Llama.cpp discussion: https://github.com/ggerganov/llama.cpp/issues/3867#issuecomment-1787815958
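For readers unfamiliar with the idea, the sliding-window rule being discussed can be sketched as a simple visibility check: each position attends only to itself and the most recent positions inside the window. This is a toy illustration with a window of 4 rather than 4096, and the exact boundary convention (window includes the current token) is an assumption; it is not llama.cpp's or Mistral's code.

```go
package main

import "fmt"

// visible reports whether query position i may attend to key position j
// under causal sliding-window attention with the given window size.
// Assumed convention: the window includes the current token.
func visible(i, j, window int) bool {
	return j <= i && j > i-window
}

// oldestVisible returns the earliest position that position i can attend to directly.
func oldestVisible(i, window int) int {
	if i-window+1 < 0 {
		return 0
	}
	return i - window + 1
}

func main() {
	const window = 4 // toy value; the thread quotes 4096 for Mistral
	for i := 0; i < 8; i++ {
		row := ""
		for j := 0; j < 8; j++ {
			if visible(i, j, window) {
				row += "x"
			} else {
				row += "."
			}
		}
		fmt.Printf("pos %d: %s (oldest visible: %d)\n", i, row, oldestVisible(i, window))
	}
}
```

Without this mask, every position simply attends to all earlier positions, which is consistent with the point above that the model still produces coherent output when the window is not implemented.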
in fact, according to the mistral paper it's trained on 8k context
Parameter | Value |
---|---|
dim | 4096 |
n_layers | 32 |
head_dim | 128 |
hidden_dim | 14336 |
n_heads | 32 |
n_kv_heads | 8 |
window_size | 4096 |
context_len | 8192 |
vocab_size | 32000 |
the 32k context was a misinterpretation from the beginning. see more info in this discussion: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/discussions/43
I spent all afternoon running different experiments and am actually shocked at how much finding the proper prompt has improved all 3 models:
It's made Mistral about as good as the other 2 were before, and the other 2 are now MUCH better; with all the weirdness (ie: where they claimed to make changes to code when they didn't etc) gone now.
I've marked the spaces with '■' so they stand out, but you will need to change them back to real spaces. Also remember if you aren't using Ollama or llama.cpp you might need to add back the `<s>` prefix:
`Mistral` and `Miqu`:
TEMPLATE """{{ if and .First .System }}[INST]■{{ .System }}
Please await further instructions and simply respond with 'Understood'.■[/INST]
Understood</s>■
{{ end }}[INST]■{{ .Prompt }}■[/INST]
{{ .Response }}"""
This agrees with the example on the Mistral page:
text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
`Mixtral`:
TEMPLATE """{{ if and .First .System }}■[INST]■{{ .System }}
Please await further instructions and simply respond with 'Understood'.■[/INST]■
Understood</s>
{{ end }}■[INST]■{{ .Prompt }}■[/INST]■
{{ .Response }}"""
This sort of agrees with the example on the Mixtral page:
<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
But it seems using the newlines before the response like the Mistral example is essential.
I actually got both `miqu` and `phind-codellama` to give up their real training prompts. Explanation here:
https://huggingface.co/miqudev/miqu-1-70b/discussions/25
TEMPLATE """{{ if and .First .System }}{{ .System }}
{{ end }}[INST] {{ .Prompt }}
[/INST]{{ .Response }}"""
https://huggingface.co/Phind/Phind-CodeLlama-34B-v2/discussions/31
TEMPLATE """{{ if and .First .System }}{{ .System }}
{{ end }}### Instruction:
{{ .Prompt }}
### Response:
{{ .Response }}"""
`miqu` is MUCH better with the correct prompt; like unbelievably better!!! :scream:
may as well throw my two cents in the mix.. I have tested a lot of things, but this works really well for mistral models:
TEMPLATE """
{{ if .First }}<s>{{ if .System }}[INST]{{ .System }}[/INST]{{ end }}</s>{{ end }}[INST] {{ .Prompt }} [/INST]
"""
PARAMETER num_ctx 8000
PARAMETER num_gpu -1
PARAMETER num_predict 4000
Unless you need a special personality, don't use a system prompt; it works better without one.
Even if you don't have a few-shot prompt or chat history, still include the `<s></s>`.
{{ if .System }}{{ .System }} {{ end }}{{ if .Prompt }}USER: {{ .Prompt }} {{ end }}ASSISTANT: {{ .Response }}
Pretty sure this shouldn't have that extra space between `ASSISTANT:` and `{{ .Response }}`.
{{ if .System }}<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{{ .System }}<|END_OF_TURN_TOKEN|>{{ end }}{{ if .Prompt }}<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{{ .Prompt }}<|END_OF_TURN_TOKEN|>{{ end }}<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>{{ .Response }}<|END_OF_TURN_TOKEN|>
Shouldn't have the `<|END_OF_TURN_TOKEN|>` part after the response as it's in the GGUF file as an `EOS` token already.
command-r has the same problem in its template too (as was pointed out in another thread linked above).
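Both of these problems (whitespace just before the response, and extra text after it) can be spotted mechanically. Below is a rough sketch of a checker; `aroundResponse` is a hypothetical helper, it only matches the literal spelling `{{ .Response }}`, and it reports what it finds rather than deciding whether it is wrong, since that depends on the model.

```go
package main

import (
	"fmt"
	"strings"
)

// responseAction is the literal action searched for; other spellings
// (for example with trim markers) are not handled in this sketch.
const responseAction = "{{ .Response }}"

// aroundResponse inspects the last "{{ .Response }}" in a template string.
// It reports whether a space or newline sits directly before it, and returns
// whatever text follows it. ok is false when the action is not present.
func aroundResponse(tmpl string) (spaceBefore bool, after string, ok bool) {
	idx := strings.LastIndex(tmpl, responseAction)
	if idx < 0 {
		return false, "", false
	}
	before := tmpl[:idx]
	spaceBefore = strings.HasSuffix(before, " ") || strings.HasSuffix(before, "\n")
	after = tmpl[idx+len(responseAction):]
	return spaceBefore, after, true
}

func main() {
	// Shortened tails of the two templates quoted above.
	templates := []string{
		"USER: {{ .Prompt }} ASSISTANT: {{ .Response }}",
		"<|CHATBOT_TOKEN|>{{ .Response }}<|END_OF_TURN_TOKEN|>",
	}
	for _, t := range templates {
		spaceBefore, after, ok := aroundResponse(t)
		fmt.Printf("found=%v whitespaceBefore=%v after=%q\n", ok, spaceBefore, after)
	}
}
```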
Hi,
Some of the mistakes in the `TEMPLATE` definitions for the models you can download from https://ollama.ai are hurting the models to varying degrees. I only found this by accident when experimenting with the API to use some of the code completion / code editing prompts used by the continue project (https://github.com/continuedev/continue/tree/main/core/llm/templates).
I've sourced all these primarily by looking at the original tokenizer config and, failing that, looking through the official descriptions and/or their respective official GitHub discussions. I've concentrated on the original/official models (other than `phind-codellama`) as it's hard to find any concrete info on a lot of the "bootleg" fine-tuned models.
The ones which are particularly affected are:
- `codellama`: missing the space before the response severely hurts the performance when presented with a large section of code. There are a lot of 'cargo cult' prompt templates for `codellama` going around, but this one can be confirmed from their official release page and the tokenizer config.
- `deepseek-llm`: having the system message prepended to every message seems to increase the chance of responding in Chinese Unicode characters (Deepseek say specifically it wasn't trained to use a system message).
- `deepseek-coder`: quickly fills its context when discussing large sections of code and will start to repeat the system message back at you before completely descending into gibberish (this happens very quickly if using a detailed / long custom system message).
- `llama2`: doesn't seem too affected by the missing space before the response, but again this template can be confirmed from their official release page and the tokenizer config.
- `deepseek-llm`, `mixtral` and `mistral` absolutely should NOT have a space or newline before the response or they will often respond with gibberish and/or Chinese Unicode characters.
- The official `mixtral` huggingface page actually tells you a slightly wrong template format, but the original tokenizer config is the same as `mistral`.
- The suggestion for adding "Response" to `phind-codellama` is from the huggingface discussion, so I can't confirm if this is true or not.
codellama:34b-instruct:
deepseek-coder:33b-instruct:
deepseek-llm:67b-chat:
llama2:70b-chat:
mixtral:8x7b-instruct-v0.1 & mistral:7b-instruct-v0.2:
phind-codellama:34b-v2:
yi:34b-chat:
These two aren't listed on https://ollama.ai but also use the same "ChatML" template as `yi`:
mpt:30B-chat:
qwen:72b-chat:
Are there any other "non-bootleg" models I should look at? I might as well do them too if there are any.