mistralai / mistral-common

Apache License 2.0
467 stars 36 forks source link

A missing space between <s> and [INST] #3

Open keskival opened 3 months ago

keskival commented 3 months ago

I believe there should be a space before [INST] here: https://github.com/mistralai/mistral-common/blob/fcf0316163433af072f3cb157664c867661cbda7/src/mistral_common/tokens/tokenizers/sentencepiece.py#L167

This is validated by testing. We cannot make the model emit its proper instruct template because it seems to be disinclined to emit it, probably because of training details.

However, the model is very keen on emitting spurious </s> if this space is omitted, especially after numbers such as phone numbers, address numbers and dates.

Notably, this page gives two different chat templates, one matching the one in this repository written out as such, and the pseudocode below it which is missing the spaces after [INST] and after [/INST]: https://docs.mistral.ai/getting-started/open_weight_models/#chat-template

I believe both of these examples are actually incorrect.

Then there is this source: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

It shows the space between <s> and [INST], which I believe is correct. Adding that space to the template gives superior output quality based on qualitative tests.

However, the pseudocode given in that source matches the pseudocode from Mistral documentation and is different from what is shown, and I believe incorrect as well.

Please look into this, and be clear what the chat template actually is; whitespace is very very important.

keskival commented 3 months ago

Here is an example message which consistently cuts off with Mixtral-8x7B after numbers when a bad chat template without the space is used: example_message.txt

The information in the message is synthetic; no real personal information. Message embedded in Python code to remove ambiguities about escapings and linefeeds.

timlacroix commented 3 months ago

Hello ! Agreed whitespaces are frustratingly important in templates (which is why we've moved to control tokens in our newest templates).

I am far from an HF chat template expert but my understanding is :

Your question prompted me to run the same test we have here : https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1#instruct-tokenizer but with v1 compared with the mixtral-8x7b-instruct-v0.1 on HF.

The template on HF is wrong. Our ref implementation in this repo is correct. This is what I tested:

from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v1 = MistralTokenizer.v1()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="1"),
        AssistantMessage(content="2"),
        UserMessage(content="3"),
    ],
    model="test",
)
tokenized = tokenizer_v1.encode_chat_completion(mistral_query)

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x7B-Instruct-v0.1')
hf_messages = mistral_query.model_dump()['messages']
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

print("MISTRAL_COMMON")
print(tokenized.text)
print(tokenizer_hf.convert_ids_to_tokens(tokenized_mistral))
print("HF : MIXTRAL-8x7B")
print(tokenizer_v1.instruct_tokenizer.tokenizer.to_string(tokenized_hf))
print(tokenizer_hf.convert_ids_to_tokens(tokenized_hf))

tokenized_mistral = tokenized.tokens
assert tokenized_hf == tokenized_mistral

The output is

MISTRAL_COMMON
<s>▁[INST]▁1▁[/INST]▁2</s>▁[INST]▁3▁[/INST]
['<s>', '▁[', 'INST', ']', '▁', '1', '▁[', '/', 'INST', ']', '▁', '2', '</s>', '▁[', 'INST', ']', '▁', '3', '▁[', '/', 'INST', ']']
HF : MIXTRAL-8x7B
<s>▁[INST]▁1▁[/INST]2</s>▁▁[INST]▁3▁[/INST]
['<s>', '▁[', 'INST', ']', '▁', '1', '▁[', '/', 'INST', ']', '2', '</s>', '▁[', 'INST', ']', '▁', '3', '▁[', '/', 'INST', ']']

The only error I can see in the chat template on HF is the missing space before [/INST] and the assistant message Fun fact, the template for Mistral-7B-Instruct is wronger, there is an extra space between and [INST] ...

I would extremely gladly accept a PR that fixes it on all our 7B / 8x7B repos, hf tokenizers are a bit mysterious to me !

Another question since you're saying 8x7B behaves badly with mistral-common tokenization :

Thanks !

keskival commented 2 months ago

We are calling the model through VLLM with a custom chat template. As a tokenizer we use transformers and AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1").

We don't actually add the <s>, because it's added by the tokenizer. Regardless, it seems the space between the <s> and the [INST] is important.

What I need here is the confirmation that the chat template we use is correct from Mistral team, because they are the only ones who are able to check what their real chat template is in the training pipelines. Also, it would be great if they corrected the documentations which are at the very least conflictory if not incorrect.

pandora-s-git commented 1 month ago

It actually depends on the tokenizer, after some verifications, it seems like almost all HF templates need to be rewritten (im working on it using mistral-common as ground truth) but basically there are a few differences between the tokenizer, and so, between the versions of each model. Here are some examples with each tokenizer and a temporary Jinja chat template that I will latelly add to the HF repos: V3 Output:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]

V3 Temporary ChatTemplte:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST] ' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V3 Chat template Output:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST] 

V2 Output:

<s>[INST]▁Hello[/INST]▁Hi▁there!</s>[INST]▁How▁are▁you?[/INST]▁Fine▁and▁you?</s>[INST]▁Fine▁thank▁you.[/INST]

V2 Temp chat template:

{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + '[/INST] ' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V2 Chat template Output:

<s>[INST] Hello[/INST] Hi there!</s>[INST] How are you?[/INST] Fine and you?</s>[INST] Fine thank you.[/INST] 

V1 Output

<s>▁[INST]▁Hello▁[/INST]▁Hi▁there!</s>▁[INST]▁How▁are▁you?▁[/INST]▁Fine▁and▁you?</s>▁[INST]▁Fine▁thank▁you.▁[/INST]

V3 Temp chat template:

{{bos_token}}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ ' [INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

V3 Chat template output:

<s> [INST] Hello [/INST] Hi there!</s> [INST] How are you? [/INST] Fine and you?</s> [INST] Fine thank you. [/INST]

I will keep u updated, hopefully this will help a bit for now!

pandora-s-git commented 4 weeks ago

@keskival Following this, I've updated most of the chat templates on HF, the ones I provided previously are not 100% accurate so I recommend using the ones I've added to the HF repos. They are still not 100% perfect, as there seems to be some issues with the tokenizers themselves on some repos, but they are far better than before and should for most of them match with mistral_common now. But basically it's recommended to use mistral_common and take it as ground truth.