huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Running AutoTokenizer.from_pretrained with Mistral V3 is actually loading LlamaTokenizer #31375

Closed matheus-prandini closed 3 months ago

matheus-prandini commented 3 months ago

System Info

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

I'm trying to load the Mistral tokenizer using AutoTokenizer for the Mistral model, as in the following code snippet:

from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"
auto_tokenizer = AutoTokenizer.from_pretrained(model_id)

When I inspect the auto_tokenizer variable, I get a LlamaTokenizerFast:

LlamaTokenizerFast(name_or_path='mistralai/Mistral-7B-Instruct-v0.3', vocab_size=32768, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
    0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    ...
}

I don't know if I'm missing something, but it is loading a different tokenizer than I expected.

Expected behavior

IMHO it should instantiate a MistralTokenizer.v3() tokenizer as implemented in mistral-common. I checked the TOKENIZER_MAPPING object, and Mistral isn't even listed there.

matheus-prandini commented 3 months ago

I noticed from this transformers.AutoTokenizer documentation that the Mistral model does indeed use LlamaTokenizer. Why is that? I'm running into issues with control tokens that are not encoded by MistralTokenizer but are encoded by AutoTokenizer (Llama). Perhaps this is not a bug but intended behavior.

amyeroberts commented 3 months ago

@matheus-prandini It's not uncommon for models to use the same processing class if the logic is the same for both, e.g. Mistral using the LlamaTokenizer; Phi3 uses the Llama tokenizer too. In fact, there's no "Mistral tokenizer" implemented in transformers, as that would just mean copying all of the Llama tokenizer code, which we'd like to avoid.

You can see the mappings defined here.
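For reference, one way to confirm which tokenizer classes a model type resolves to is to look it up in the auto-tokenizer name mapping. This is a minimal sketch, assuming TOKENIZER_MAPPING_NAMES is importable from transformers.models.auto.tokenization_auto (an internal path that may change between versions):

from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING_NAMES

# "mistral" maps to the Llama tokenizer classes (slow, fast); there is no
# dedicated Mistral tokenizer entry in the mapping
print(TOKENIZER_MAPPING_NAMES["mistral"])
# e.g. ('LlamaTokenizer', 'LlamaTokenizerFast')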

matheus-prandini commented 3 months ago

@amyeroberts Thank you very much for your response! My idea is to use an extension of MistralTokenizer to add new tokens. Since AutoTokenizer has the add_tokens method, my initial plan was to load the Mistral model in AutoTokenizer and add the new tokens through it. However, there is an issue because control tokens are not encoded by MistralTokenizer but are encoded by AutoTokenizer. I'll try to extend SentencePieceTokenizer from Mistral to accomplish this. If that doesn't work, I'll attempt to adjust AutoTokenizer with MistralTokenizer by adding control tokens and any other necessary elements...

amyeroberts commented 3 months ago

@matheus-prandini I'm not sure I completely followed what you're trying to achieve here, but the forums are a great place to ask the community about your project and to ask for guidance and help.

Just a few points of clarification / comments :

my initial plan was to load the Mistral model in AutoTokenizer

I'm guessing you meant tokenizer here rather than model.

However, there is an issue because control tokens are not encoded by MistralTokenizer but are encoded by AutoTokenizer

If you instantiate a tokenizer and add tokens, this tokenizer class (llamatokenizer in this case) will encode those added tokens. You can then save this tokenizer out and load it with AutoTokenizer e.g.

from transformers import AutoTokenizer

# Load the base tokenizer and register the new tokens
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.add_tokens(["foo", "bar"])

# Save the extended tokenizer and reload it with AutoTokenizer
tokenizer.save_pretrained("my_new_tokenizer")
tokenizer = AutoTokenizer.from_pretrained("my_new_tokenizer")

cc @ArthurZucker who knows more about the ins and outs of the tokenizers

matheus-prandini commented 3 months ago

@amyeroberts Sorry if I didn't make it clear before, but my goal is to expand the Mistral tokenizer's vocabulary with new special tokens.

I'm guessing you meant tokenizer here rather than model.

Yes, it's the tokenizer. Sorry!

If you instantiate a tokenizer and add tokens, this tokenizer class (llamatokenizer in this case) will encode those added tokens. You can then save this tokenizer out and load it with AutoTokenizer e.g.

This is the second approach I'm trying. The main issue is the encoding difference between AutoTokenizer and MistralTokenizer for Mistral's control tokens. For example, to avoid prompt injection, the control token [INST] appearing in plain text should not be encoded to its token_id (which is 3). MistralTokenizer encodes it into three tokens ('[', 'INST', ']'), while AutoTokenizer encodes it into a single token ('[INST]') plus the begin-of-sentence token, as shown in the example below. I want to add new control tokens and adjust this behavior in AutoTokenizer.

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"

mistral_tokenizer = MistralTokenizer.v3()
auto_tokenizer = AutoTokenizer.from_pretrained(model_id)

example = '[INST]'
mistral_result = mistral_tokenizer.instruct_tokenizer.tokenizer.encode(example, bos=False, eos=False)
print(f"Mistral Result: {mistral_result} - Len: {len(mistral_result)}")
auto_result = auto_tokenizer.encode(example)
print(f"Auto Result: {auto_result} - Len: {len(auto_result)}")

Results:

Mistral Result: [1501, 17057, 29561] - Len: 3 (['[', 'INST', ']'])
Auto Result: [1, 3] - Len: 2 (['<s>', '[INST]'])

matheus-prandini commented 3 months ago

@amyeroberts I managed to accomplish what I wanted. I had to study SentencePiece and modify the protobuf to add new tokens, and then load it into MistralTokenizer. This notebook was very helpful for that: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb. Thank you for your responses!
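For anyone who wants to follow the same route, the approach from that notebook looks roughly like the sketch below. The file paths and the token name are placeholders rather than the exact ones used here, and it assumes the sentencepiece package exposes the sentencepiece_model_pb2 protobuf bindings.

import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_model

# Load the serialized SentencePiece model shipped with the checkpoint
# (the path is a placeholder)
m = sp_model.ModelProto()
with open("tokenizer.model.v3", "rb") as f:
    m.ParseFromString(f.read())

# Append a new CONTROL piece so it behaves like the existing control tokens
new_piece = sp_model.ModelProto.SentencePiece()
new_piece.piece = "[MY_NEW_CTRL]"  # placeholder token name
new_piece.score = 0.0
new_piece.type = sp_model.ModelProto.SentencePiece.CONTROL
m.pieces.append(new_piece)

# Write the extended model back out so it can be loaded by the
# SentencePiece-based Mistral tokenizer
with open("tokenizer.model.extended", "wb") as f:
    f.write(m.SerializeToString())

sp = spm.SentencePieceProcessor(model_file="tokenizer.model.extended")
print(sp.piece_to_id("[MY_NEW_CTRL]"))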

ArthurZucker commented 3 months ago

@matheus-prandini I think there is a misunderstanding here: we specifically designed tokenizers to make it easy to add new tokens and expand the vocab. Simply calling tokenizer.add_tokens solves this, as long as the legacy flag is set to False. There is a big warning when you don't set this properly, and your Mistral result shows the token being split precisely because the token is either not added or normalized!
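A minimal sketch of what that looks like in transformers (the token name is a placeholder; passing legacy=False together with from_slow=True is an assumption about how to force the fast tokenizer to be rebuilt with the flag applied, and may depend on the transformers version):

from transformers import AutoTokenizer, AddedToken

# legacy=False opts out of the legacy SentencePiece behaviour mentioned above;
# from_slow=True rebuilds the fast tokenizer so the flag takes effect
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3", legacy=False, from_slow=True
)

# Register the new control token as a special, non-normalized added token so
# the tokenizer encodes it as a single id instead of splitting it
tokenizer.add_tokens(
    [AddedToken("[MY_NEW_CTRL]", normalized=False, special=True)],
    special_tokens=True,
)

print(tokenizer.encode("[MY_NEW_CTRL]", add_special_tokens=False))  # a single id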

matheus-prandini commented 3 months ago

@ArthurZucker Got it! Thank you for the explanation!