Closed: matheus-prandini closed this issue 3 months ago
I noticed from this transformers.AutoTokenizer documentation that the Mistral model is indeed using LlamaTokenizer. Why is that? I'm experiencing issues with control tokens that aren't encoded by MistralTokenizer but are encoded by AutoTokenizer (Llama). Perhaps this issue is not a bug. It could be a feature.
@matheus-prandini It's not uncommon for models to use the same processing class, e.g. Mistral using the LlamaTokenizer, if the logic is the same for both; Phi3 uses the Llama tokenizer too. In fact, there's no "mistral tokenizer" implemented in transformers, as that would mean copying all of the Llama tokenizer code, which we'd like to avoid.
You can see the mappings defined here.
@amyeroberts Thank you very much for your response! My idea is to use an extension of MistralTokenizer to add new tokens. Since AutoTokenizer has the add_tokens method, my initial plan was to load the Mistral model in AutoTokenizer and add the new tokens through it. However, there is an issue because control tokens are not encoded by MistralTokenizer but are encoded by AutoTokenizer. I'll try to extend SentencePieceTokenizer from Mistral to accomplish this. If that doesn't work, I'll attempt to adjust AutoTokenizer with MistralTokenizer by adding control tokens and any other necessary elements...
@matheus-prandini I'm not sure I completely followed what you're trying to achieve here, but the forums are a great place to ask the community about your project and to ask for guidance and help.
Just a few points of clarification / comments:
my initial plan was to load the Mistral model in AutoTokenizer
I'm guessing you meant tokenizer here rather than model.
However, there is an issue because control tokens are not encoded by MistralTokenizer but are encoded by AutoTokenizer
There is no MistralTokenizer in the transformers library. Loading a tokenizer using AutoTokenizer and a Mistral checkpoint will load a LlamaTokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
tokenizer.add_tokens(["foo", "bar"])
tokenizer.save_pretrained("my_new_tokenizer")
tokenizer = AutoTokenizer.from_pretrained("my_new_tokenizer")
cc @ArthurZucker who knows more about the ins and outs of the tokenizers
@amyeroberts Sorry if I didn't make it clear before, but my goal is to expand the Mistral tokenizer's vocabulary with new special tokens.
I'm guessing you meant tokenizer here rather than model.
Yes, it's the tokenizer. Sorry!
If you instantiate a tokenizer and add tokens, this tokenizer class (LlamaTokenizer in this case) will encode those added tokens. You can then save this tokenizer out and load it with AutoTokenizer, e.g.
This is the second approach I'm trying. The main issue is the encoding differences between AutoTokenizer and MistralTokenizer regarding Mistral's control tokens. For example, when a control token such as [INST] appears in user-supplied text, it should not be encoded to its token_id (the token_id for [INST] is 3), in order to avoid prompt injection. MistralTokenizer encodes it into three tokens ('[', 'INST', ']'), while AutoTokenizer encodes it into a single token ('[INST]') plus the begin_of_sentence token, as shown in the example below. I want to add new control tokens and to adjust this behavior in AutoTokenizer.
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
mistral_tokenizer = MistralTokenizer.v3()
auto_tokenizer = AutoTokenizer.from_pretrained(model_id)
example = '[INST]'
mistral_result = mistral_tokenizer.instruct_tokenizer.tokenizer.encode(example, bos=False, eos=False)
print(f"Mistral Result: {mistral_result} - Len: {len(mistral_result)}")
auto_result = auto_tokenizer.encode(example)
print(f"Auto Result: {auto_result} - Len: {len(auto_result)}")
Results:
Mistral Result: [1501, 17057, 29561] - Len: 3 (['[', 'INST', ']'])
Auto Result: [1, 3] - Len: 2 (['<s>', '[INST]'])
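The difference between the two results is exactly whether special tokens are matched in the raw text before the underlying model runs. A toy pure-Python sketch of that pre-splitting step (illustrative only; this is not the actual transformers or mistral_common code, and all names are invented):

```python
import re

def encode(text, special_tokens, base_encode, match_specials):
    """Toy sketch: if match_specials, split the input on the special
    tokens and map each hit straight to its id (HF-style added tokens);
    otherwise the whole string goes through the base model, so a
    user-supplied "[INST]" can never become the control id."""
    if not match_specials:
        return base_encode(text)
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    ids = []
    for chunk in re.split(pattern, text):
        if chunk in special_tokens:
            ids.append(special_tokens[chunk])
        elif chunk:
            ids.extend(base_encode(chunk))
    return ids

# Stand-in for the SentencePiece model: one id per character.
base_encode = lambda s: [ord(c) for c in s]
specials = {"[INST]": 3}

print(encode("[INST]", specials, base_encode, match_specials=True))   # [3]
print(encode("[INST]", specials, base_encode, match_specials=False))  # character ids, no 3
```

With matching on, the string collapses to the control id (the LlamaTokenizer behavior above); with matching off, it stays ordinary text (the mistral_common behavior).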
@amyeroberts I managed to accomplish what I wanted. I had to study SentencePiece and modify the protobuf to add new tokens, and then load it into MistralTokenizer. This notebook was very helpful for that: https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb. Thank you for your responses!
@matheus-prandini I think there is a misunderstanding here: we specifically designed tokenizers to make it easy to add new tokens and expand the vocab. Simply calling tokenizer.add_tokens solves this, as long as the legacy flag is set to False. There is a big warning when you don't set this properly, and your Mistral result shows the token being split precisely because the token is either not added or normalized!
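That point can be sketched offline with the tokenizers library itself, using a toy WordLevel vocab (the vocab and token names here are invented for the example, not Mistral's real ones): a token registered via add_tokens with normalized=False is matched in the input before the model runs, so it comes back as a single id instead of being split.

```python
from tokenizers import AddedToken, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Toy base vocabulary; a real setup would load the model's own vocab.
tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# normalized=False keeps the token out of the normalizer's reach,
# which is the usual reason an added token ends up split.
tok.add_tokens([AddedToken("[CTRL]", normalized=False)])

enc = tok.encode("hello [CTRL] world")
print(enc.tokens)  # added token survives as a single piece
print(enc.ids)     # its id is appended after the base vocab
```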
@ArthurZucker Got it! Thank you for the explanation!
System Info
Who can help?
@ArthurZucker
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)
Reproduction
I'm trying to load the Mistral tokenizer using AutoTokenizer from the Mistral model in the following code snippet:
When I inspect the auto_tokenizer variable, I get a LlamaTokenizerFast:
I don't know if I'm missing something, but it is loading a different tokenizer than I expected.
Expected behavior
IMHO it should instantiate a MistralTokenizer.v3() tokenizer as implemented in mistral-common. I checked the TOKENIZER_MAPPING object, and Mistral isn't even listed there.