mistralai / mistral-inference

Official inference library for Mistral models
https://mistral.ai/
Apache License 2.0

Mistral's tokenizer is not optimized #134

Open Yarflam opened 4 months ago

Yarflam commented 4 months ago

Hello!

How to reproduce:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.2')

tokenizer.add_bos_token = False
tokenizer.add_eos_token = False

ids = [12866, 601]  # "▁domestic" + "ated"
decoded = tokenizer.decode(ids)
encoded = tokenizer.encode(decoded)
print(encoded)
# output -> [2853, 374, 6899]
# i.e. "▁dom" + "est" + "icated"

I don't know what the best fix is, or whether this case affects the computation. This is just feedback - but I'm sure other cases like this can be found.
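For anyone who wants to search for more of these, here is a small sketch of a round-trip check. `roundtrip_stable` and `ToyTokenizer` are hypothetical names I made up for illustration; the helper only assumes the Hugging Face-style `decode`/`encode` interface used in the snippet above, and the toy tokenizer just mimics how a greedy segmenter can pick a different split than the one the ids came from:

```python
def roundtrip_stable(tokenizer, ids):
    """True iff encode(decode(ids)) reproduces ids exactly."""
    text = tokenizer.decode(ids)
    return tokenizer.encode(text, add_special_tokens=False) == list(ids)

# Toy stand-in for illustration only; the real check would use the
# Mistral tokenizer loaded as in the snippet above.
class ToyTokenizer:
    vocab = {0: "dom", 1: "est", 2: "icated", 3: "domestic", 4: "ated"}

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids)

    def encode(self, text, add_special_tokens=True):
        # Greedy longest-prefix match, like a simplistic subword segmenter.
        pieces = sorted(self.vocab.items(), key=lambda kv: -len(kv[1]))
        ids = []
        while text:
            tid, piece = next((i, p) for i, p in pieces if text.startswith(p))
            ids.append(tid)
            text = text[len(piece):]
        return ids

toy = ToyTokenizer()
print(roundtrip_stable(toy, [3, 4]))     # True: greedy picks this split
print(roundtrip_stable(toy, [0, 1, 2]))  # False: re-encodes as [3, 4]
```

Running this over random pairs of real vocabulary ids would surface other unstable sequences like "▁domestic" + "ated".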