huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.68k stars 746 forks

[Potential Bug] Mistral Tokenizer Inconsistencies #1448

Closed komninoschatzipapas closed 2 months ago

komninoschatzipapas commented 5 months ago

I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the legacy and use_fast options:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=False)
tokenizerl = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=False)
tokenizerf = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=True)
tokenizerlf = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=True)

s = "test<unk> This is a test phrase</s>"

print(
  f'Regular:\t{tokenizer.tokenize(s)}\n\t\t{tokenizer.decode(tokenizer.encode(s))}'
)

print(
  f'Legacy:\t\t{tokenizerl.tokenize(s)}\n\t\t{tokenizerl.decode(tokenizerl.encode(s))}'
)

print(
  f'Fast:\t\t{tokenizerf.tokenize(s)}\n\t\t{tokenizerf.decode(tokenizerf.encode(s))}'
)

print(
  f'Legacy Fast:\t{tokenizerlf.tokenize(s)}\n\t\t{tokenizerlf.decode(tokenizerlf.encode(s))}'
)

Which yields:

Regular:        ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk> This is a test phrase</s>
Legacy:         ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Fast:           ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Legacy Fast:    ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>

You can find the full code here.

There seem to be inconsistencies in how legacy=False, use_fast=False tokenizes input compared to the other option combinations.

If either option is set to True, an extra space is added after <unk> and other special tokens.

It seems to me that only legacy=False, use_fast=False tokenizes this input correctly.

We have a production app that extends Mistral with special tokens besides <unk>, and extra spaces are added after those too.

For now we have switched over to legacy=False, use_fast=False, forgoing the speed advantages of the Rust implementation.

Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you guys have put into this library 🙏

ArthurZucker commented 5 months ago

Hey! Thanks, a fix can be derived from #1357 and https://github.com/huggingface/transformers/pull/26678. Everything you describe is mentioned there. TL;DR: using Metaspace with prepend_scheme="first" and no normalizer will be the end of your problems.
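To see why prepend_scheme matters here, this hypothetical pure-Python sketch (not the library's actual code) mimics what a Metaspace-style pre-tokenizer does to the text segments between special tokens: the legacy "always" behaviour prepends the "▁" marker to every segment, while "first" prepends it only to the very first one, which removes the extra space after <unk>:

```python
# Hypothetical illustration of Metaspace prepend_scheme behaviour,
# not the actual Rust implementation in tokenizers.

def metaspace(segments, prepend_scheme="always"):
    """segments: the text pieces between special tokens."""
    out = []
    for i, seg in enumerate(segments):
        seg = seg.replace(" ", "▁")  # spaces become the metaspace marker
        # "always": prefix every segment; "first": prefix only the first
        if prepend_scheme == "always" or (prepend_scheme == "first" and i == 0):
            seg = "▁" + seg
        out.append(seg)
    return out

# The issue's input "test<unk> This is a test phrase" splits around <unk> into:
segments = ["test", " This is a test phrase"]

print(metaspace(segments, "always"))  # ['▁test', '▁▁This▁is▁a▁test▁phrase']  <- doubled '▁' after <unk>
print(metaspace(segments, "first"))   # ['▁test', '▁This▁is▁a▁test▁phrase']
```

The doubled "▁" in the "always" case is exactly the stray space showing up after <unk> in the Legacy/Fast decodes above.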

ArthurZucker commented 5 months ago

I have not had the time to change the default Llama fast tokenizer yet; will try to do so ASAP.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

komninoschatzipapas commented 3 months ago

I think this is still relevant

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 3 weeks ago

This was fixed in transformers; you need to set legacy=False 🤗