huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How to split special token in encode? #1391

Closed · leizhao1234 closed this issue 6 months ago

leizhao1234 commented 7 months ago

I have converted a slow tokenizer into a PreTrainedTokenizerFast and got a tokenizer.json file, but I found that this tokenizer does not split special tokens. Here is how I add the special tokens: tokenizer.add_special_tokens( [ AddedToken("[gMASK]", normalized=True, single_word=False), AddedToken("sop", normalized=True, single_word=False), ] )
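
For reference, a minimal sketch of the setup described above, using the `tokenizers` library directly; the file path matches the comment, while the sample input string is an assumption:

```python
from tokenizers import Tokenizer, AddedToken

# Load the tokenizer.json produced by the slow -> fast conversion.
tokenizer = Tokenizer.from_file("tokenizer.json")

# Register the special tokens exactly as in the snippet above.
tokenizer.add_special_tokens(
    [
        AddedToken("[gMASK]", normalized=True, single_word=False),
        AddedToken("sop", normalized=True, single_word=False),
    ]
)

# Added tokens are matched as whole units, so they appear as single
# tokens in the encoding instead of being split.
print(tokenizer.encode("[gMASK]sop some text").tokens)

tokenizer.save("tokenizer.json")
```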

ArthurZucker commented 7 months ago

Hey! This is currently not supported; I'll be working on it for the next release! This will be either:

leizhao1234 commented 7 months ago

Thank you for your reply. But I also found that when I use tokenizer.add_tokens() instead of tokenizer.add_special_tokens(), the added tokens still cannot be split. Why is this?

ArthurZucker commented 7 months ago

When you add a token, whether it is special or not, it will not be split. The `normalized` flag is what controls whether the token is first normalized. If your normalizer adds a `_` before the token, for example, then `<s>` will be split, because the representation of the token will be `_<s>` and will not match.
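
To make this concrete, here is a small self-contained sketch (not the original tokenizer; the WordLevel vocab, the `Prepend` normalizer, and the `<s>` token are all made up for the illustration):

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel
from tokenizers.normalizers import Prepend
from tokenizers.pre_tokenizers import Whitespace


def make_tokenizer(normalized: bool) -> Tokenizer:
    # Throwaway WordLevel tokenizer whose normalizer prepends "_" to the
    # input, mimicking the situation described above.
    tok = Tokenizer(WordLevel({"_hello": 0, "hello": 1, "[UNK]": 2}, unk_token="[UNK]"))
    tok.normalizer = Prepend("_")
    tok.pre_tokenizer = Whitespace()
    tok.add_tokens([AddedToken("<s>", normalized=normalized)])
    return tok


# normalized=True: the token is matched against the normalized text, and its
# pattern is itself normalized to "_<s>", which never occurs in the input
# here, so "<s>" ends up being split by the pre-tokenizer and model.
print(make_tokenizer(True).encode("hello <s>").tokens)

# normalized=False: the token is matched on the raw input, before any
# normalization, so "<s>" is kept as a single token.
print(make_tokenizer(False).encode("hello <s>").tokens)
```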

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 6 months ago

#1419 will fix this; waiting to close!