huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Special tokens will be split when there is no space before them #1408

Closed leizhao1234 closed 6 months ago

leizhao1234 commented 7 months ago

Hello, I add special tokens using:

```python
tokenizer.add_special_tokens(
    [
        AddedToken("<|system|>", normalized=True, single_word=False),
        AddedToken("<|user|>", normalized=True, single_word=False),
        AddedToken("<|assistant|>", normalized=True, single_word=False),
        AddedToken("<|observation|>", normalized=True, single_word=False),
    ]
)
```

and I know that these special tokens will never be processed by the model. But when I test the code below:

```python
sentence = "<|user|>\n你好<|assistant|>\n"
fast_encoded_ids = fast_tokenizer.encode(sentence)
fast_tokenized_sentence = fast_tokenizer.tokenize(sentence)
```

its output is:

```python
['▁<|user|>', '<0x0A>', '你好', '<', '|', 'ass', 'istant', '|', '>', '<0x0A>']
```

`<|assistant|>` is a special token, but it gets split. What should I do so that special tokens are not split when there is no space before them?

ArthurZucker commented 6 months ago

Hey, that is because the normalizer adds a ▁ before the token. The first split uses the un-normalized regex (which will not include ▁<|user|>), then the second split uses the normalized regex, because the splits are already normalized at that point. You should set normalized = False for the token, or just replace the normalizer following this: https://github.com/huggingface/transformers/pull/26678
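
For reference, a minimal sketch of the first suggestion (re-adding the tokens with `normalized=False`) against a `transformers` fast tokenizer; the checkpoint name below is a placeholder, not from this issue:

```python
from tokenizers import AddedToken
from transformers import AutoTokenizer

# Placeholder checkpoint name; substitute the tokenizer actually used in the issue.
fast_tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")

# normalized=False keeps the token matching on the raw (un-normalized) text,
# so "<|assistant|>" is recognized even with no space before it.
fast_tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<|system|>", normalized=False, single_word=False),
            AddedToken("<|user|>", normalized=False, single_word=False),
            AddedToken("<|assistant|>", normalized=False, single_word=False),
            AddedToken("<|observation|>", normalized=False, single_word=False),
        ]
    }
)

sentence = "<|user|>\n你好<|assistant|>\n"
print(fast_tokenizer.tokenize(sentence))
# Expected: '<|user|>' and '<|assistant|>' each remain a single token.
```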