huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Special tokens will be split when there is no space before them #1408

Closed leizhao1234 closed 6 months ago

leizhao1234 commented 7 months ago

Hello, I add special tokens using:

```python
tokenizer.add_special_tokens(
    [
        AddedToken("<|system|>", normalized=True, single_word=False),
        AddedToken("<|user|>", normalized=True, single_word=False),
        AddedToken("<|assistant|>", normalized=True, single_word=False),
        AddedToken("<|observation|>", normalized=True, single_word=False),
    ]
)
```

and I know that these special tokens will never be processed by the model. But when I test the code below:

```python
sentence = "<|user|>\n你好<|assistant|>\n"
fast_encoded_ids = fast_tokenizer.encode(sentence)
fast_tokenized_sentence = fast_tokenizer.tokenize(sentence)
```

its output is:

```python
['▁<|user|>', '<0x0A>', '你好', '<', '|', 'ass', 'istant', '|', '>', '<0x0A>']
```

`<|assistant|>` is a special token, but it gets split. What should I do so that special tokens are not split when there is no space before them?

ArthurZucker commented 6 months ago

Hey, that is because the normalizer adds a ▁ before the token. The first split uses the un-normalized regex (which will not include ▁<|user|>), then the second split uses the normalized regex, because the splits are already normalized at that point. You should set normalized = False for the token, or just replace the normalizer following this: https://github.com/huggingface/transformers/pull/26678
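
For reference, a minimal sketch of the first suggestion (re-adding the tokens with `normalized=False`) against a `transformers` fast tokenizer; the checkpoint name below is a placeholder, not from this issue:

```python
from tokenizers import AddedToken
from transformers import AutoTokenizer

# Placeholder checkpoint name; substitute the tokenizer actually used in the issue.
fast_tokenizer = AutoTokenizer.from_pretrained("your-org/your-model")

# normalized=False keeps the token matching on the raw (un-normalized) text,
# so "<|assistant|>" is recognized even with no space before it.
fast_tokenizer.add_special_tokens(
    {
        "additional_special_tokens": [
            AddedToken("<|system|>", normalized=False, single_word=False),
            AddedToken("<|user|>", normalized=False, single_word=False),
            AddedToken("<|assistant|>", normalized=False, single_word=False),
            AddedToken("<|observation|>", normalized=False, single_word=False),
        ]
    }
)

sentence = "<|user|>\n你好<|assistant|>\n"
print(fast_tokenizer.tokenize(sentence))
# Expected: '<|user|>' and '<|assistant|>' each remain a single token.
```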