Closed leizhao1234 closed 6 months ago
Hey, that is because the normalizer adds a ▁ before the token. The first pass splits on the un-normalized regex (which will not include ▁<|user|>), then the second pass splits using the normalized regex, because the splits are normalized at that point. You should set normalized=False for the token, or just replace the normalizer following this: https://github.com/huggingface/transformers/pull/26678
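To make the two-pass behaviour described above concrete, here is a toy, pure-Python sketch. It is not the tokenizers implementation: `normalize`, `split_on`, and `extract` are hypothetical helpers, and the normalizer is assumed to simply prepend ▁ and replace spaces with ▁. It only illustrates why a token added with normalized=True is stored with a leading ▁ and therefore fails to match in the middle of a string, while normalized=False matches on the raw text.

```python
import re

SPACE = "\u2581"  # the "▁" marker used by SentencePiece-style tokenizers


def normalize(text):
    # Assumption: the normalizer prepends "▁" and replaces spaces with "▁".
    return SPACE + text.replace(" ", SPACE)


def split_on(tokens, text):
    # Split `text` on any literal token in `tokens`, keeping the tokens.
    if not tokens:
        return [text]
    ordered = sorted(tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    return [piece for piece in re.split(pattern, text) if piece]


def extract(text, added_tokens):
    # `added_tokens` maps token content -> its `normalized` flag.
    raw = [t for t, flag in added_tokens.items() if not flag]
    # Tokens with normalized=True are stored in their normalized form.
    norm = [normalize(t) for t, flag in added_tokens.items() if flag]
    pieces = []
    # Pass 1: match non-normalized tokens against the raw text.
    for piece in split_on(raw, text):
        if piece in raw:
            pieces.append(piece)
        else:
            # Pass 2: normalize the remainder, then match normalized tokens.
            pieces.extend(split_on(norm, normalize(piece)))
    return pieces


sentence = "<|user|>\n你好<|assistant|>\n"
print(extract(sentence, {"<|user|>": True, "<|assistant|>": True}))
print(extract(sentence, {"<|user|>": False, "<|assistant|>": False}))
```

In this sketch, the normalized=True call isolates only ▁<|user|> and leaves <|assistant|> embedded in ordinary text, where a real model would break it into pieces. The normalized=False call isolates both tokens.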
Hello, I added special tokens using:

    tokenizer.add_special_tokens(
        [
            AddedToken("<|system|>", normalized=True, single_word=False),
            AddedToken("<|user|>", normalized=True, single_word=False),
            AddedToken("<|assistant|>", normalized=True, single_word=False),
            AddedToken("<|observation|>", normalized=True, single_word=False),
        ]
    )
I know that these special tokens will never be processed by the model. But when I test the code below:

    sentence = "<|user|>\n你好<|assistant|>\n"
    fast_encoded_ids = fast_tokenizer.encode(sentence)
    fast_tokenized_sentence = fast_tokenizer.tokenize(sentence)

the output is:

    ['▁<|user|>', '<0x0A>', '你好', '<', '|', 'ass', 'istant', '|', '>', '<0x0A>']
<|assistant|> is a special token, but it is being split. What should I do so that special tokens are not split when there is no space before them?