Closed: leizhao1234 closed this issue 6 months ago
Hey! This is currently not supported, I'll be working on this for the next release! This will be either:
Thank you for your reply. But I also found that when I use `tokenizer.add_tokens()`, not `tokenizer.add_special_tokens()`, the added tokens still cannot be split. Why is this?
It's just that when you add a token, whether it's special or not, it will not be split. The `normalized` flag is what controls whether the token is first normalized. If your normalizer adds a `_` before the token, for example, then `<s>` will be split, because the stored representation of the token will be `_<s>` and will no longer match the raw text.
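For illustration, here is a minimal sketch of that behavior, assuming a SentencePiece-based checkpoint such as `t5-small` (any model whose normalizer rewrites the input should behave similarly); `<my_tok>` and `<my_tok2>` are made-up tokens:

```python
from transformers import AutoTokenizer
from tokenizers import AddedToken

tok = AutoTokenizer.from_pretrained("t5-small")

# normalized=False: the raw string is matched against the input *before*
# normalization, so the token should survive as a single piece.
tok.add_tokens(AddedToken("<my_tok>", normalized=False))

# normalized=True: the token is run through the normalizer first, so the
# stored pattern can be rewritten (e.g. with a leading metaspace) and may
# no longer match the raw text, letting the token get split.
tok.add_tokens(AddedToken("<my_tok2>", normalized=True))

print(tok.tokenize("a <my_tok> b"))   # expect "<my_tok>" kept whole
print(tok.tokenize("a <my_tok2> b"))  # "<my_tok2>" may be split, depending on the normalizer
```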
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I have converted a slow tokenizer into `PreTrainedTokenizerFast` and got a tokenizer.json file, but I found that this tokenizer did not split special tokens. Here is how I add the tokens:
```python
tokenizer.add_special_tokens(
    [
        AddedToken("[gMASK]", normalized=True, single_word=False),
        AddedToken("sop", normalized=True, single_word=False),
    ]
)
```
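As a sanity check, a minimal sketch along these lines reads the resulting entries back out of tokenizer.json (using the standard `added_tokens` schema of that file):

```python
import json

# Each AddedToken from the call above is serialized into the "added_tokens"
# list of tokenizer.json; the "normalized" field is what decides whether the
# token is passed through the normalizer before matching.
with open("tokenizer.json") as f:
    data = json.load(f)

for entry in data["added_tokens"]:
    print(entry["content"], "normalized =", entry["normalized"])
```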