Should allow correct handling of the <|im_start|> and <|im_end|> additional tokens used in the mpt-7b-chat model. See their demo code for the prompt template.
Closer to the behavior of Hugging Face tokenizers: do not handle additional tokens as if they were part of the original vocabulary, because that cannot prevent them from being split into smaller chunks. Instead, handle added tokens before the regular tokenizing pass.
Note that this is still necessary even with a "proper" tokenizer implementation.
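
To illustrate the idea of handling added tokens before the regular tokenizing pass, here is a minimal Python sketch. It is not the code from this PR: the function name `tokenize_with_added_tokens`, the toy character-level base tokenizer, and the token IDs are all hypothetical placeholders chosen for the example.

```python
import re
from typing import Callable, Dict, List


def tokenize_with_added_tokens(
    text: str,
    added_tokens: Dict[str, int],
    base_tokenize: Callable[[str], List[int]],
) -> List[int]:
    """Split the text on added tokens first, then tokenize the rest normally."""
    if not added_tokens:
        return base_tokenize(text)

    # Longest tokens first, so that overlapping added tokens prefer the
    # longest match. re.escape is needed because "|" is a regex metacharacter.
    alternatives = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in alternatives) + ")"

    ids: List[int] = []
    # With a capturing group, re.split keeps the matched added tokens
    # in the result list alongside the text between them.
    for piece in re.split(pattern, text):
        if not piece:
            continue  # empty strings appear around adjacent/leading matches
        if piece in added_tokens:
            # Emitted as a single ID; the base tokenizer never sees it.
            ids.append(added_tokens[piece])
        else:
            ids.extend(base_tokenize(piece))
    return ids


def toy_tokenize(segment: str) -> List[int]:
    """Stand-in for the regular tokenizer: one ID per character."""
    return [ord(ch) for ch in segment]


# Placeholder IDs, NOT the real mpt-7b-chat vocabulary IDs.
ADDED = {"<|im_start|>": 100000, "<|im_end|>": 100001}

ids = tokenize_with_added_tokens("<|im_start|>hi<|im_end|>", ADDED, toy_tokenize)
```

Because the added tokens are cut out before the base tokenizer runs, they cannot be broken into smaller chunks regardless of how the base tokenizer is implemented.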