nomic-ai / gpt4all-chat

mpt tokenizer: better special token handling #280

Closed apage43 closed 1 year ago

apage43 commented 1 year ago

This should allow correct handling of the <|im_start|> and <|im_end|> added tokens used by the mpt-7b-chat model; see their demo code for the prompt template.
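For reference, the chat format that relies on these tokens is ChatML-style, roughly like the following (the system text here is just illustrative; the exact wording comes from the MosaicML demo):

```
<|im_start|>system
A conversation between a user and an assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```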

To get closer to the behavior of HuggingFace tokenizers, do not attempt to handle added tokens as if they were part of the original vocabulary, since that cannot prevent them from being split into smaller chunks. Instead, handle added tokens in a separate pass before the regular tokenizing pass (see the sketch below).

Note that this is still necessary even with a "proper" tokenizer implementation.
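A minimal sketch of the idea (not the actual gpt4all-chat code; `bpe_tokenize`, `tokenize_with_added_tokens`, and the token ids are hypothetical placeholders): scan the input for added special tokens first, emit their ids directly, and only run the regular tokenizer on the plain text in between.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the model's regular BPE pass; here it just
// maps each byte to a fake id so the sketch compiles on its own.
static std::vector<int32_t> bpe_tokenize(const std::string &text) {
    std::vector<int32_t> ids;
    for (unsigned char c : text) ids.push_back(static_cast<int32_t>(c));
    return ids;
}

static std::vector<int32_t> tokenize_with_added_tokens(
        const std::string &text,
        const std::map<std::string, int32_t> &added_tokens) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        // Find the earliest occurrence of any added token at or after `pos`.
        size_t best_start = std::string::npos;
        auto best = added_tokens.end();
        for (auto it = added_tokens.begin(); it != added_tokens.end(); ++it) {
            size_t found = text.find(it->first, pos);
            if (found != std::string::npos && found < best_start) {
                best_start = found;
                best = it;
            }
        }
        if (best == added_tokens.end()) {
            // No more added tokens: tokenize the remainder normally and stop.
            auto tail = bpe_tokenize(text.substr(pos));
            out.insert(out.end(), tail.begin(), tail.end());
            break;
        }
        // Tokenize the plain text before the added token normally...
        if (best_start > pos) {
            auto chunk = bpe_tokenize(text.substr(pos, best_start - pos));
            out.insert(out.end(), chunk.begin(), chunk.end());
        }
        // ...then emit the added token's id directly, so the regular pass
        // never gets a chance to split it into smaller pieces.
        out.push_back(best->second);
        pos = best_start + best->first.size();
    }
    return out;
}
```

Because the added tokens are consumed before the regular pass ever sees them, strings like `<|im_start|>` always map to a single id, matching how HuggingFace tokenizers treat added tokens.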