nomic-ai / gpt4all-chat

mpt tokenizer: better special token handling #280

Closed apage43 closed 1 year ago

apage43 commented 1 year ago

This should allow correct handling of the <|im_start|> and <|im_end|> added tokens used by the mpt-7b-chat model; see their demo code for the prompt template.
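For reference, the chat format that relies on these tokens is ChatML-style, roughly like the following (the system text here is just illustrative; the exact wording comes from the MosaicML demo):

```
<|im_start|>system
A conversation between a user and an assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```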

To get closer to the behavior of HuggingFace tokenizers, do not attempt to handle added tokens as if they were part of the original vocabulary, since that cannot prevent them from being split into smaller chunks. Instead, handle added tokens in a separate pass before the regular tokenizing pass (see the sketch below).

Note that this is still necessary even with a "proper" tokenizer implementation.
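A minimal sketch of the idea (not the actual gpt4all-chat code; `bpe_tokenize`, `tokenize_with_added_tokens`, and the token ids are hypothetical placeholders): scan the input for added special tokens first, emit their ids directly, and only run the regular tokenizer on the plain text in between.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the model's regular BPE pass; here it just
// maps each byte to a fake id so the sketch compiles on its own.
static std::vector<int32_t> bpe_tokenize(const std::string &text) {
    std::vector<int32_t> ids;
    for (unsigned char c : text) ids.push_back(static_cast<int32_t>(c));
    return ids;
}

static std::vector<int32_t> tokenize_with_added_tokens(
        const std::string &text,
        const std::map<std::string, int32_t> &added_tokens) {
    std::vector<int32_t> out;
    size_t pos = 0;
    while (pos < text.size()) {
        // Find the earliest occurrence of any added token at or after `pos`.
        size_t best_start = std::string::npos;
        auto best = added_tokens.end();
        for (auto it = added_tokens.begin(); it != added_tokens.end(); ++it) {
            size_t found = text.find(it->first, pos);
            if (found != std::string::npos && found < best_start) {
                best_start = found;
                best = it;
            }
        }
        if (best == added_tokens.end()) {
            // No more added tokens: tokenize the remainder normally and stop.
            auto tail = bpe_tokenize(text.substr(pos));
            out.insert(out.end(), tail.begin(), tail.end());
            break;
        }
        // Tokenize the plain text before the added token normally...
        if (best_start > pos) {
            auto chunk = bpe_tokenize(text.substr(pos, best_start - pos));
            out.insert(out.end(), chunk.begin(), chunk.end());
        }
        // ...then emit the added token's id directly, so the regular pass
        // never gets a chance to split it into smaller pieces.
        out.push_back(best->second);
        pos = best_start + best->first.size();
    }
    return out;
}
```

Because the added tokens are consumed before the regular pass ever sees them, strings like `<|im_start|>` always map to a single id, matching how HuggingFace tokenizers treat added tokens.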