openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License
11.61k stars, 785 forks

Numbers + whitespaces are not tokenized properly #177

Closed. bzwheeler closed this issue 1 year ago

bzwheeler commented 1 year ago

For gpt-4 (cl100k_base), the string "1 2 3 4 5" generates 5 tokens [16, 362, 513, 604, 642] using OpenAI's online tokenizer (https://platform.openai.com/tokenizer), but tiktoken produces 9 tokens [16, 220, 17, 220, 18, 220, 19, 220, 20].

hauntsaninja commented 1 year ago

https://platform.openai.com/tokenizer doesn't have the cl100k_base tokeniser, so the two results come from different encodings and aren't directly comparable.
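
To check this locally, here is a minimal sketch that encodes the same string under several tiktoken encodings and returns the token IDs side by side. `token_ids_by_encoding` is a hypothetical helper, not part of tiktoken. It only relies on `tiktoken.get_encoding` and `Encoding.encode`. Which encoding the online tokenizer used is not stated in this thread, so treating it as one of the older ones such as r50k_base is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Optional


def token_ids_by_encoding(
    text: str,
    encoding_names: Iterable[str],
    get_encoding: Optional[Callable] = None,
) -> Dict[str, List[int]]:
    """Encode `text` once per named encoding and return the token IDs.

    Hypothetical helper for illustration. `get_encoding` defaults to
    tiktoken.get_encoding; it is injectable only so the helper can be
    exercised without downloading vocabulary files.
    """
    if get_encoding is None:
        # Imported lazily; requires `pip install tiktoken`.
        import tiktoken

        get_encoding = tiktoken.get_encoding

    # Map each encoding name to the list of token IDs it produces.
    return {name: list(get_encoding(name).encode(text)) for name in encoding_names}
```

Calling `token_ids_by_encoding("1 2 3 4 5", ["cl100k_base", "r50k_base"])` should give the 9-token list reported above for cl100k_base. Whether the r50k_base entry matches the 5-token list from the online tokenizer depends on the assumption that the site was using that encoding.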