openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Are new line characters separate tokens? #249

Closed. GlassBeaver closed this issue 5 months ago

GlassBeaver commented 5 months ago

Hi, on https://platform.openai.com/tokenizer newlines are not treated as separate tokens; in this library, however, they are. I'm wondering which one is correct, and whether there are any flags or configuration settings I'm overlooking.

For instance, this is 5 tokens on the website but 7 tokens using the library:

a

b

c

hauntsaninja commented 5 months ago

λ cat z.py
import tiktoken
enc = tiktoken.get_encoding("cl100k_base")
x = """a

b

c"""
print(len(enc.encode(x)))
λ python z.py
5
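
When two tools disagree on a count like this, it can help to look at the individual tokens rather than only their number. The sketch below is one possible way to do that. The helper names `token_pieces` and `describe` are hypothetical (they are not part of tiktoken), and `enc` is assumed to be a tiktoken `Encoding` such as the one returned by `tiktoken.get_encoding("cl100k_base")`. Printing the `repr` of the input also shows exactly which newline characters were passed in, which may differ from what was pasted into the website.

```python
def token_pieces(enc, text):
    """Return the raw bytes behind each token that `enc` produces for `text`.

    Assumption: `enc` behaves like a tiktoken Encoding. Only its `encode`
    and `decode_single_token_bytes` methods are used here.
    """
    return [enc.decode_single_token_bytes(token) for token in enc.encode(text)]


def describe(enc, text):
    """Summarise the exact input (via repr) and how it was split into tokens."""
    pieces = token_pieces(enc, text)
    return f"{text!r} -> {len(pieces)} tokens: {pieces}"


# Possible usage (requires tiktoken to be installed):
#
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   print(describe(enc, "a\n\nb\n\nc"))
```

If the string being encoded in your own program differs from the one above (for example, different line endings or extra whitespace), the per-token output should make that visible.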