openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Inconsistent "\n\n" encoding #159

Closed vince62s closed 1 year ago

vince62s commented 1 year ago

Hello,

In [1]: import tiktoken
In [2]: enc = tiktoken.get_encoding("gpt2")
In [3]: enc.encode("This is good.\n\n")
Out[3]: [1212, 318, 922, 13, 628]
In [4]: enc.encode("This is good.\n\nBut in a way.")
Out[4]: [1212, 318, 922, 13, 198, 198, 1537, 287, 257, 835, 13]

How do we explain the fact that the double line break is encoded as a single ID (628) at the end of the string, but as two separate IDs (198, 198) when it appears in the middle of the text?

Thanks.

hauntsaninja commented 1 year ago

Second-to-last clause of https://github.com/openai/tiktoken/blob/5d970c1100d3210b42497203d6b5c1e30cfda6cb/tiktoken_ext/openai_public.py#L18
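
That clause is \s+(?!\S): greedy whitespace that must not be followed by a non-whitespace character. A minimal sketch of the pre-tokenisation split it produces (the pattern is copied from the linked file; the third-party regex module is assumed, since the stdlib re does not support \p{...}):

import regex  # tiktoken compiles pat_str with the `regex` module, not `re`

# GPT-2 pre-tokenisation pattern from tiktoken_ext/openai_public.py.
pat = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

# At the end of the string the lookahead succeeds, so "\n\n" stays one
# pretoken and BPE merges it into a single token (628).
print(pat.findall("This is good.\n\n"))
# ['This', ' is', ' good', '.', '\n\n']

# Before "But", the lookahead forces a split: the first "\n" matches
# \s+(?!\S) (the next character is still whitespace), while the second "\n"
# falls through to the final \s+ clause, so BPE sees two separate "\n"
# pretokens (198, 198).
print(pat.findall("This is good.\n\nBut in a way."))
# ['This', ' is', ' good', '.', '\n', '\n', 'But', ' in', ' a', ' way', '.']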

vince62s commented 1 year ago

I understand the mechanics, but what is the reason behind this behavior?

hauntsaninja commented 1 year ago

Better compression
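
Presumably the win is that a trailing "\n\n" (the end of a paragraph or prompt) costs one token instead of two. The effect is easy to isolate, reusing the IDs from the session above:

import tiktoken

enc = tiktoken.get_encoding("gpt2")

# A trailing double newline survives pre-tokenisation as one pretoken,
# so BPE can merge it into a single ID.
print(enc.encode("\n\n"))     # [628]

# Followed by text, it is split into two pretokens first, so each "\n"
# is encoded on its own.
print(enc.encode("\n\nBut"))  # [198, 198, 1537]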

vince62s commented 1 year ago

Sorry to insist, but if this is the only reason, why isn't merging "\n\n" in the middle of the text also considered "better compression"?