openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

What is the difference between gpt2 encoding and r50k_base? #187

Closed by Mypathissional 11 months ago

Mypathissional commented 11 months ago

Hello,

I've been comparing the merge lists of "gpt2" and "r50k_base", and interestingly they are identical. Other configuration details, such as the regex pattern and the special tokens, are also the same. Given these similarities, I'm curious about the rationale behind having two distinct encodings.

Here's the code I used for the comparison:

from tiktoken.load import data_gym_to_mergeable_bpe_ranks, load_tiktoken_bpe

# "gpt2" is built from the original GPT-2 release files (vocab.bpe + encoder.json)
gpt2_merges = data_gym_to_mergeable_bpe_ranks(
    vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
    encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
)

# "r50k_base" comes from tiktoken's own single-file format
r50k_merges = load_tiktoken_bpe(
    "https://openaipublic.blob.core.windows.net/encodings/r50k_base.tiktoken"
)

print(r50k_merges == gpt2_merges)  # True
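The rest of the configuration can be checked the same way through the constructor functions tiktoken registers (a quick sketch; tiktoken_ext.openai_public is the module where these definitions live, and calling the constructors downloads the encoding files on first use):

import tiktoken_ext.openai_public as openai_public

gpt2_def = openai_public.gpt2()
r50k_def = openai_public.r50k_base()

# Same split regex and same special tokens in both definitions
print(gpt2_def["pat_str"] == r50k_def["pat_str"])                # True
print(gpt2_def["special_tokens"] == r50k_def["special_tokens"])  # True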

I'd appreciate any insights or explanations on this matter. Thank you!

hauntsaninja commented 11 months ago

There is no semantic difference. r50k_base is stored in a format that loads a little faster and has less redundancy. Additionally, keeping both around is a nice proof that tiktoken can load GPT-2's encoding, which is as close to a standard as this field gets :-)
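To see the equivalence end to end through the public API (a minimal sketch; both names ship as registered encodings, and the files are downloaded and cached on first use):

import tiktoken

# Both names resolve to the same merge ranks, regex pattern, and special
# tokens; only the on-disk format they were loaded from differs.
gpt2 = tiktoken.get_encoding("gpt2")
r50k = tiktoken.get_encoding("r50k_base")

text = "Hello, world! <|endoftext|>"
gpt2_ids = gpt2.encode(text, allowed_special="all")
r50k_ids = r50k.encode(text, allowed_special="all")

assert gpt2_ids == r50k_ids
print(gpt2_ids)

For context on the format: r50k_base.tiktoken stores one base64-encoded token and its integer rank per line, while the gpt2 loader reconstructs the same ranks from vocab.bpe plus encoder.json, which is where the speed and redundancy difference comes from.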