openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Custom tokenizer fails to encode despite characters being in mergeable_ranks #289

Closed afang-story closed 1 month ago

afang-story commented 6 months ago

Hello,

I'm trying to create a custom tokenizer but am getting "pyo3_runtime.PanicException: no entry found for key" despite being sure of coverage. This seems to happen when a character that requires multiple bytes is immediately followed by another character.

Here is a simple example for reproducibility:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'“'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("a“")) # this works, [1, 0]
print(enc.encode("“a")) # raises pyo3_runtime.PanicException: no entry found for key

Any ideas for how to fix this?

Thanks in advance for the help

Muennighoff commented 6 months ago

It also happens with non-Latin characters the other way round, e.g.:

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("aか")) # raises the same PanicException

Maybe there's some setting that needs to be changed / some fallbacks that need to be added that cover this?

djsaber commented 4 months ago

I'm having the same issue. Have you solved it?

hauntsaninja commented 1 month ago

>>> '“'.encode()
b'\xe2\x80\x9c'
>>> len('“'.encode())
3

You'll need to have individual bytes in your vocabulary.
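
A minimal sketch of what that can look like, assuming a toy vocabulary is acceptable for testing: give every single byte its own rank, so pieces the regex produces can always be broken down to byte-level tokens instead of hitting a missing key. (The name "tik_test" and the 256-entry byte vocabulary are illustrative, not anything tiktoken specifically requires.)

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

# Every individual byte gets a rank, so pieces that are not whole tokens
# can always fall back to byte-level tokens instead of panicking.
tik_vocab = {bytes([i]): i for i in range(256)}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens={},
)

print(enc.encode("“a"))  # no panic; '“' falls back to its three raw bytes

With only single bytes in the vocabulary, '“' encodes as three byte-level tokens rather than one merged token; the next point covers what it takes for a merged token to actually be produced.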

On top of that, tiktoken assumes that token index corresponds to merge priority (i.e. the sequence of merges that produces a token must pass through intermediate tokens with increasing rank values). https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/src/lib.rs#L25
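
Continuing the sketch above (my reading of that ordering requirement, not something spelled out in the thread): for the three-byte '“' token to be reachable, its two-byte intermediate merge has to be present with a lower rank, so each merge step lands on an increasing value.

import tiktoken

pat_str = tiktoken.get_encoding("cl100k_base")._pat_str

# Ranks increase along the merge path: single bytes (0-255), then the
# intermediate pair, then the full three-byte token.
tik_vocab = {bytes([i]): i for i in range(256)}
tik_vocab[b'\xe2\x80'] = 256      # intermediate merge for '“' (b'\xe2\x80\x9c')
tik_vocab['“'.encode()] = 257     # final token, ranked after its intermediate

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens={},
)

print(enc.encode("“a"))  # [257, 97]: '“' now merges up to its own token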