openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Custom tokenizer fails to encode despite characters being in mergeable_ranks #289

Open afang-story opened 2 months ago

afang-story commented 2 months ago

Hello,

I'm trying to create a custom tokenizer, but I'm getting "pyo3_runtime.PanicException: no entry found for key" despite being sure every character in the input is covered by mergeable_ranks. This seems to happen when a character that requires multiple bytes is immediately followed by another character.

Here is a simple example for reproducibility:

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'“'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens,
)
print(enc.encode("a“"))  # this works, [1, 0]
print(enc.encode("“a"))  # panics
```

Any ideas for how to fix this?

Thanks in advance for the help
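One likely reading of the panic, sketched below in pure Python (illustrative names, not tiktoken's actual Rust implementation): the cl100k pattern splits "a“" into two pieces, "a" and "“", both of which are in the vocabulary, but its `[^\r\n\p{L}\p{N}]?+\p{L}+` branch groups "“a" into a single piece. That piece is not in mergeable_ranks, so it is split into individual bytes; since no single byte (and no intermediate byte pair) has a rank, the final lookup fails, which would correspond to the "no entry found for key" panic.

```python
# Simplified sketch of the greedy byte-pair merge tiktoken applies
# to each regex piece. Not tiktoken's real implementation.
def byte_pair_encode(piece: bytes, ranks: dict) -> list:
    # Whole piece already a known token: return its rank directly.
    if piece in ranks:
        return [ranks[piece]]
    # Otherwise start from single bytes and repeatedly merge the
    # adjacent pair with the lowest rank until no pair is mergeable.
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while True:
        best = None
        for i in range(len(parts) - 1):
            pair = parts[i] + parts[i + 1]
            if pair in ranks and (best is None or ranks[pair] < ranks[best[1]]):
                best = (i, pair)
        if best is None:
            break
        i, pair = best
        parts = parts[:i] + [pair] + parts[i + 2:]
    # Every remaining part must itself be in the vocabulary; this
    # lookup is where the panic would originate.
    return [ranks[p] for p in parts]

ranks = {'“'.encode(): 0, 'a'.encode(): 1}
print(byte_pair_encode('a'.encode(), ranks))  # [1]
print(byte_pair_encode('“'.encode(), ranks))  # [0]
try:
    # "“a" arrives as one piece, so it must be rebuilt from bytes
    byte_pair_encode('“a'.encode(), ranks)
except KeyError as e:
    print("missing single-byte entry:", e)
```

Under this reading, "a“" only works because the regex happens to cut it at the character boundary, so each piece hits the whole-piece lookup.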

Muennighoff commented 2 months ago

It also happens the other way round, with the multi-byte non-Latin character coming second, e.g.:

```python
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens,
)
print(enc.encode("aか"))  # panics
```

Maybe there's a setting that needs to be changed, or some byte-level fallback that needs to be added, to cover this?