afang-story closed this issue 1 month ago
It also happens with non-Latin characters the other way round, e.g.:
import tiktoken
cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str
tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}
enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("aか"))
Maybe some setting needs to be changed, or some fallbacks need to be added, to cover this?
I'm having the same issue. Have you solved it?
Hello,
I'm trying to create a custom tokenizer but am getting "pyo3_runtime.PanicException: no entry found for key" despite being sure of coverage. This seems to happen when a character that requires multiple bytes is immediately followed by another character.
Here is a simple example for reproducibility:
import tiktoken
cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str
tik_vocab = {'“'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}
enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens
)
print(enc.encode("a“")) # this works, [1, 0]
print(enc.encode("“a"))
Any ideas for how to fix this?
Thanks in advance for the help
>>> '“'.encode()
b'\xe2\x80\x9c'
>>> len('“'.encode())
3
You'll need to have individual bytes in your vocabulary.
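As a rough sketch of what "individual bytes in your vocabulary" could look like: the snippet below gives every one of the 256 possible byte values its own rank, then appends the multi-byte tokens from the examples in this thread. The layout (bytes at ranks 0..255, custom tokens afterwards) is an assumption made for illustration, not something tiktoken prescribes, and it changes the token ids compared to the original two-entry vocab.

```python
# Ranks 0..255: one entry per possible byte value, so any input has a fallback.
tik_vocab = {bytes([b]): b for b in range(256)}

# Append the multi-byte tokens from the examples after the single-byte entries.
# 'a' is a single byte, so it is already covered (rank 97).
for token in ("“".encode(), "か".encode()):
    tik_vocab.setdefault(token, len(tik_vocab))

# Check that every byte of the problematic inputs now has an entry.
missing = [b for b in "“a aか".encode() if bytes([b]) not in tik_vocab]
print(missing)  # → []
```

A dictionary like this could then be passed as mergeable_ranks in place of the two-entry tik_vocab. With only this change the lookup should no longer fail, but a multi-byte character next to another character may still come out as separate byte tokens rather than as its own token.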
On top of that, tiktoken assumes that a token's index corresponds to its merge priority (i.e. the sequence of merges that produces a token must pass through intermediate tokens whose values are in increasing order). See https://github.com/openai/tiktoken/blob/63527649963def8c759b0f91f2eb69a40934e468/src/lib.rs#L25
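To make the merge-priority point concrete, here is a simplified pure-Python sketch of rank-ordered byte-pair merging. It is not tiktoken's actual Rust implementation; build_ranks and bpe_encode are hypothetical helpers written for this illustration. It assumes a base vocabulary of all 256 single bytes, and it assumes each multi-byte token is built left to right, so every prefix of the token gets a rank lower than the token itself.

```python
def build_ranks(extra_tokens):
    """Hypothetical helper: 256 single bytes, then a prefix chain per token."""
    ranks = {bytes([b]): b for b in range(256)}
    for token in extra_tokens:
        # Add each prefix of length 2..len(token) so intermediate merges
        # exist and receive lower ranks than the full token.
        for end in range(2, len(token) + 1):
            ranks.setdefault(token[:end], len(ranks))
    return ranks


def bpe_encode(ranks, piece):
    """Simplified merge loop: always merge the adjacent pair of lowest rank."""
    parts = [bytes([b]) for b in piece]
    while True:
        candidates = [
            (ranks[parts[i] + parts[i + 1]], i)
            for i in range(len(parts) - 1)
            if parts[i] + parts[i + 1] in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    # A part without an entry raises KeyError here, loosely mirroring the
    # "no entry found for key" panic.
    return [ranks[p] for p in parts]


# With the intermediate merge b'\xe2\x80' present, the full token is reachable.
chained = build_ranks(["“".encode()])
print(bpe_encode(chained, "“a".encode()))  # → [257, 97]

# Without it, the three bytes never merge and stay as single-byte tokens.
flat = {bytes([b]): b for b in range(256)}
flat["“".encode()] = 256
print(bpe_encode(flat, "“a".encode()))  # → [226, 128, 156, 97]
```

The second call shows why listing only the final token is not enough: no sequence of pairwise merges leads to it, so it is only ever matched when it is looked up as a whole piece.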