Open afang-story opened 2 months ago
Hello,

I'm trying to create a custom tokenizer but am getting "pyo3_runtime.PanicException: no entry found for key" despite being sure of coverage. This seems to happen when a character that requires multiple bytes is immediately followed by another character. It also happens the other way round with non-Latin characters, i.e. when the multi-byte character comes after another character.

Here is a simple example for reproducibility:
import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

tik_vocab = {'か'.encode(): 0, 'a'.encode(): 1}
tik_special_tokens = {}

enc = tiktoken.Encoding(
    name="tik_test",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens=tik_special_tokens,
)

print(enc.encode("aか"))
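As far as I can tell (worth double-checking against the tiktoken source), the panic comes from the byte pair merging fallback rather than from a missing character as such: the cl100k_base pattern groups consecutive letters into a single piece, so "aか" is not looked up as 'a' and 'か' separately but as one chunk whose raw bytes (b'a\xe3\x81\x8b') then have to be assembled from entries in mergeable_ranks. Individual bytes such as b'\xe3' are not in tik_vocab, so the lookup panics with "no entry found for key". A quick way to see the chunking, reusing pat_str from the snippet above and the third-party regex module (the stdlib re cannot evaluate this pattern):

import regex

# Both characters are letters, so the pattern keeps them in one piece.
print(regex.findall(pat_str, "aか"))  # ['aか']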
Maybe there's some setting that needs to be changed, or some fallback that needs to be added, to cover this?
Any ideas for how to fix this?
Thanks in advance for the help
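In case it is useful, here is a minimal sketch of a vocabulary that avoids the panic, assuming the diagnosis above is right: every single byte gets its own rank (as the stock encodings do), and each multi-byte token is made reachable by also including its intermediate merge step. The name "tik_test_fixed" and the specific ranks are made up for illustration.

import tiktoken

cl100k_base = tiktoken.get_encoding("cl100k_base")
pat_str = cl100k_base._pat_str

# One token per single byte (ranks 0-255), so byte pair merging always
# has an entry to fall back on.
tik_vocab = {bytes([i]): i for i in range(256)}

# 'か' encodes to b'\xe3\x81\x8b'. For it to come out as one token, the
# intermediate merge result has to be present as well, so add the
# two-byte prefix before the full three-byte sequence.
ka = 'か'.encode()
tik_vocab[ka[:2]] = 256  # b'\xe3\x81'
tik_vocab[ka] = 257      # b'\xe3\x81\x8b'

enc = tiktoken.Encoding(
    name="tik_test_fixed",
    pat_str=pat_str,
    mergeable_ranks=tik_vocab,
    special_tokens={},
)

print(enc.encode("aか"))  # expected: [97, 257] instead of a panic
print(enc.encode("かa"))  # the original failing order works too

If the goal is a larger custom vocabulary, the same idea should apply: keep the 256 single-byte tokens and make sure every longer token can be built by successive merges of lower-ranked entries.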