openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

How to import a sentencepiece BPE vocabulary into tiktoken? #127

Closed qiancheng99 closed 1 year ago

qiancheng99 commented 1 year ago

I have trained a BPE vocab using sentencepiece. How can I import it into tiktoken? I tried passing my own vocab as mergeable_ranks, but when I try to encode, it raises pyo3_runtime.PanicException: no entry found for key.

For example, I set up my mergeable_ranks dictionary with d["looking".encode()]=0 and d["at".encode()]=1. When I try to encode "looking at", it shows:

thread '<unnamed>' panicked at 'no entry found for key', src\lib.rs:104:40
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Traceback (most recent call last):
  File "C:/Users/Administrator/PycharmProjects/spm2tiktoken/spm2tiktoken.py", line 81, in <module>
    print(enc.encode("looking at"))
  File "C:\Users\Administrator\PycharmProjects\ModelScope\llama_test\lib\site-packages\tiktoken\core.py", line 120, in encode
    return self._core_bpe.encode(text, allowed_special)
pyo3_runtime.PanicException: no entry found for key

hauntsaninja commented 1 year ago

"looking at" contains a space, and the space has no entry in the ranks you've provided. At the very least, you'll want one token for each of the 256 possible byte values.
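
A minimal sketch of that advice, assuming the custom tokens from the question. The helper names (build_ranks, missing_bytes, build_encoding), the encoding name, and the split pattern are hypothetical choices made for illustration, not part of tiktoken. Note that putting the 256 byte tokens first shifts the custom tokens to ranks 256 and up, so the ids differ from the 0 and 1 used in the question.

```python
def build_ranks(extra_tokens):
    """Return a mergeable_ranks dict: 256 single-byte tokens, then extra tokens."""
    # Ranks 0..255 cover every possible byte, so any input can fall back to bytes.
    ranks = {bytes([b]): b for b in range(256)}
    for token in extra_tokens:
        if token not in ranks:
            ranks[token] = len(ranks)  # next free rank, in the order given
    return ranks


def missing_bytes(ranks, text):
    """Return the single bytes of `text` that have no rank of their own."""
    seen = sorted(set(text.encode("utf-8")))
    return [bytes([b]) for b in seen if bytes([b]) not in ranks]


def build_encoding(ranks):
    """Wrap the ranks in a tiktoken.Encoding (requires tiktoken to be installed)."""
    import tiktoken  # third-party; imported lazily so the helpers above run without it

    return tiktoken.Encoding(
        name="custom_sketch",   # hypothetical name
        pat_str=r"\S+|\s+",     # assumed split pattern: runs of non-space or space
        mergeable_ranks=ranks,
        special_tokens={},
    )


# The dictionary from the question has no entry for the space (or any single byte).
original = {b"looking": 0, b"at": 1}
print(missing_bytes(original, "looking at"))

# With the byte-level tokens added, nothing is missing.
ranks = build_ranks([b"looking", b"at"])
print(missing_bytes(ranks, "looking at"))
```

Calling build_encoding(ranks).encode("looking at") should then no longer hit the missing-key panic. For a real sentencepiece vocabulary, the order of the extra tokens matters as well, since tiktoken treats lower ranks as higher-priority merges.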