openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Cannot find the educational submodule in tiktoken #147

Status: Closed (bkowshik closed this issue 1 year ago)

bkowshik commented 1 year ago

Following the README (https://github.com/openai/tiktoken/blob/main/README.md), I tried the import below, which did not work. The README says:

tiktoken contains an educational submodule that is friendlier if you want to learn more about the details of BPE, including code that helps visualise the BPE procedure:

Version: tiktoken==0.4.0

from tiktoken._educational import *

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[49], line 1
----> 1 from tiktoken._educational import *

ModuleNotFoundError: No module named 'tiktoken._educational'

hauntsaninja commented 1 year ago

Thanks, I haven't released a version with it yet; I will do so soon. It's visible in the repo, so it is available if you install from source: https://github.com/openai/tiktoken/blob/main/tiktoken/_educational.py

dvolgyes commented 1 year ago

Sorry for hijacking the issue, but is there any educational material or description of the token-to-code pairs? In particular, I would like to add a logit bias to the stop token to make shorter or longer answers more likely for gpt-3.5/4, but I haven't found any way to figure out how the stop token is encoded. Maybe an example or a description of that could be added to the educational submodule.

microsoftbuild commented 1 year ago

@dvolgyes

The <|im_end|> token marks the end of a message; in our case, it marks the end of the assistant message. The "Extending tiktoken" code snippet in the README shows that it is mapped to token 100265. However, the API doesn't allow logit bias for tokens higher than 100257.
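
For reference, the mapping described above can be reproduced roughly as follows. This is only a sketch modelled on the "Extending tiktoken" snippet in the README: the encoding name `cl100k_im` and the two token IDs come from that snippet, while the helper names `build_chat_encoding` and `end_of_message_id` are hypothetical. It also relies on private attributes (`_pat_str`, `_mergeable_ranks`, `_special_tokens`), which may change between releases.

```python
# Special-token IDs as given in the README's "Extending tiktoken" snippet.
CHAT_SPECIAL_TOKENS = {
    "<|im_start|>": 100264,
    "<|im_end|>": 100265,
}


def build_chat_encoding():
    """Return cl100k_base extended with the chat special tokens.

    Hypothetical helper: it needs tiktoken installed, and get_encoding
    may download the vocabulary the first time it is called.
    """
    import tiktoken  # imported lazily so the constants above work without it

    base = tiktoken.get_encoding("cl100k_base")
    return tiktoken.Encoding(
        name="cl100k_im",
        pat_str=base._pat_str,
        mergeable_ranks=base._mergeable_ranks,
        special_tokens={**base._special_tokens, **CHAT_SPECIAL_TOKENS},
    )


def end_of_message_id(enc):
    """Look up the ID that <|im_end|> encodes to in an extended encoding."""
    # Special tokens must be explicitly allowed, otherwise encode() raises.
    (token_id,) = enc.encode("<|im_end|>", allowed_special={"<|im_end|>"})
    return token_id
```

With tiktoken installed, `end_of_message_id(build_chat_encoding())` should return 100265, in line with the comment above. Whether the API accepts that ID for logit bias is a separate question that this sketch does not answer.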

dvolgyes commented 1 year ago

@microsoftbuild Thanks for the explanation! Too bad, it would be a neat way to steer the model.

hauntsaninja commented 1 year ago

This was released in 0.5.0
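
For anyone landing here with the same traceback, a quick way to check whether the installed release already ships the submodule is sketched below. It uses only the standard library; the function name `educational_status` is hypothetical, and the check assumes the module keeps the name `tiktoken._educational` used in this thread.

```python
import importlib.util
from importlib import metadata


def educational_status():
    """Return (installed_version, has_educational) for this environment.

    installed_version is None when tiktoken is not installed at all.
    """
    try:
        version = metadata.version("tiktoken")
    except metadata.PackageNotFoundError:
        return None, False

    try:
        # find_spec imports the parent package, so guard against import errors.
        spec = importlib.util.find_spec("tiktoken._educational")
    except ImportError:
        spec = None
    return version, spec is not None


version, available = educational_status()
print(f"tiktoken version: {version}, _educational available: {available}")
```

On 0.4.0 this is expected to report that the submodule is unavailable, matching the traceback in the original report; per the comment above, upgrading to 0.5.0 or later should change that.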