openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

get_encoding error for gpt2, but other encodings fine #63

Closed. mobilestack closed this issue 1 year ago.

mobilestack commented 1 year ago

The code looks like this:

import tiktoken

# runs ok
encoding2 = tiktoken.get_encoding("cl100k_base")

# runs ok
encoding4 = tiktoken.encoding_for_model("gpt-3.5-turbo")

# runs ok
encoding3 = tiktoken.get_encoding("p50k_base")

# fails with AssertionError
encoding3 = tiktoken.get_encoding("gpt2")

The error message is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[11], line 2
      1 # runs error
----> 2 encoding3 = tiktoken.get_encoding("gpt2")

File ~/work/venv310/lib/python3.10/site-packages/tiktoken/registry.py:63, in get_encoding(encoding_name)
     60     raise ValueError(f"Unknown encoding {encoding_name}")
     62 constructor = ENCODING_CONSTRUCTORS[encoding_name]
---> 63 enc = Encoding(**constructor())
     64 ENCODINGS[encoding_name] = enc
     65 return enc

File ~/work/venv310/lib/python3.10/site-packages/tiktoken_ext/openai_public.py:11, in gpt2()
     10 def gpt2():
---> 11     mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
     12         vocab_bpe_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/vocab.bpe",
     13         encoder_json_file="https://openaipublic.blob.core.windows.net/gpt-2/encodings/main/encoder.json",
     14     )
     15     return {
     16         "name": "gpt2",
     17         "explicit_n_vocab": 50257,
   (...)
     20         "special_tokens": {"<|endoftext|>": 50256},
     21     }

File ~/work/venv310/lib/python3.10/site-packages/tiktoken/load.py:95, in data_gym_to_mergeable_bpe_ranks(vocab_bpe_file, encoder_json_file)
     93 encoder_json_loaded.pop(b"<|endoftext|>", None)
     94 encoder_json_loaded.pop(b"<|startoftext|>", None)
---> 95 assert bpe_ranks == encoder_json_loaded
     97 return bpe_ranks

AssertionError: 

Following the steps you suggested in another issue, I ran:

python --version
python -c 'import platform; print(platform.platform())'
python -m venv env
source env/bin/activate
env/bin/python -m pip install wheel
env/bin/python -m pip install tiktoken
env/bin/python -c 'import tiktoken; print(tiktoken.get_encoding("gpt2"))'
env/bin/python -c 'import site; import os; print(os.listdir(site.getsitepackages()[0]))'

I don't have a python command, only python3, so I ran everything inside a venv.

The results were as follows:

Python 3.10.3
macOS-13.2-arm64-arm-64bit

(venv310) ➜  ~ pip install wheel
Requirement already satisfied: wheel in ./work/venv310/lib/python3.10/site-packages (0.40.0)
(venv310) ➜  ~ pip install tiktoken
Requirement already satisfied: tiktoken in ./work/venv310/lib/python3.10/site-packages (0.3.1)
Requirement already satisfied: regex>=2022.1.18 in ./work/venv310/lib/python3.10/site-packages (from tiktoken) (2022.10.31)
Requirement already satisfied: requests>=2.26.0 in ./work/venv310/lib/python3.10/site-packages (from tiktoken) (2.28.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (1.26.9)
Requirement already satisfied: idna<4,>=2.5 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in ./work/venv310/lib/python3.10/site-packages (from requests>=2.26.0->tiktoken) (2022.12.7)
(venv310) ➜  ~ python -c 'import site; import os; print(os.listdir(site.getsitepackages()[0]))'
['shellingham-1.5.0.post1.dist-info', 'fastjsonschema', 'dataclasses_json-0.5.7.dist-info', 'typing_extensions-4.5.0.dist-info', 'commonmark-0.9.1.dist-info', 'talib', 'weibo_spider', 'multidict-6.0.4.dist-info', 'async_timeout', 'marshmallow', 'importlib_metadata-4.10.1.dist-info', 'appnope', 'packaging', 'fonttools-4.31.1.dist-info', 'aiohttp', 'rfc3339_validator-0.1.4.dist-info', 'appnope-0.1.3.dist-info', 'certifi-2022.12.7.dist-info', 'pyrsistent-0.19.3.dist-info', 'altgraph-0.17.3.dist-info', 'wcwidth-0.2.6.dist-info', 'qdarkstyle', 'fqdn-1.5.1.dist-info', 'decorator-5.1.1.dist-info', 'tokenizers', 'ffmpeg', 'jupyter_client-8.0.3.dist-info', 'wcwidth', 'idna-3.3.dist-info', 'Jinja2-3.1.2.dist-info', 'websocket', 'markupsafe', 'integv', 'deap', 'jupyter_core', 'lxml-4.8.0.dist-info', 'vnpy_algotrading-1.0.2.dist-info', 'pandocfilters-1.5.0.dist-info', 'ptyprocess-0.7.0.dist-info', 'widgetsnbextension', 'aiosignal-1.3.1.dist-info', 'pytz-2022.1.dist-info', 'bs4-0.0.1.dist-info', 'isoduration-20.11.0.dist-info', 'webcolors-1.12.dist-info', 'webencodings', 'huggingface_hub', 'tinycss2-1.2.1.dist-info', 'yt_dlp', 
'backcall', 'websocket_client-1.5.1.dist-info', 'bleach-6.0.0.dist-info', 'defusedxml', '.DS_Store', 'nbformat', 'mistune', 'webencodings-0.5.1.dist-info', 'shiboken6', 'attrs', 'colorama-0.4.6.dist-info', 'pyrsistent', 'python_dateutil-2.8.2.dist-info', 'pycryptodomex-3.17.dist-info', 'debugpy-1.6.6.dist-info', 'bleach', 'pygments', 'TA_Lib-0.4.24.dist-info', 'pure_eval', 'aiofiles', 'pyparsing-3.0.7.dist-info', 'gpt_index-0.4.28.dist-info', 'asttokens-2.2.1.dist-info', 'pycparser', 'async_timeout-4.0.2.dist-info', 'more_itertools-9.1.0.dist-info', 'soupsieve-2.3.2.post1.dist-info', 
'nbclient-0.7.2.dist-info', 'python_json_logger-2.0.7.dist-info', 'jupyter_server_terminals-0.4.4.dist-info', 'jupyterlab-3.6.1.dist-info', 'vnpy_ctastrategy', 'pylab.py', 'defusedxml-0.7.1.dist-info', 'ipywidgets', 'typer', 'xvideos_dl', 'marshmallow-3.19.0.dist-info', 'youtube_dl', 'shiboken6-6.2.3.dist-info', 'argon2', 'jupyter_core-5.2.0.dist-info', 'pyparsing', 'debugpy', 'cursor', 'requests-2.28.2.dist-info', 'pickleshare-0.7.5.dist-info', 'vnpy_paperaccount', 'stack_data-0.6.2.dist-info', 'stack_data', 'past', 'langchain', 'QDarkStyle-3.0.3.dist-info', 'jinja2', 'nest_asyncio-1.5.6.dist-info', 'jupyter_events-0.6.3.dist-info', 'arrow', 'IPython', 'soupsieve', 'frozenlist', 'Send2Trash-1.8.0.dist-info', 'jupyter_client', 'parso-0.8.3.dist-info', 'seaborn', 'isoduration', 'executing-1.2.0.dist-info', 'six-1.16.0.dist-info', 'mypy_extensions-1.0.0.dist-info', 'EbookLib-0.18.dist-info', 'peewee-3.14.10.dist-info', 'decorator.py', 'filelock-3.9.0.dist-info', 'jupyterlab_widgets-3.0.5.dist-info', 'jupyterlab_plotly', 'llvmlite', 'ipywidgets-8.0.4.dist-info', '_cffi_backend.cpython-310-darwin.so', 'mutagen', 'jsonpointer.py', 'notebook_shim', 'numba-0.56.4.dist-info', 'future-0.18.3.dist-info', 'xvideos_dl-1.3.0.dist-info', 'colorama', 'cffi', 'vnpy_spreadtrading-1.1.4.dist-info', 'aiofiles-22.1.0.dist-info', 
'executing', 'jsonpointer-2.3.dist-info', 'ipykernel_launcher.py', 'llama_index', 'matplotlib_inline', 
'jupyterlab_server', 'jedi', 'send2trash', 'PySide6-6.2.3.dist-info', 'pip-23.0.1.dist-info', 'tests', 'absl', 'ipython_genutils', 'jupyter_server-2.4.0.dist-info', 'Babel-2.12.1.dist-info', 'fqdn', 'youtube_dl-2021.12.17.dist-info', 'vnpy_sqlite', 'fontTools', 'argon2_cffi-21.3.0.dist-info', 'idna', 'json5-0.9.11.dist-info', 'prometheus_client-0.16.0.dist-info', 'importlib_metadata', 'tqdm-4.64.0.dist-info', 
'_argon2_cffi_bindings', 'wheel', 'bs4', 'click', 'pickleshare.py', 'plotly-5.5.0.dist-info', 'tenacity', 'torch', 'comm', 'websockets', 'ipykernel-6.21.3.dist-info', 'aiosqlite', 'mpl_toolkits', 'pytz', 'jupyter_server_fileid-0.8.0.dist-info', 'filelock', 
'langchain-0.0.109.dist-info', 'pydantic-1.10.6.dist-info', 'tiktoken-0.3.1.dist-info', '__pycache__', 'jupyter_ydoc-0.2.3.dist-info', 'transformers-4.26.1.dist-info', 'nbclassic', 'arrow-1.2.3.dist-info', 'altgraph', 'sqlalchemy', 'pyqtgraph', 'shellingham', 'vnpy_ctastrategy-1.0.8.dist-info', 'regex', 'platformdirs-3.1.1.dist-info', 'Pillow-9.0.1.dist-info', 'jupyter_events', 'nbclient', 'plotly', 'numpy', 'jupyterlab_pygments', 'more_itertools', 'SQLAlchemy-1.4.46.dist-info', 'notebook-6.5.3.dist-info', 'pycparser-2.21.dist-info', 'charset_normalizer', 'PIL', 'requests', 'click-7.1.2.dist-info', 'cursor-1.3.5.dist-info', 'absl_py-1.0.0.dist-info', 'pure_eval-0.2.2.dist-info', 'pwiz.py', 'backcall-0.2.0.dist-info', 
'zipp.py', '_plotly_utils', 'ypy_websocket', 'matplotlib-3.5.1-py3.10-nspkg.pth', 'multidict', 'anyio', 'pip', 'cycler-0.11.0.dist-info', 'babel', 'marshmallow_enum', 'tornado', 'pvectorc.cpython-310-darwin.so', 'tomli', 'dataclasses_json', 'seaborn-0.11.2.dist-info', 'jupyter_server_fileid', 'PySide6', 'matplotlib_inline-0.1.6.dist-info', 'nbformat-5.7.3.dist-info', 'jupyterlab_server-2.20.0.dist-info', 'certifi', 'prompt_toolkit', 'pandocfilters.py', 'terminado-0.17.1.dist-info', 'pyinstaller_hooks_contrib-2023.0.dist-info', 'distutils-precedence.pth', 'pyqtgraph-0.12.3.dist-info', 'ipython_genutils-0.2.0.dist-info', 'vnpy_spreadtrading', 'weibo_spider-0.3.0.dist-info', 'sniffio', 'attr', 'pexpect', 'tiktoken', '_pyinstaller_hooks_contrib', 'transformers', 'jsonschema', 'jupyter_ydoc', 'tqdm', 'tzlocal-2.0.0.dist-info', 'PyYAML-6.0.dist-info', 
'yt_dlp-2023.3.4.dist-info', 'Brotli-1.0.9.dist-info', 'jupyterlab_pygments-0.2.2.dist-info', 'mypy_extensions.py', 'ffmpeg_python-0.2.0.dist-info', 'kiwisolver.cpython-310-darwin.so', 'torch-1.13.1.dist-info', 'tokenizers-0.13.2.dist-info', 'MarkupSafe-2.1.2.dist-info', '_yaml', 'huggingface_hub-0.13.2.dist-info', 'aiosqlite-0.18.0.dist-info', 'ptyprocess', 'six.py', 'jupyter_server_terminals', 'playhouse', 'vnpy_algotrading', 'pandas-1.3.5.dist-info', 'json5', 'tinycss2', 'jupyter_server_ydoc-0.6.1.dist-info', 'pexpect-4.8.0.dist-info', 'rfc3339_validator.py', 'macholib-1.16.2.dist-info', 'brotli.py', 'rich', 'cycler.py', 'cffi-1.15.1.dist-info', 'urllib3-1.26.9.dist-info', 'nbclassic-0.5.3.dist-info', 'regex-2022.10.31.dist-info', 'matplotlib', 'yaml', 'prometheus_client', 'vnpy', 'uri_template-1.2.0.dist-info', 'frozenlist-1.3.3.dist-info', 'attrs-22.2.0.dist-info', 'ebooklib', 'rfc3986_validator.py', 'jupyter_server', 'pythonjsonlogger', 
'tiktoken_ext', 'scipy-1.8.0.dist-info', 'numba', 'torchgen', 'urllib3', 'nbconvert', 'wheel-0.40.0.dist-info', 'comm-0.1.2.dist-info', 'rfc3986_validator-0.1.1.dist-info', 'tomli-2.0.1.dist-info', 'ipython-8.11.0.dist-info', 'integv-1.3.0.dist-info', 'rich-10.16.2.dist-info', 'widgetsnbextension-4.0.5.dist-info', 'uri_template', 'prompt_toolkit-3.0.38.dist-info', 'macholib', 'asttokens', 'jupyterlab', 'Cryptodome', 'argon2_cffi_bindings-21.2.0.dist-info', 
'setuptools', 'marshmallow_enum-1.5.1.dist-info', 'Pygments-2.14.0.dist-info', 'numpy-1.21.5.dist-info', 'pkg_resources', 'notebook', 'tenacity-8.2.2.dist-info', 'setuptools-57.0.0.dist-info', 'charset_normalizer-2.0.12.dist-info', '_distutils_hack', 'sniffio-1.3.0.dist-info', '_pyrsistent_version.py', 'pyzmq-25.0.1.dist-info', 'fastjsonschema-2.16.3.dist-info',
 'vnpy-3.0.0.dist-info', 'llvmlite-0.39.1.dist-info', 'notebook_shim-0.2.2.dist-info', 'terminado', 'tornado-6.2.dist-info', 'openai_whisper-20230308.dist-info', 'websockets-10.4.dist-info', 'parso', 'pydantic', 'ypy_websocket-0.8.2.dist-info', 'zipp-3.7.0.dist-info', 'QtPy-2.0.1.dist-info', 'mutagen-1.46.0.dist-info', 'webcolors.py', 'y_py-0.5.9.dist-info', 'beautifulsoup4-4.11.2.dist-info', 'anyio-3.6.2.dist-info', 'openai-0.27.2.dist-info', 'typer-0.3.2.dist-info', 
'peewee.py', 'psutil', 'traitlets', 'libfuturize', 'nbconvert-7.2.9.dist-info', 'matplotlib-3.5.1.dist-info', 'mistune-2.0.5.dist-info', 'future', 'typing_inspect.py', 'lxml', 'aiohttp-3.8.4.dist-info', 'typing_inspect-0.8.0.dist-info', 'scipy', 'vnpy_sqlite-1.0.0.dist-info', 'yarl', 'vnpy_ctabacktester', 'functorch', 'vnpy_paperaccount-1.0.1.dist-info', 'zmq', 'packaging-21.3.dist-info', 'yarl-1.8.2.dist-info', 'qtpy', 'vnpy_ctabacktester-1.0.5.dist-info', 
'kiwisolver-1.4.0.dist-info', 'libpasteurize', '_brotli.cpython-310-darwin.so', 
'plotlywidget', 'ipykernel', 'tzlocal', 'aiosignal', '_plotly_future_', 'jedi-0.18.2.dist-info', 
'y_py', 'pandas', 'dateutil', 'commonmark', 'nest_asyncio.py', 'openai', 'typing_extensions.py', 'whisper', 'gpt_index', 'platformdirs', 'llama_index-0.4.28.dist-info', 'jupyterlab_widgets', 'jupyter.py', 'deap-1.3.1.dist-info', 
'psutil-5.9.4.dist-info', 'traitlets-5.9.0.dist-info', 'jsonschema-4.17.3.dist-info', 'jupyter_server_ydoc']

Hopefully there is a solution. Many thanks!

hauntsaninja commented 1 year ago

Hm, thanks for the detailed environment information, but I'm not able to reproduce.

Can you set export TIKTOKEN_CACHE_DIR="" and retry? This environment variable will prevent tiktoken from using a cache for the vocab files it downloads.

Note that even in the simple publicly available tests this code path is tested: https://github.com/openai/tiktoken/blob/3e8620030c68d2fd6d4ec6d38426e7a1983661f5/tests/test_simple_public.py#L9

mobilestack commented 1 year ago

I tried setting the environment variable, but that did not solve the problem. Is there a specific path for the cache? I might need to delete the cache manually.

hauntsaninja commented 1 year ago

The logic is here: https://github.com/openai/tiktoken/blob/3e8620030c68d2fd6d4ec6d38426e7a1983661f5/tiktoken/load.py#L33

So typically the cache is at the path printed by python -c 'import tempfile; import os; print(os.path.join(tempfile.gettempdir(), "data-gym-cache"))'
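
For anyone who prefers a script over the one-liner, here is a minimal sketch of the same lookup plus a manual cleanup. It is not part of tiktoken and relies only on what is stated in this thread: TIKTOKEN_CACHE_DIR overrides the location, an empty value disables the cache, and the default is data-gym-cache under the system temp directory. The helper names resolve_cache_dir and clear_cache are hypothetical.

```python
import os
import tempfile


def resolve_cache_dir(environ=None):
    """Return the cache directory, or None when caching is disabled.

    Hypothetical helper mirroring the behaviour described in this thread.
    """
    env = os.environ if environ is None else environ
    if "TIKTOKEN_CACHE_DIR" in env:
        cache_dir = env["TIKTOKEN_CACHE_DIR"]
    else:
        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")
    # An empty string means "do not use a cache".
    return cache_dir or None


def clear_cache(cache_dir):
    """Delete regular files directly inside cache_dir; return their names."""
    removed = []
    if cache_dir is None or not os.path.isdir(cache_dir):
        return removed
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Only prints the location; call clear_cache() explicitly to delete files.
    print(resolve_cache_dir())
```

Calling clear_cache(resolve_cache_dir()) removes the cached vocab files, so the next get_encoding call should download them again.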

hauntsaninja commented 1 year ago

If that doesn't help, maybe you could set a breakpoint and see what the difference between those two dictionaries is.
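
As a rough sketch of what that comparison could look like at the breakpoint, the helper below reports keys present on only one side and keys whose values differ. The name diff_dicts is hypothetical, and the sample dictionaries are made-up stand-ins for bpe_ranks and encoder_json_loaded, not real vocab data.

```python
def diff_dicts(left, right):
    """Summarise how two dictionaries differ.

    Returns (only_left, only_right, changed), where changed maps each
    shared key to a (left_value, right_value) pair.
    """
    only_left = {k: left[k] for k in left.keys() - right.keys()}
    only_right = {k: right[k] for k in right.keys() - left.keys()}
    changed = {
        k: (left[k], right[k])
        for k in left.keys() & right.keys()
        if left[k] != right[k]
    }
    return only_left, only_right, changed


if __name__ == "__main__":
    # Toy stand-ins with made-up values, for illustration only.
    bpe_ranks = {b"a": 0, b"b": 1, b"ab": 2}
    encoder_json_loaded = {b"a": 0, b"b": 5, b"c": 3}
    only_left, only_right, changed = diff_dicts(bpe_ranks, encoder_json_loaded)
    print("only in bpe_ranks:", only_left)
    print("only in encoder_json_loaded:", only_right)
    print("different values:", changed)
```

If one side turns out to be much smaller or empty, that would point to a truncated or corrupted cached file rather than a logic error.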

mobilestack commented 1 year ago

Woo, that works! After I deleted the cached files, it runs correctly now. Thanks a lot!

The file may have been corrupted during or after the download. It might be worth validating the cached file before using it, or having the assert bpe_ranks == encoder_json_loaded line print more information when it fails.
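
To illustrate the first idea, here is a minimal sketch of checking a cached file against a known checksum before trusting it. This is an assumption about how such a check could work, not a description of what tiktoken 0.3.1 does; the function name read_verified and the expected_sha256 parameter are hypothetical.

```python
import hashlib
import os


def read_verified(cache_path, expected_sha256):
    """Return the cached bytes if their SHA-256 matches expected_sha256.

    If the file is missing, return None. If the hash does not match,
    delete the file and return None so the caller can download it again.
    """
    if not os.path.exists(cache_path):
        return None
    with open(cache_path, "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() == expected_sha256:
        return data
    # Corrupted or truncated cache entry: remove it rather than reuse it.
    os.remove(cache_path)
    return None
```

With a check like this, a bad download would be discarded and fetched again instead of surfacing later as a bare AssertionError.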