openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Add support for checking hash of downloaded files before use. #230

Closed mdwelsh closed 5 months ago

mdwelsh commented 6 months ago

We use tiktoken in several production scenarios, and the download of .tiktoken files (e.g., cl100k_base.tiktoken) is sometimes interrupted or fails, leaving a corrupted file in the cache. In those cases, the results returned from the encoder will be incorrect and could be damaging to our production instances.

More often, when this happens, Encoding.encode() will throw an exception such as

pyo3_runtime.PanicException: no entry found for key

which turns out to be quite hard to track down.

In an effort to make tiktoken more robust for production use, this PR adds the SHA-256 hash of each of the downloaded files to openai_public.py and extends read_file to verify that hash, if one is provided, whenever the file is read from the cache or downloaded directly. Errors are therefore flagged at file load time, rather than when the files are used, with a more meaningful error message indicating what might have gone wrong.

This also protects users of tiktoken from scenarios where a network issue or MITM attack could have corrupted these files in transit.
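
For readers who want a feel for the check described above, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the names check_hash and read_file_checked and the expected_hash parameter are hypothetical, chosen for illustration, and the sketch reads a local file only (no download or cache logic).

```python
import hashlib
from typing import Optional


def check_hash(data: bytes, expected_hash: str) -> bool:
    """Return True if the SHA-256 hex digest of data equals expected_hash."""
    return hashlib.sha256(data).hexdigest() == expected_hash


def read_file_checked(path: str, expected_hash: Optional[str] = None) -> bytes:
    """Read a file and, if a hash is given, verify it before returning.

    Hypothetical helper: illustrates failing at load time instead of
    letting a truncated or corrupted file reach the encoder.
    """
    with open(path, "rb") as f:
        data = f.read()
    if expected_hash is not None and not check_hash(data, expected_hash):
        raise ValueError(
            f"Hash mismatch for {path}: expected {expected_hash}. "
            "The file may be corrupted or incomplete; delete it and retry."
        )
    return data
```

With a check like this, a partially downloaded file fails immediately with a message naming the file, instead of surfacing later as an opaque panic during encoding.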

mdwelsh commented 6 months ago

Thanks! Anything I need to do to merge this?

hauntsaninja commented 5 months ago

Thank you!