We use tiktoken in several production scenarios and occasionally see the download of a .tiktoken file (e.g., cl100k_base.tiktoken) get interrupted or fail, leaving a corrupted file in the cache. In those cases, the encoder can return incorrect results, which could be damaging to our production instances.
More often, when this happens, Encoder.encode() will throw an exception such as
pyo3_runtime.PanicException: no entry found for key
which turns out to be quite hard to track down.
In an effort to make tiktoken more robust for production use, this PR adds the sha256 hash of each of the downloaded files to openai_public.py and augments read_file to verify that hash, when one is provided, whenever the file is read from the cache or downloaded directly. Errors are therefore flagged at file load time rather than when the file is first used, with a more meaningful error message indicating what might have gone wrong.
This also protects users of tiktoken from scenarios where a network issue or MITM attack could have corrupted these files in transit.
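
For illustration, the load-time check described above could look roughly like the following. This is a minimal sketch, not the code in this PR: the helper names check_hash and load_verified, the expected_hash parameter, and the error wording are all assumptions made for the example. It relies only on the standard-library hashlib module.

```python
import hashlib
from typing import Optional


def check_hash(data: bytes, expected_hash: str) -> bool:
    """Return True if the SHA-256 hex digest of data equals expected_hash.

    Hypothetical helper, named for this sketch only.
    """
    return hashlib.sha256(data).hexdigest() == expected_hash


def load_verified(data: bytes, expected_hash: Optional[str] = None) -> bytes:
    """Return data unchanged, or raise ValueError on a hash mismatch.

    When expected_hash is None, no check is performed, mirroring the
    "if provided" behaviour described in the text. Hypothetical helper.
    """
    if expected_hash is not None and not check_hash(data, expected_hash):
        raise ValueError(
            "Hash mismatch for downloaded or cached data. "
            "The file may be truncated or corrupted; "
            "delete the cached copy and try again."
        )
    return data


if __name__ == "__main__":
    payload = b"example file contents"
    digest = hashlib.sha256(payload).hexdigest()

    # Intact data passes the check and is returned as-is.
    load_verified(payload, digest)

    # Truncated data (simulating an interrupted download) fails at load time.
    try:
        load_verified(payload[:-5], digest)
    except ValueError as err:
        print(f"rejected: {err}")
```

Raising at load time means a corrupted cache entry surfaces as a clear error about the file itself, rather than as a later panic inside the encoder.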