openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License
11.61k stars 784 forks source link

[haystack][Backward compatibility] MODEL_TO_ENCODING instead of _MODEL_TO_ENCODING, similarly for MODEL_PREFIX_TO_ENCODING #193

Closed cppxaxa closed 11 months ago

cppxaxa commented 11 months ago

Problem

As of today, the haystack python library makes use of the constants "MODEL_TO_ENCODING" and "MODEL_PREFIX_TO_ENCODING" from the tiktoken library and with the tiktoken 0.5.0 version, it breaks.

Workaround

With haystack, we need to use tiktoken 0.4.0 instead of 0.5.0

Fix requested

Can we revert the variable name from the following commit? https://github.com/openai/tiktoken/commit/a7937793e7460ae8e815f506600b7248fad2a803#diff-0d973848bd229418209db2c46c86167000845592ca6b98fad215c21c317bc494

Steps to repro the issue

Run the haystack tutorial in Google colab with T4 GPU runtime: https://haystack.deepset.ai/tutorials/01_basic_qa_pipeline

hauntsaninja commented 11 months ago

Thanks for the report! This is an implementation detail of tiktoken that is subject to change (as opposed to the public encoding_for_model API); can you link me to why haystack needs access to these?

cppxaxa commented 11 months ago

Thanks for the quick response. Unfortunately, I don't have any links on why haystack needs it in their library.

Another update: The haystack tutorial started to work again. Seems haystack is now applying some workarounds on the version of tiktoken. You can close the issue for now.

Colab logs: Collecting tiktoken<0.5.0,>=0.3.2 (from farm-haystack[colab,inference]) Downloading tiktoken-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)

The place where tiktoken is used haystack: https://github.com/deepset-ai/haystack/blob/94c5d6d216cfad5a06484e17614be72a7c2d5af1/haystack/utils/openai_utils.py#L10

hauntsaninja commented 11 months ago

Okay, I see. I should add an encoding_name_for_model, in case they're trying to do the lookup without instantiating encodings. I'll add this API later today and revert the constant names so cross compatibility is easier.

hauntsaninja commented 11 months ago

Fixed in 0.5.1, opened a PR against haystack at https://github.com/deepset-ai/haystack/pull/5785 , thanks for reporting!

ZanSara commented 11 months ago

Thank you @hauntsaninja for taking care of the issue so quickly! :raised_hands: