openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Add support for fine-tuned models in encoding_for_model #135

Open thespino opened 1 year ago

thespino commented 1 year ago

Issue

When trying to call encoding_for_model providing a fine-tuned model as input, the following error occurs:

KeyError: 'Could not automatically map davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
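As the error message suggests, an explicit lookup works today. A minimal workaround sketch, assuming fine-tuned names always embed the base model before the first `:` (the helper name here is hypothetical, not part of tiktoken):

```python
# Workaround sketch: recover the base model from a fine-tuned model name.
# Assumes the "model:ft-personal:name-date" naming scheme described below.

def base_model_name(model_name: str) -> str:
    """Return the base-model portion of a (possibly fine-tuned) model name."""
    return model_name.split(":", 1)[0]

print(base_model_name("davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00"))
# prints "davinci"
```

With the base name in hand, one can call `tiktoken.encoding_for_model(base_model_name(...))`, or simply `tiktoken.get_encoding("r50k_base")` directly.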

Analysis

See https://platform.openai.com/docs/models/model-endpoint-compatibility
See https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

The following models are allowed for fine-tuning: `ada`, `babbage`, `curie`, and `davinci`.

All of them use the `r50k_base` encoding.

Fine-tuned model names always follow this format: `model:ft-personal:name:date`, where `model` is the base model that was fine-tuned, `name` is the custom model name, and `date` is the creation date.

Solutions

Map the model prefixes in `MODEL_PREFIX_TO_ENCODING` so that, when `encoding_for_model` calls `model_name.startswith`, it can also match all models starting with "davinci", "ada", etc., and thereby recognise fine-tuned models.
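The proposed prefix matching can be sketched as follows. This is an illustrative standalone version, not tiktoken's actual internals; the dictionary entries and the resolver function are assumptions for demonstration:

```python
# Sketch of the proposed fix: map base-model prefixes to encoding names so
# that fine-tuned names like "davinci:ft-personal:..." resolve via startswith.
# (Hypothetical entries; tiktoken's real MODEL_PREFIX_TO_ENCODING may differ.)

MODEL_PREFIX_TO_ENCODING = {
    "davinci:ft-": "r50k_base",
    "curie:ft-": "r50k_base",
    "babbage:ft-": "r50k_base",
    "ada:ft-": "r50k_base",
}

def encoding_name_for_model(model_name: str) -> str:
    """Resolve a model name (including fine-tuned names) to an encoding name."""
    for prefix, encoding_name in MODEL_PREFIX_TO_ENCODING.items():
        if model_name.startswith(prefix):
            return encoding_name
    raise KeyError(f"Could not automatically map {model_name} to a tokeniser.")

print(encoding_name_for_model("davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00"))
# prints "r50k_base"
```

Matching on the `"model:ft-"` prefix rather than the bare base-model name avoids accidentally shadowing exact-match entries for the base models themselves.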

byrnehollander commented 1 year ago

Thanks for opening this PR @thespino – I've also been running into this issue and am eager to have this released

cc @hauntsaninja

thespino commented 1 year ago

Rebased & synced with main branch