openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Add support for fine-tuned models in encoding_for_model #135

Open thespino opened 1 year ago

thespino commented 1 year ago

Issue

When trying to call encoding_for_model providing a fine-tuned model as input, the following error occurs:

KeyError: 'Could not automatically map davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
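As the error message suggests, an explicit lookup works today. A minimal workaround sketch, assuming fine-tuned names always embed the base model before the first `:` (the helper name here is hypothetical, not part of tiktoken):

```python
# Workaround sketch: recover the base model from a fine-tuned model name.
# Assumes the "model:ft-personal:name-date" naming scheme described below.

def base_model_name(model_name: str) -> str:
    """Return the base-model portion of a (possibly fine-tuned) model name."""
    return model_name.split(":", 1)[0]

print(base_model_name("davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00"))
# prints "davinci"
```

With the base name in hand, one can call `tiktoken.encoding_for_model(base_model_name(...))`, or simply `tiktoken.get_encoding("r50k_base")` directly.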

Analysis

See https://platform.openai.com/docs/models/model-endpoint-compatibility
See https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

The following models are allowed for fine-tuning: `ada`, `babbage`, `curie`, and `davinci`.

All of them use the `r50k_base` encoding.

Fine-tuned model names always follow this format: `model:ft-personal:name:date`, where `model` is the base model that was fine-tuned, `name` is the custom model name, and `date` is the creation date.

Solutions

Map the model prefixes in `MODEL_PREFIX_TO_ENCODING` so that, when `encoding_for_model` calls `model_name.startswith`, it can also match all models starting with "davinci", "ada", etc., and thereby recognise fine-tuned models.
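The proposed prefix matching can be sketched as follows. This is an illustrative standalone version, not tiktoken's actual internals; the dictionary entries and the resolver function are assumptions for demonstration:

```python
# Sketch of the proposed fix: map base-model prefixes to encoding names so
# that fine-tuned names like "davinci:ft-personal:..." resolve via startswith.
# (Hypothetical entries; tiktoken's real MODEL_PREFIX_TO_ENCODING may differ.)

MODEL_PREFIX_TO_ENCODING = {
    "davinci:ft-": "r50k_base",
    "curie:ft-": "r50k_base",
    "babbage:ft-": "r50k_base",
    "ada:ft-": "r50k_base",
}

def encoding_name_for_model(model_name: str) -> str:
    """Resolve a model name (including fine-tuned names) to an encoding name."""
    for prefix, encoding_name in MODEL_PREFIX_TO_ENCODING.items():
        if model_name.startswith(prefix):
            return encoding_name
    raise KeyError(f"Could not automatically map {model_name} to a tokeniser.")

print(encoding_name_for_model("davinci:ft-personal:finetunedmodel-2023-05-23-20-00-00"))
# prints "r50k_base"
```

Matching on the `"model:ft-"` prefix rather than the bare base-model name avoids accidentally shadowing exact-match entries for the base models themselves.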

byrnehollander commented 1 year ago

Thanks for opening this PR @thespino – I've also been running into this issue and am eager to have this released

cc @hauntsaninja

thespino commented 1 year ago

Rebased & synced with main branch