openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Plugins found: ['tiktoken_ext.openai_public'] #218

Closed: voghoei closed this issue 1 month ago

voghoei commented 1 year ago

Unknown encoding cl100k_base. Plugins found: ['tiktoken_ext.openai_public']

hauntsaninja commented 11 months ago

There isn't enough information here to reproduce the error. Could you provide more details?

jatinmayekar commented 11 months ago

Same error. I tried running the official code (https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) in Visual Studio on Windows.

Error: "ValueError: Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']"
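
For context, the cookbook pattern being exercised there is roughly the following (a paraphrase of the linked example, not its exact code):

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens the named encoding produces for a string."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

print(num_tokens_from_string("Hello World!", "cl100k_base"))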

hauntsaninja commented 11 months ago

What version of tiktoken did you use? How did you install it? etc

jatinmayekar commented 11 months ago

Model: gpt-4-1106-preview, installed using pip 23.3.1, Python 3.12.0, tiktoken 0.5.2

winchester7 commented 11 months ago

Same error.

Windows 10, Python 3.10.11, installed using pip 23.3.1, tiktoken 0.5.2

satishsurath commented 9 months ago

I have the same error.

macOS 14.3.1, pip 23.2.1, tiktoken 0.6.0, Python 3.11.5

Exact error:

Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']

Code that triggers it:

import logging
import tiktoken

def calculate_tokens(text, model_name):
    """Calculate the number of tokens for a given text and model."""
    try:
        encoding_name = tiktoken.encoding_for_model(model_name)
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception as e:
        logging.error(f"Failed to retrieve 'cl100k_base' encoding: {e}")
        raise

gpt3_5_tokens = calculate_tokens("Hello World!", "gpt-3.5-turbo")

satishsurath commented 9 months ago

It seems hardcoding the encoding name to the string "cl100k_base" works:

encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = len(encoding.encode(text))
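
For reference, a corrected version of the calculate_tokens helper above (a sketch carrying over the snippet's own names and logging) simply uses the Encoding object that encoding_for_model returns, instead of passing it back into get_encoding:

import logging
import tiktoken

def calculate_tokens(text, model_name):
    """Calculate the number of tokens for a given text and model."""
    try:
        # encoding_for_model already returns an Encoding object, not a name.
        encoding = tiktoken.encoding_for_model(model_name)
        return len(encoding.encode(text))
    except Exception as e:
        logging.error(f"Failed to retrieve encoding for {model_name!r}: {e}")
        raise

gpt3_5_tokens = calculate_tokens("Hello World!", "gpt-3.5-turbo")
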
ai-nikolai commented 6 months ago

@hauntsaninja same here. Getting an error with this. Code version: tiktoken==0.6.0

The following code:

encoding_name = tiktoken.encoding_for_model("gpt-3.5")
encoding = tiktoken.get_encoding(encoding_name)

Returns:

ValueError: Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']

ai-nikolai commented 6 months ago

@satishsurath @winchester7 @voghoei @jatinmayekar @hauntsaninja

Actually the following seems to be the solution (and problem):

encoding_name = tiktoken.encoding_for_model("gpt-3.5")

It actually already returns an Encoding object (despite the variable name).

So, to get the name of the encoding you need encoding_name.name, but you usually don't need it at all, since encoding_name.encode works directly:

encoding_name = tiktoken.encoding_for_model("gpt-3.5")
encoding = tiktoken.get_encoding(encoding_name.name)

assert encoding_name.encode("hi") == encoding.encode("hi")

Update on 22.05.2024 (based on @KeshavSingh29 and @dsdanielpark): there is a difference between get_encoding and encoding_for_model.

Therefore the following works:

encoder = tiktoken.encoding_for_model("gpt-3.5") #works
encoder2 = tiktoken.get_encoding("cl100k_base") #works
#and 
assert encoder.encode("Hi") == encoder2.encode("Hi")
#and 
assert encoder.name == "cl100k_base"

While the following DOES NOT work.

encoder = tiktoken.get_encoding("gpt-3.5") #does not work
encoder2 = tiktoken.encoding_for_model("cl100k_base") #does not work
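
To summarise the distinction, here is a small helper (a sketch; the name get_tokenizer is made up for illustration) that accepts either a model name or an encoding name and always returns an Encoding:

import tiktoken

def get_tokenizer(name: str) -> tiktoken.Encoding:
    """Resolve either a model name ("gpt-3.5-turbo") or an encoding name ("cl100k_base")."""
    try:
        # Model names go through the model-to-encoding mapping.
        return tiktoken.encoding_for_model(name)
    except KeyError:
        # Fall back to treating the argument as an encoding name.
        return tiktoken.get_encoding(name)

assert get_tokenizer("gpt-3.5-turbo").name == get_tokenizer("cl100k_base").name

If neither lookup succeeds, the underlying KeyError or ValueError from tiktoken propagates unchanged.
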
KeshavSingh29 commented 6 months ago

I stumbled upon the same issue, and the solution posted by @ai-nikolai works! encoding_for_model returns the tokenizer directly when you give it the name of the model whose tokenizer you want. If you need a tokenizer of a specific type instead, use get_encoding. Example:

tokenizer = tiktoken.encoding_for_model("gpt-4o")
# is the same as 
tokenizer = tiktoken.get_encoding("o200k_base")

dsdanielpark commented 6 months ago

You have to distinguish encoding_for_model from get_encoding:

tokenizer = tiktoken.encoding_for_model("gpt-4o")  # Work 

tokenizer = tiktoken.encoding_for_model("o200k_base") # Not work
tokenizer = tiktoken.get_encoding("o200k_base") # Work

tokenizer = tiktoken.get_encoding("gpt-4o") # Not work

v-kam commented 5 months ago

Does not work:

encoding_name = tiktoken.encoding_for_model("gpt-4o")  # returns the Encoding object
print(f'{type(encoding_name) = }')
encoding = tiktoken.get_encoding(encoding_name)
print(encoding)

Works:

encoding_name = tiktoken.encoding_for_model("gpt-4o").name  # returns str 
print(f'{type(encoding_name) = }')
encoding = tiktoken.get_encoding(encoding_name)
print(encoding)

The first example doesn't work because encoding_for_model returns an Encoding object, not a string. The second works because it retrieves the encoding's name (a string) via the name attribute, which is what get_encoding expects.

DanielDaCosta commented 5 months ago

Whenever I run encoding_name = tiktoken.encoding_for_model("gpt-4o")

It returns the following error:

KeyError: 'Could not automatically map gpt-4o to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

And when I run: encoder = tiktoken.get_encoding("gpt-4o")

ValueError: Unknown encoding gpt-4o. Plugins found: ['tiktoken_ext.openai_public']

I'm using tiktoken == 0.7.0
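
The KeyError message itself points at the workaround: when encoding_for_model cannot map a model name (for example, on an older tiktoken that predates the model), you can request the encoding explicitly. A minimal sketch, assuming gpt-4o should map to o200k_base as discussed above (which needs tiktoken 0.7.0+):

import tiktoken

try:
    encoding = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    # Fall back to asking for the model's encoding by name.
    encoding = tiktoken.get_encoding("o200k_base")

print(len(encoding.encode("Hello World!")))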

KushtrimVisoka commented 4 months ago

(quoting @v-kam's example above)

This solved my issue.

hauntsaninja commented 1 month ago

tiktoken 0.8 will have a better error message here