openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Plugins found: ['tiktoken_ext.openai_public'] #218

Open voghoei opened 7 months ago

voghoei commented 7 months ago

Unknown encoding cl100k_base. Plugins found: ['tiktoken_ext.openai_public']

hauntsaninja commented 7 months ago

There isn't enough information here to reproduce the error. Could you provide more details?

jatinmayekar commented 6 months ago

Same error. I tried running the official example code (https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) in Visual Studio on Windows.

Error: " raise ValueError( ValueError: Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']"

hauntsaninja commented 6 months ago

What version of tiktoken did you use? How did you install it? etc

jatinmayekar commented 6 months ago

Model: gpt-4-1106-preview, installed using pip 23.3.1, Python 3.12.0, tiktoken 0.5.2

winchester7 commented 6 months ago

Same error.

Windows 10, Python 3.10.11, installed using pip 23.3.1, tiktoken 0.5.2

satishsurath commented 4 months ago

I have the same error.

macOS 14.3.1, pip 23.2.1, tiktoken 0.6.0, Python 3.11.5

Exact error:

Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']

Code that triggers it:


import logging
import tiktoken

def calculate_tokens(text, model_name):
    """Calculate the number of tokens for a given text and model."""
    try:
      encoding_name = tiktoken.encoding_for_model(model_name)
      encoding = tiktoken.get_encoding(encoding_name)
      return len(encoding.encode(text))
    except Exception as e:
      logging.error(f"Failed to retrieve 'cl100k_base' encoding: {e}")
      raise

gpt3_5_tokens = calculate_tokens("Hello World!", "gpt-3.5-turbo")

satishsurath commented 4 months ago
It seems hardcoding the encoding_name to the string "cl100k_base" works:

encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = len(encoding.encode(text))

ai-nikolai commented 1 month ago

@hauntsaninja same here, getting this error with tiktoken==0.6.0.

The following code:

encoding_name = tiktoken.encoding_for_model("gpt-3.5")
encoding = tiktoken.get_encoding(encoding_name)

Returns:

ValueError: Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']

ai-nikolai commented 1 month ago

@satishsurath @winchester7 @voghoei @jatinmayekar @hauntsaninja

Actually the following seems to be the solution (and problem):

encoding_name = tiktoken.encoding_for_model("gpt-3.5")

actually returns an Encoding object already, not a name string.

So to get the name of the encoding you need encoding_name.name; but you don't actually need the name at all, since encoding_name.encode will work directly:

encoding_name = tiktoken.encoding_for_model("gpt-3.5")
encoding = tiktoken.get_encoding(encoding_name.name)

assert encoding_name.encode("hi") == encoding.encode("hi")

Update on 22.05.2024 (based on @KeshavSingh29 & @dsdanielpark): there is a difference between get_encoding & encoding_for_model.

Therefore the following works:

encoder = tiktoken.encoding_for_model("gpt-3.5") #works
encoder2 = tiktoken.get_encoding("cl100k_base") #works
#and 
assert encoder.encode("Hi") == encoder2.encode("Hi")
#and 
assert encoder.name == "cl100k_base"

While the following DOES NOT work:

encoder = tiktoken.get_encoding("gpt-3.5") #does not work
encoder2 = tiktoken.encoding_for_model("cl100k_base") #does not work

KeshavSingh29 commented 1 month ago

I stumbled upon the same issue, and the solution posted by @ai-nikolai works! encoding_for_model returns the tokenizer directly when you give it the name of the model whose tokenizer you want. If you need a tokenizer of a specific type instead, use get_encoding. Example:

tokenizer = tiktoken.encoding_for_model("gpt-4o")
# is the same as 
tokenizer = tiktoken.get_encoding("o200k_base")

dsdanielpark commented 1 month ago

You have to distinguish encoding_for_model from get_encoding:

tokenizer = tiktoken.encoding_for_model("gpt-4o")  # works

tokenizer = tiktoken.encoding_for_model("o200k_base")  # does not work
tokenizer = tiktoken.get_encoding("o200k_base")  # works

tokenizer = tiktoken.get_encoding("gpt-4o")  # does not work

v-kam commented 3 weeks ago

Does not work:

encoding_name = tiktoken.encoding_for_model("gpt-4o")  # returns the Encoding object
print(f'{type(encoding_name) = }')
encoding = tiktoken.get_encoding(encoding_name)
print(encoding)

Works:

encoding_name = tiktoken.encoding_for_model("gpt-4o").name  # returns str 
print(f'{type(encoding_name) = }')
encoding = tiktoken.get_encoding(encoding_name)
print(encoding)

The first example doesn't work because encoding_for_model returns an Encoding object, not a string. The second example works because it extracts the encoding's name string via the name attribute, which is what get_encoding expects.

DanielDaCosta commented 1 week ago

Whenever I run encoding_name = tiktoken.encoding_for_model("gpt-4o")

It returns the following error:

KeyError: 'Could not automatically map gpt-4o to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

And when I run: encoder = tiktoken.get_encoding("gpt-4o")

ValueError: Unknown encoding gpt-4o. Plugins found: ['tiktoken_ext.openai_public']

I'm using tiktoken == 0.7.0