Closed: voghoei closed this issue 1 month ago.
There isn't enough information here to reproduce the error. Could you provide more details?
Same error. I tried running the official cookbook code (https://cookbook.openai.com/examples/how_to_count_tokens_with_tiktoken) in Visual Studio on Windows.
Error: " raise ValueError( ValueError: Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']"
What version of tiktoken did you use? How did you install it? etc
Model: gpt-4-1106-preview; installed using pip 23.3.1; Python 3.12.0; tiktoken 0.5.2
Same error.
Windows 10; Python 3.10.11; installed using pip 23.3.1; tiktoken 0.5.2
I have the same error.
macOS 14.3.1; pip 23.2.1; tiktoken 0.6.0; Python 3.11.5
Exact error:
Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']
Code that triggers it:
import logging
import tiktoken

def calculate_tokens(text, model_name):
    """Calculate the number of tokens for a given text and model."""
    try:
        encoding_name = tiktoken.encoding_for_model(model_name)
        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception as e:
        logging.error(f"Failed to retrieve 'cl100k_base' encoding: {e}")
        raise

gpt3_5_tokens = calculate_tokens("Hello World!", "gpt-3.5-turbo")
Hardcoding the encoding name to the string "cl100k_base" seems to work:

encoding = tiktoken.get_encoding("cl100k_base")
num_tokens = len(encoding.encode(text))
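For instance, a minimal sketch of this workaround applied to the calculate_tokens helper from the report above (the helper name is just illustrative, and it assumes "cl100k_base" is the right encoding for your model, which is the case for the gpt-3.5-turbo family):

import tiktoken

def calculate_tokens_hardcoded(text):
    # Skip encoding_for_model entirely and look the encoding up by
    # its string name, which is what get_encoding expects.
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

print(calculate_tokens_hardcoded("Hello World!"))  # prints the token count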
@hauntsaninja same here, getting an error with this. Version: tiktoken==0.6.0
The following code:
encoding_name = tiktoken.encoding_for_model("gpt-3.5")
encoding = tiktoken.get_encoding(encoding_name)
Returns:
ValueError: Unknown encoding <Encoding 'cl100k_base'>. Plugins found: ['tiktoken_ext.openai_public']
@satishsurath @winchester7 @voghoei @jatinmayekar @hauntsaninja
Actually, the following seems to be the solution (and the problem):

encoding_name = tiktoken.encoding_for_model("gpt-3.5")

This actually returns an Encoding object already. So, to get the name of the encoding you need to use encoding_name.name, but you don't need to do this, as encoding_name.encode will work directly.

encoding_name = tiktoken.encoding_for_model("gpt-3.5")
encoding = tiktoken.get_encoding(encoding_name.name)
assert encoding_name.encode("hi") == encoding.encode("hi")
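Applied to the calculate_tokens function from the original report, a minimal corrected sketch (keeping the reporter's signature, which is an assumption about what they need) would be:

import logging
import tiktoken

def calculate_tokens(text, model_name):
    """Calculate the number of tokens for a given text and model."""
    try:
        # encoding_for_model already returns an Encoding object,
        # so there is no need to call get_encoding afterwards.
        encoding = tiktoken.encoding_for_model(model_name)
        return len(encoding.encode(text))
    except Exception as e:
        logging.error(f"Failed to retrieve encoding for {model_name}: {e}")
        raise

gpt3_5_tokens = calculate_tokens("Hello World!", "gpt-3.5-turbo")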
Update on 22.05.2024, based on @KeshavSingh29 & @dsdanielpark: there is a difference between get_encoding and encoding_for_model. Therefore the following works:
encoder = tiktoken.encoding_for_model("gpt-3.5") #works
encoder2 = tiktoken.get_encoding("cl100k_base") #works
#and
assert encoder.encode("Hi") == encoder2.encode("Hi")
#and
assert encoder.name == "cl100k_base"
While the following DOES NOT work.
encoder = tiktoken.get_encoding("gpt-3.5") #does not work
encoder2 = tiktoken.encoding_for_model("cl100k_base") #does not work
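If you want a single helper that accepts either kind of name, one possible sketch (assuming a failed model lookup raises KeyError, as it does in current tiktoken versions) is:

import tiktoken

def resolve_encoding(name: str) -> tiktoken.Encoding:
    # Try the name as a model name first (e.g. "gpt-3.5-turbo"),
    # then fall back to treating it as an encoding name (e.g. "cl100k_base").
    try:
        return tiktoken.encoding_for_model(name)
    except KeyError:
        return tiktoken.get_encoding(name)

assert resolve_encoding("gpt-3.5-turbo").name == resolve_encoding("cl100k_base").name == "cl100k_base"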
I stumbled upon the same issue, and the solution posted by @ai-nikolai works!
encoding_for_model returns the tokenizer directly if you give it the name of the model whose tokenizer you want. If you instead need a tokenizer by its encoding name, use get_encoding.
Example:
tokenizer = tiktoken.encoding_for_model("gpt-4o")
# is the same as
tokenizer = tiktoken.get_encoding("o200k_base")
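A quick check that the two lookups really return the same tokenizer (assuming a tiktoken version recent enough to know gpt-4o):

import tiktoken

by_model = tiktoken.encoding_for_model("gpt-4o")
by_name = tiktoken.get_encoding("o200k_base")

# Same underlying encoding, so names and token ids agree.
assert by_model.name == by_name.name == "o200k_base"
assert by_model.encode("Hello World!") == by_name.encode("Hello World!")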
You have to distinguish encoding_for_model from get_encoding:

tokenizer = tiktoken.encoding_for_model("gpt-4o")  # works
tokenizer = tiktoken.encoding_for_model("o200k_base")  # does not work

tokenizer = tiktoken.get_encoding("o200k_base")  # works
tokenizer = tiktoken.get_encoding("gpt-4o")  # does not work
Does not work:

encoding_name = tiktoken.encoding_for_model("gpt-4o")  # returns the Encoding object
print(f'{type(encoding_name) = }')
encoding = tiktoken.get_encoding(encoding_name)
print(encoding)

Works:

encoding_name = tiktoken.encoding_for_model("gpt-4o").name  # returns str
print(f'{type(encoding_name) = }')
encoding = tiktoken.get_encoding(encoding_name)
print(encoding)
The first example doesn't work because encoding_for_model returns an object, not a string. The second example works because it extracts the string representation using the .name attribute, which get_encoding expects.
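The type mismatch is easy to see directly; a small illustrative check:

import tiktoken

obj = tiktoken.encoding_for_model("gpt-4o")
print(type(obj))       # an Encoding object, not a string
print(type(obj.name))  # str -- this is what get_encoding expects

encoding = tiktoken.get_encoding(obj.name)  # works because obj.name is a string
print(encoding)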
Whenever I run encoding_name = tiktoken.encoding_for_model("gpt-4o"), it returns the following error:

KeyError: 'Could not automatically map gpt-4o to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

And when I run encoder = tiktoken.get_encoding("gpt-4o"), I get:

ValueError: Unknown encoding gpt-4o. Plugins found: ['tiktoken_ext.openai_public']

I'm using tiktoken == 0.7.0
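If the installed tiktoken can't map the model name, the KeyError message itself points at the workaround. A minimal sketch that assumes gpt-4o should use o200k_base (upgrading tiktoken is the cleaner fix):

import tiktoken

try:
    encoder = tiktoken.encoding_for_model("gpt-4o")
except KeyError:
    # The local model-to-encoding table doesn't know gpt-4o;
    # ask for the expected encoding by name instead.
    encoder = tiktoken.get_encoding("o200k_base")

print(len(encoder.encode("Hello World!")))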
The .name workaround above solved my issue.
tiktoken 0.8 will have a better error message here:
Unknown encoding cl100k_base. Plugins found: ['tiktoken_ext.openai_public']