openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License

Update on the tiktoken tokenizer module? #192

Closed · HiThere0175 closed this issue 1 year ago

HiThere0175 commented 1 year ago

Hello everyone,

I've been using the tiktoken tokenizer module for tokenizing text for OpenAI's models. Recently, I encountered some issues with importing specific classes from the module. I wanted to check if there have been any recent updates or changes to the tiktoken tokenizer that I might not be aware of.

Specifically, I've had difficulty with the Tokenizer class. Has anyone else experienced this or knows if there's been a recent change to this part of the module? Any guidance would be greatly appreciated!

Thank you in advance!

hauntsaninja commented 1 year ago

I released version 0.5.0 of tiktoken yesterday. You can see the changelog here: https://github.com/openai/tiktoken/blob/main/CHANGELOG.md#v050

Note, however, that tiktoken does not have, and never has had, a class named `Tokenizer`. Could you be more specific about the issues you're seeing?
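
For reference, the supported entry points look like this (a minimal sketch; see the README for the full API):

```python
import tiktoken

# Load an encoding by name, or look one up by model name
enc = tiktoken.get_encoding("cl100k_base")
# enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

tokens = enc.encode("tiktoken is great!")
print(tokens)              # list of integer token ids
print(len(tokens))         # the token count
print(enc.decode(tokens))  # round-trips back to the original string
```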

HiThere0175 commented 1 year ago

I'm trying to get a token count along the lines of this StackOverflow question: https://stackoverflow.com/questions/75804599/openai-api-how-do-i-count-tokens-before-i-send-an-api-request

```python
import sys
import openai
from tiktoken import Tokenizer

# Set your OpenAI API key here
openai.api_key = 'YOUR_API_KEY'

# Define the maximum token limit for the model you are using
max_token_limit = 4096  # Adjust as per your model's limit

def count_tokens(text):
    tokenizer = Tokenizer()
    tokens = tokenizer.count_tokens(text)
    return tokens

def chunk_text(text, max_chunk_size):
    chunks = []
    while len(text) > max_chunk_size:
        chunk, text = text[:max_chunk_size], text[max_chunk_size:]
        chunks.append(chunk)
    if text:
        chunks.append(text)
    return chunks

def check_token_limit(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        file_content = file.read()

    tokens_count = count_tokens(file_content)

    if tokens_count > max_token_limit:
        print(f"Warning: This text exceeds the maximum token limit of {max_token_limit}.")
        print(f"Total tokens in the file: {tokens_count}")
        print("Splitting the file into smaller parts:")

        chunks = chunk_text(file_content, max_token_limit)

        for i, chunk in enumerate(chunks):
            chunk_file_name = f"chunk_{i + 1}.txt"
            with open(chunk_file_name, 'w', encoding='utf-8') as chunk_file:
                chunk_file.write(chunk)
                print(f"Chunk {i + 1}: {chunk_file_name} (Tokens: {count_tokens(chunk)})")

        print("Consider sending these chunks individually to stay within the token limit.")
    else:
        print(f"Total tokens in the file: {tokens_count}")

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print("Usage: python token_checker.py <file_path>")
        sys.exit(1)

    file_path = sys.argv[1]
    check_token_limit(file_path)
```

hauntsaninja commented 1 year ago

This code literally never worked; `from tiktoken import Tokenizer` was never a valid import. There is correct example code in the StackOverflow question you linked. I also recommend taking a look at the recipes in https://github.com/openai/openai-cookbook
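
As a sketch of what the corrected helpers could look like with tiktoken's actual API (assuming a gpt-3.5-turbo-style model; swap in whichever model you use):

```python
import tiktoken

# tiktoken has no Tokenizer class; you work with an Encoding object.
# encoding_for_model() returns the encoding a given model uses.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

def count_tokens(text: str) -> int:
    # encode() returns a list of token ids; its length is the token count
    return len(enc.encode(text))

def chunk_text(text: str, max_tokens: int) -> list[str]:
    # Chunk on token boundaries (not characters) so each chunk is
    # guaranteed to fit within max_tokens. Note that decoding an
    # arbitrary token slice can split a multi-byte character at the edges.
    tokens = enc.encode(text)
    return [enc.decode(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```

The rest of the script (file reading, argv handling) can stay as it is.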