openai / tiktoken

tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MIT License
11.61k stars 785 forks source link

Tiktoken's number of tokens does not match the number of tokens perceived by the gpt-3.5-turbo API #196

Closed viswavi closed 8 months ago

viswavi commented 11 months ago

We have a long prompt (shown at the bottom of this message). Tiktoken (with the encoding name cl100k_base) says this prompt has 2554 tokens, but OpenAI's API says that this prompt contains 2561 tokens. This mismatch is causing issues for our application.

We are using Tiktoken like this:

encoding_name = "cl100k_base"
encoding = tiktoken.get_encoding(encoding_name)
num_tokens = len(encoding.encode(string))

To say there are 2554 tokens.

Prompt:


As a PromptParser, your objective is to carefully analyze prompts and divide them into two distinct components: an 'Instruction' that provides the primary description of the task, and 'Demonstrations' which are optional examples showcasing the task. Your aim is to generate a JSON dictionary response containing the `Instruction` and `Demonstrations` fields, corresponding to these two components. In case there are no demonstrations provided, the 'Demonstrations' field should be marked as 'N/A'. When including demonstrations, only consider complete examples that consist of both input and output pairs, disregarding any incomplete ones. It is crucial to maintain the precise formatting, word choice, and punctuation exactly as presented in the original prompt. Here are some parsed output you can refer to.

------

Prompt: """
I am trying to cluster entity strings on Wikipedia according to the Wikipedia article title they refer to. To help me with this, for a given entity name, please provide me with a comprehensive set of alternative names that could refer to the same entity. Entities may be weirdly truncated or ambiguous - e.g. "Wind" may refer to the band "Earth, Wind, and Fire" or to "rescue service". For each entity, I will provide you with a sentence where this entity is used to help you understand what this entity refers to. Generate a comprehensive set of alternate entity names as a JSON-formatted list.

Entity: "fictional character"
Context Sentence: "Jenna Marshall is a fictional character created by Sara Shepard for the `` Pretty Little Liars '' book series , and later developed for the Freeform television series adaptation by I. Marlene King and portrayed by Tammin Sursok ."
Alternate Entity Names: ["fictional characters", "characters", "character"]

Entity: "Catholicism"
Context Sentence: "At home , significantly more electorate residents spoke Italian , Cantonese , Mandarin and Greek at home , and whilst the top three religions (Catholicism , no religion and Anglicanism) differed little from other parts of Perth , Buddhism and Eastern Orthodox adherents outnumbered those of the Uniting Church ."
Alternate Entity Names: ["Catholic Church", "Roman Catholic", "Catholic"]

Entity: "Wind"
Context Sentence: "Illinois musicians with a # 1 Billboard Hot 100 hit include artists from the 1950s : Sam Cooke (d. 1964) ; from the 1960s : The Buckinghams ; from the 1970s : Earth , Wind & Fire , The Chi-Lites , The Staple Singers , Minnie Riperton , Styx ; from the 1980s : Chicago , Cheap Trick , REO Speedwagon , Survivor , Richard Marx ; from the 1990s : R. Kelly ; from the 2000s : Kanye West , Twista , Plain White T 's ."

"""

Parsed Outputs:
{"Instruction": "I am trying to cluster entity strings on Wikipedia according to the Wikipedia article title they refer to. To help me with this, for a given entity name, please provide me with a comprehensive set of alternative names that could refer to the same entity. Entities may be weirdly truncated or ambiguous - e.g. \"Wind\" may refer to the band \"Earth, Wind, and Fire\" or to \"rescue service\". For each entity, I will provide you with a sentence where this entity is used to help you understand what this entity refers to. Generate a comprehensive set of alternate entity names as a JSON-formatted list.", "Demonstrations": "Entity: \"fictional character\"\nContext Sentence: \"Jenna Marshall is a fictional character created by Sara Shepard for the `` Pretty Little Liars '' book series , and later developed for the Freeform television series adaptation by I. Marlene King and portrayed by Tammin Sursok .\"\nAlternate Entity Names: [\"fictional characters\", \"characters\", \"character\"]\n\nEntity: \"Catholicism\"\nContext Sentence: \"At home , significantly more electorate residents spoke Italian , Cantonese , Mandarin and Greek at home , and whilst the top three religions (Catholicism , no religion and Anglicanism) differed little from other parts of Perth , Buddhism and Eastern Orthodox adherents outnumbered those of the Uniting Church .\"\nAlternate Entity Names: [\"Catholic Church\", \"Roman Catholic\", \"Catholic\"]"}

------

Prompt: """
You are an expert baker answering users' questions. Reply as agent.

Example conversation:

User: Hey can you help me with something

Agent: Sure! What do you need help with?

User: I want to bake a cake but don't know what temperature to set the oven to.

Agent: For most cakes, the oven should be preheated to 350°F (177°C).

Current conversation:

User: [Insert user's question]

Agent:
"""

Parsed Outputs:
{"Instruction": "You are an expert baker answering users' questions. Reply as agent.", "Demonstrations": "User: Hey can you help me with something\n\nAgent: Sure! What do you need help with?\n\nUser: I want to bake a cake but don't know what temperature to set the oven to.\n\nAgent: For most cakes, the oven should be preheated to 350°F (177°C)."}

------

Prompt: """
You are given a list of integers. A list is shown by comma-separated numbers between two brackets. For example, [7,3,6] is a list. The number in location one is 7, the number in location two is 3, and the number in location three is 6. You should answer with a list such that every element at each location is equal to the product of elements at every other location in the input array.
"""

Parsed Outputs:
{"Instruction": "You are given a list of integers. A list is shown by comma-separated numbers between two brackets. For example, [7,3,6] is a list. The number in location one is 7, the number in location two is 3, and the number in location three is 6. You should answer with a list such that every element at each location is equal to the product of elements at every other location in the input array.", "Demonstrations": "N/A"}

------

Prompt: """
I am learning Japanese. Please translate some Japanese sentences to English. For example, Japanese: その日、人類は思い出した。ヤツらに支配されていた恐怖を鳥籠の中に囚われていた屈辱を English: On that day, humanity remembered the fear of being dominated by them and the humiliation of being trapped in a birdcage.
"""

Parsed Outputs:
{"Instruction": "I am learning Japanese. Please translate some Japanese sentences to English.", "Demonstrations": "Japanese: その日、人類は思い出した。ヤツらに支配されていた恐怖を鳥籠の中に囚われていた屈辱を English: On that day, humanity remembered the fear of being dominated by them and the humiliation of being trapped in a birdcage."}

------

Prompt: """
来到美国后,我需要学习如何自己做饭。你能告诉我一些菜需要准备的原料么?这里有一些例子:1. 菜名:西红柿炒蛋。原料:2. 菜名:青椒肉丝炒肉。原料:瘦肉、青椒、调味料(如大蒜、姜、料酒、生抽、盐、糖、鸡精或味精、胡椒粉)、植物油。
"""

Parsed Outputs:
{"Instruction": "来到美国后,我需要学习如何自己做饭。你能告诉我一些菜需要准备的原料么?", "Demonstrations": "2. 菜名:青椒肉丝炒肉。原料:瘦肉、青椒、调味料(如大蒜、姜、料酒、生抽、盐、糖、鸡精或味精、胡椒粉)、植物油。"}

------

Prompt: """
As a programer, I am learning software development. Here are some of my problems. Input: What is CI/CD? Output: CI/CD is a way to automate and speed up software development by continuously integrating code changes and deploying them quickly and reliably. Input: What is Git? Output:
"""

Parsed Outputs:
{"Instruction": "As a programer, I am learning software development. Here are some of my problems.", "Demonstrations": " Input: What is CI/CD? Output: CI/CD is a way to automate and speed up software development by continuously integrating code changes and deploying them quickly and reliably."}

------

After seeing these parsed output, please parse this prompt:

Prompt: """
Your task is to generate an answer to a natural question. In this task, the input is a string that consists of both a question and a context passage. The context is a descriptive passage related to the question and contains the answer. And the question can range from Math, Cultural, Social, Geometry, Biology, History, Sports, Technology, Science, and so on.

Here are examples with input questions and context passages, along with their expected outputs:

input="Question: What city did Super Bowl 50 take place in? Context: Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50."
output="Santa Clara"

input="Question: What river runs through Warsaw? Context: Warsaw (Polish: Warszawa [varˈʂava] ( listen); see also other names) is the capital and largest city of Poland. It stands on the Vistula River in east-central Poland, roughly 260 kilometres (160 mi) from the Baltic Sea and 300 kilometres (190 mi) from the Carpathian Mountains. Its population is estimated at 1.740 million residents within a greater metropolitan area of 2.666 million residents, which makes Warsaw the 9th most-populous capital city in the European Union. The city limits cover 516.9 square kilometres (199.6 sq mi), while the metropolitan area covers 6,100.43 square kilometres (2,355.39 sq mi)."
output="Vistula River"

input="Question: The Ottoman empire controlled territory on three continents, Africa, Asia and which other? Context: The Ottoman Empire was an imperial state that lasted from 1299 to 1923. During the 16th and 17th centuries, in particular at the height of its power under the reign of Suleiman the Magnificent, the Ottoman Empire was a powerful multinational, multilingual empire controlling much of Southeast Europe, Western Asia, the Caucasus, North Africa, and the Horn of Africa. At the beginning of the 17th century the empire contained 32 provinces and numerous vassal states. Some of these were later absorbed into the empire, while others were granted various types of autonomy during the course of centuries."
output="Europe"

"""

Parsed Outputs:
hauntsaninja commented 8 months ago

If you're using the chat API are you accounting for the chat messages? E.g. see the discussion in Section 6 of https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

If you have further issues, please contact support@openai.com