BPE is a "byte" pair encoder. 94 in cl100k is how we represent the single byte \xa2. Fundamentally, our transformers are trained on bytes, and we want all possible sequences of bytes to be representable.
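For illustration, here's a minimal sketch using tiktoken (assuming the cl100k_base encoding is available) showing that token 94 maps to a single raw byte rather than a complete UTF-8 character:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Token 94 corresponds to the single byte \xa2, which is a UTF-8
# continuation byte and therefore not decodable on its own.
raw = enc.decode_single_token_bytes(94)
print(raw)                                    # b'\xa2'
print(raw.decode("utf-8", errors="replace"))  # '\ufffd' replacement character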
Note of course that non-decodable tokens can be concatenated to form valid UTF-8. For example, you mentioned 57923, which will appear in enc.encode(' 馬關')
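A quick way to check this (again a sketch, assuming tiktoken and cl100k_base):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode(' 馬關')
print(tokens)                    # 57923 appears among these token IDs
print(57923 in tokens)           # True

# Individually the tokens may be incomplete UTF-8 fragments,
# but their concatenated bytes decode cleanly.
print(enc.decode_bytes(tokens))  # b' \xe9\xa6\xac\xe9\x97\x9c'
print(enc.decode(tokens))        # ' 馬關'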
As for why your logit_bias experiment refuses to sample, I'm not sure. Seems like a bug to me.
Thanks for the detailed reply, @hauntsaninja.
Here's the code I'm using to sample:
import openai

openai.api_key = "Valid API Key"  # supply your API key however you choose

conversation = [{
    "role": "user",
    "content": input("Input:"),
}]

# Force token 57923 by giving it the maximum logit bias.
bias = {57923: 100}

print("GPT: ")
reply = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=conversation,
    max_tokens=1000,
    logit_bias=bias,
)
print(reply)
There are significant numbers of non-decodable tokens in the cl100k_base BPE. These tokens don't decode back to strings using the UTF-8 encoding.
Neither do the models gpt-3.5-turbo or gpt-4 generate these tokens, nor can they be sent to the API, because it only accepts UTF-8 encoded text. If the model is forced to generate one of these tokens by setting logit_bias to 100 for that token, the completion comes back empty. Note how the finish_reason is length even though max_tokens for this request was set to 1000. What's the utility of having these tokens in the BPE?
Take these tokens for example:
[94, 95, 57923]
The above list is just a tiny fraction of the non-decodable tokens present in the BPE.
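For reference, here's a small sketch (assuming tiktoken is installed, using cl100k_base) that enumerates the token IDs whose raw bytes are not valid UTF-8 on their own:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Collect every token whose raw bytes do not form valid UTF-8 by themselves.
non_decodable = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        # Some IDs in the range may be unassigned; skip them.
        continue
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        non_decodable.append(token_id)

print(len(non_decodable))
print(non_decodable[:10])  # tokens such as 94 and 95 show up near the start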