BPE is a "byte" pair encoder. 94 in cl100k is how we represent the single byte \xa2. Fundamentally, our transformers are trained on bytes, and we want all possible sequences of bytes to be representable.
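For illustration, here's a minimal sketch using tiktoken (assuming the cl100k_base encoding is available) showing that token 94 maps to a single raw byte rather than a complete UTF-8 character:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Token 94 corresponds to the single byte \xa2, which is a UTF-8
# continuation byte and therefore not decodable on its own.
raw = enc.decode_single_token_bytes(94)
print(raw)                                    # b'\xa2'
print(raw.decode("utf-8", errors="replace"))  # '\ufffd' replacement character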
Note of course that non-decodable tokens can be concatenated to form valid UTF-8. For example, you mentioned 57923, which will appear in enc.encode(' 馬關')
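A quick way to check this (again a sketch, assuming tiktoken and cl100k_base):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode(' 馬關')
print(tokens)                    # 57923 appears among these token IDs
print(57923 in tokens)           # True

# Individually the tokens may be incomplete UTF-8 fragments,
# but their concatenated bytes decode cleanly.
print(enc.decode_bytes(tokens))  # b' \xe9\xa6\xac\xe9\x97\x9c'
print(enc.decode(tokens))        # ' 馬關'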
As for why your logit_bias experiment refuses to sample, I'm not sure. Seems like a bug to me.
Thanks for the detailed reply, @hauntsaninja.
Here's the code I'm using to sample:
import openai

openai.api_key = "Valid API Key"  # supply your API key however you choose

conversation = [{
    "role": "user",
    "content": input("Input:"),
}]

# Force token 57923 by giving it the maximum logit bias.
bias = {57923: 100}

print("GPT: ")
reply = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=conversation,
    max_tokens=1000,
    logit_bias=bias,
)
print(reply)
There are significant numbers of non-decodable tokens in the cl100k_base BPE. These tokens don't decode back to strings using the UTF-8 encoding.
Neither do the models gpt-3.5-turbo or gpt-4 generate these tokens, nor can they be sent to the API, because it only accepts UTF-8 encoded text. If the model is forced to generate one of these tokens by setting logit_bias to 100 for that token, the completion comes back empty. Note how the finish_reason is length even though max_tokens for this request was set to 1000. What's the utility of having these tokens in the BPE?
Take these tokens for example:
[94, 95, 57923]
The above list is just a tiny fraction of the non-decodable tokens present in the BPE.
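For reference, here's a small sketch (assuming tiktoken is installed, using cl100k_base) that enumerates the token IDs whose raw bytes are not valid UTF-8 on their own:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Collect every token whose raw bytes do not form valid UTF-8 by themselves.
non_decodable = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
    except KeyError:
        # Some IDs in the range may be unassigned; skip them.
        continue
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        non_decodable.append(token_id)

print(len(non_decodable))
print(non_decodable[:10])  # tokens such as 94 and 95 show up near the start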