lm-sys / arena-hard-auto

Arena-Hard-Auto: An automatic LLM benchmark.
Apache License 2.0

Fix corner-case in token length calculation when the model generates tiktoken special tokens like `<|endoftext|>` #28

Closed sxjscience closed 3 weeks ago

sxjscience commented 3 weeks ago

Arena-Hard uses gpt-3.5-turbo's tokenizer to measure the number of tokens in a response. However, the current implementation raises an error when the model generates `<|endoftext|>`.

The following example reproduces the error:

````python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

output = """5. **Create a function to generate a response**: Create a function that takes a user message as input, generates a response using the BlenderBot model, and returns the response.

```javascript
async function generateResponse(userMessage) {
  const inputIds = tf.tensor1d([model.vocab['<|endoftext|>'], ...userMessage.split(' ').map(word => model.vocab[word] || model.vocab['<unk>']), model.vocab['<|endoftext|>']]);
  const inputMask = tf.tensor1d([1, ...Array(userMessage.split(' ').length).fill(1), 1]);

  const output = await model.executeAsync({
    input_ids: inputIds,
    attention_mask: inputMask,
  });

  const responseTokens = output[0].dataSync();
  const response = responseTokens.map(token => model.inv_vocab[token]).join(' ').trim();

  return response;
}

"""

encoding.encode(output)
````


The model generates valid output that happens to contain `<|endoftext|>`, but `encoding.encode` raises the following error:

```
ValueError: Encountered text corresponding to disallowed special token '<|endoftext|>'. If you want this text to be encoded as a special token, pass it to allowed_special, e.g. allowed_special={'<|endoftext|>', ...}. If you want this text to be encoded as normal text, disable the check for this token by passing disallowed_special=(enc.special_tokens_set - {'<|endoftext|>'}). To disable this check for all special tokens, pass disallowed_special=().
```
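For context, tiktoken gives the caller two ways to handle special-token text at encode time: treat it as the special token itself via `allowed_special`, or treat it as ordinary text by relaxing `disallowed_special`. A minimal sketch of the difference (the token id shown in the comment is for the cl100k_base encoding used by gpt-3.5-turbo):

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")  # cl100k_base

text = "<|endoftext|>"

# Option 1: encode the text as the special token itself (a single reserved id).
print(encoding.encode(text, allowed_special={"<|endoftext|>"}))  # [100257]

# Option 2: encode the text as ordinary text (several regular token ids),
# which is what we want for counting tokens in model output.
print(encoding.encode(text, disallowed_special=()))
```

For token-length measurement, option 2 is the right choice: the string `<|endoftext|>` in a model's response is just text the model produced, not a control token we inserted.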


This PR fixes the issue by setting `disallowed_special` to the empty tuple, so special-token text is encoded as ordinary text instead of raising. You can verify the fix by running the following code:

````python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

output = """5. **Create a function to generate a response**: Create a function that takes a user message as input, generates a response using the BlenderBot model, and returns the response.

```javascript
async function generateResponse(userMessage) {
  const inputIds = tf.tensor1d([model.vocab['<|endoftext|>'], ...userMessage.split(' ').map(word => model.vocab[word] || model.vocab['<unk>']), model.vocab['<|endoftext|>']]);
  const inputMask = tf.tensor1d([1, ...Array(userMessage.split(' ').length).fill(1), 1]);

  const output = await model.executeAsync({
    input_ids: inputIds,
    attention_mask: inputMask,
  });

  const responseTokens = output[0].dataSync();
  const response = responseTokens.map(token => model.inv_vocab[token]).join(' ').trim();

  return response;
}

"""

encoding.encode(output, disallowed_special=())
````
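With `disallowed_special=()`, this call returns a list of token ids instead of raising, so the length of that list can be used as the token count exactly as before.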

sxjscience commented 3 weeks ago

@CodingWithTim The issue is rarely triggered, but it can still affect model evaluation when it occurs.
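For readers following along, the patched length calculation amounts to something like the sketch below (`response_token_len` is a hypothetical name for illustration; the real count is computed inline in Arena-Hard-Auto's answer-generation code):

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Hypothetical helper mirroring the fix; the actual length calculation
# lives inline in Arena-Hard-Auto's answer-generation script.
def response_token_len(response: str) -> int:
    # disallowed_special=() disables the special-token check, so text like
    # "<|endoftext|>" is encoded as ordinary tokens instead of raising.
    return len(encoding.encode(response, disallowed_special=()))

print(response_token_len("Hello <|endoftext|> world"))
```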

CodingWithTim commented 3 weeks ago

This is great! Thanks for contributing!