guidance-ai / guidance

A guidance language for controlling large language models.

eos id not in self.tokens in GrammarlessTokenizer #658

Open nath1295 opened 7 months ago

nath1295 commented 7 months ago

The bug: Some special tokens can have ids that fall outside the vocab size reported by transformers; this can happen with fine-tuned models that add extra special tokens to the original tokenizer. It causes the Tokenizer object to fail to initialise, because those ids are used as indices but are out of range in self.tokens.

To Reproduce: I was using the GGUF model from "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" and serving it behind an OpenAI-compatible API. The tokenizer I was using was from the original repository "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO", loaded with transformers.AutoTokenizer.from_pretrained.

This is the code I'm running; the model name registered with the API is not important, so let's just call it "hosted_model".

from transformers import AutoTokenizer
from guidance import models

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

nousmixtral = models.OpenAI(model='hosted_model', tokenizer=tokenizer, base_url='http://localhost:5001/v1/', api_key='asdf')

This is the error I got:

IndexError: index 32000 is out of bounds for axis 0 with size 32000

To dig deeper into the problem, I checked the vocab size and the special tokens.

tokenizer.vocab_size # returned 32000

tokenizer.get_added_vocab() # returned {'<unk>': 0, '<s>': 1, '</s>': 2, '<|im_end|>': 32000, '<|im_start|>': 32001}

So apparently Nous added these new special tokens to comply with the ChatML format, but they are not counted in the vocab size reported by transformers.
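
For anyone checking their own tokenizer, a quick way to spot this is that tokenizer.vocab_size only counts the base vocabulary, while len(tokenizer) also counts the added tokens (a minimal check using the tokenizer from this issue; the numbers in the comments are what the outputs above imply for this model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

print(tokenizer.vocab_size)   # 32000 -> base vocabulary only
print(len(tokenizer))         # 32002 -> base vocabulary plus the added special tokens
print([i for i in tokenizer.get_added_vocab().values() if i >= tokenizer.vocab_size])   # [32000, 32001] -> the ids missing from byte_tokens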

I looked further into the code of the GrammarlessTokenizer class, which is responsible for handling tokenizers from transformers, and found this in its __init__:

        # a transformer tokenizer was given that has a byte_decoder
        elif hasattr(tokenizer, "byte_decoder"):
            byte_tokens = []
            for i in range(tokenizer.vocab_size):
                byte_coded = bytes([tokenizer.byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
                byte_tokens.append(byte_coded)
            bos_token_id = tokenizer.bos_token_id
            eos_token_id = tokenizer.eos_token_id

        # a transformer tokenizer was given with byte_decoder
        elif hasattr(tokenizer, "convert_ids_to_tokens"):
            byte_tokens = [bytes(tokenizer.convert_tokens_to_string(['a', tokenizer.convert_ids_to_tokens(i)])[1:], encoding="utf8") for i in range(tokenizer.vocab_size)]
            bos_token_id = tokenizer.bos_token_id
            eos_token_id = tokenizer.eos_token_id

        # a HuggingFace tokenizers tokenizer was given with id_to_token
        elif hasattr(tokenizer, "id_to_token"):
            a_token_ids = tokenizer.encode("a").ids
            if len(a_token_ids) == 3:
                bos_token_id = a_token_ids[0]
                a_id = a_token_ids[1]
                eos_token_id = a_token_ids[2]
            else:
                raise Exception("This tokenizer does not seem to have a BOS and EOS, support for this need to be implemented still.")

            byte_tokens = [bytes(tokenizer.decode([a_id, i])[1:], encoding="utf8") for i in range(tokenizer.get_vocab_size())]
            for i,b in enumerate(byte_tokens):
                if b == b'':
                    byte_tokens[i] = bytes(tokenizer.id_to_token(i), encoding="utf8")

Apparently, the eos token is never added to the byte_tokens list. Using hasattr(tokenizer, "convert_ids_to_tokens"), I figured out which branch my Nous Hermes Mixtral tokenizer falls into, so I modified that branch in the _grammarless.py script.
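
The failure itself is just this indexing problem in miniature: byte_tokens gets exactly vocab_size entries, but the eos id points one past the end. A toy illustration (not the guidance code, just the shape of the error above):

import numpy as np

byte_tokens = [b"token"] * 32000    # one entry per base-vocab id, as built by the loop above
tokens = np.array(byte_tokens, dtype="object")
eos_token_id = 32000                # <|im_end|> sits outside the base vocabulary
tokens[eos_token_id]                # IndexError: index 32000 is out of bounds for axis 0 with size 32000

The patch below simply appends the missing entries for the added tokens: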

# changes on "elif hasattr(tokenizer, "convert_ids_to_tokens"):"
        # a transformer tokenizer was given with byte_decoder
        elif hasattr(tokenizer, "convert_ids_to_tokens"):
            byte_tokens = [bytes(tokenizer.convert_tokens_to_string(['a', tokenizer.convert_ids_to_tokens(i)])[1:], encoding="utf8") for i in range(tokenizer.vocab_size)]
            bos_token_id = tokenizer.bos_token_id
            eos_token_id = tokenizer.eos_token_id
            vocab_size = tokenizer.vocab_size
            # also cover added special tokens (e.g. <|im_end|> = 32000) whose ids sit beyond vocab_size
            for v in tokenizer.get_added_vocab().values():
                if v >= vocab_size:
                    byte_tokens.append(bytes(tokenizer.convert_tokens_to_string(['a', tokenizer.convert_ids_to_tokens(v)])[1:], encoding="utf8"))

After reinstalling the package with this change, it seems to work. I suppose similar logic could be implemented for the other elif branches that handle transformers tokenizers, along the lines of the sketch below.
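
The same idea can be pulled into a small helper that each branch calls after building its base byte_tokens list. This is only a sketch, not the upstream implementation, and extend_byte_tokens_with_added_vocab is a made-up name; it mirrors the convert_ids_to_tokens branch, so the byte_decoder branch would need its own conversion.

def extend_byte_tokens_with_added_vocab(tokenizer, byte_tokens):
    # Append byte strings for added special tokens whose ids fall outside
    # tokenizer.vocab_size, in ascending id order so that list indices keep
    # matching token ids (assumes the added ids sit right after the base
    # vocab, as with <|im_end|> = 32000 and <|im_start|> = 32001 here).
    vocab_size = tokenizer.vocab_size
    extra_ids = sorted(v for v in tokenizer.get_added_vocab().values() if v >= vocab_size)
    for token_id in extra_ids:
        token = tokenizer.convert_ids_to_tokens(token_id)
        byte_tokens.append(bytes(tokenizer.convert_tokens_to_string(['a', token])[1:], encoding="utf8"))
    return byte_tokens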

System info (please complete the following information):

oflakne26 commented 2 weeks ago

It's September of 2024 and I'm still running into this issue with the official release, attempting to load a GGUF with a GrammarlessTokenizer built from Transformers' AutoTokenizer.

I'll try using the recommended solution.

Edit: Works like a charm! Thank you so much! :)