guidance-ai / guidance

A guidance language for controlling large language models.

eos id not in self.tokens in GrammarlessTokenizer #658

Open nath1295 opened 7 months ago

nath1295 commented 7 months ago

The bug: Some special tokens can have ids that fall outside the vocab size reported by transformers; this can happen with fine-tuned models that add extra special tokens to the original tokenizer. It causes the Tokenizer object to fail to initialise, because those ids are used as indices but are out of range in self.tokens.

To Reproduce: I was using the GGUF model from "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" and serving it behind an OpenAI-compatible API. The tokenizer I was using was from the original repository "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO", loaded with transformers.AutoTokenizer.from_pretrained.

This is the code I'm running; the model name registered with the API is not important, so let's just call it "hosted_model".

from transformers import AutoTokenizer
from guidance import models

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

nousmixtral = models.OpenAI(model='hosted_model', tokenizer=tokenizer, base_url='http://localhost:5001/v1/', api_key='asdf')

This is the error I got:

IndexError: index 32000 is out of bounds for axis 0 with size 32000

To dig deeper into the problem, I checked the vocab size and the special tokens.

tokenizer.vocab_size # returned 32000

tokenizer.get_added_vocab() # returned {'<unk>': 0, '<s>': 1, '</s>': 2, '<|im_end|>': 32000, '<|im_start|>': 32001}

So apparently Nous added these new special tokens to comply with the ChatML format, but they are not counted in the vocab size reported by transformers.
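
For anyone checking their own tokenizer, a quick way to spot this is that tokenizer.vocab_size only counts the base vocabulary, while len(tokenizer) also counts the added tokens (a minimal check using the tokenizer from this issue; the numbers in the comments are what the outputs above imply for this model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

print(tokenizer.vocab_size)   # 32000 -> base vocabulary only
print(len(tokenizer))         # 32002 -> base vocabulary plus the added special tokens
print([i for i in tokenizer.get_added_vocab().values() if i >= tokenizer.vocab_size])   # [32000, 32001] -> the ids missing from byte_tokens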

I looked further into the code of the GrammarlessTokenizer class, which is responsible for handling tokenizers from transformers, and found this in its __init__:

        # a transformer tokenizer was given that has a byte_decoder
        elif hasattr(tokenizer, "byte_decoder"):
            byte_tokens = []
            for i in range(tokenizer.vocab_size):
                byte_coded = bytes([tokenizer.byte_decoder[c] for c in tokenizer.convert_ids_to_tokens(i)])
                byte_tokens.append(byte_coded)
            bos_token_id = tokenizer.bos_token_id
            eos_token_id = tokenizer.eos_token_id

        # a transformer tokenizer was given with byte_decoder
        elif hasattr(tokenizer, "convert_ids_to_tokens"):
            byte_tokens = [bytes(tokenizer.convert_tokens_to_string(['a', tokenizer.convert_ids_to_tokens(i)])[1:], encoding="utf8") for i in range(tokenizer.vocab_size)]
            bos_token_id = tokenizer.bos_token_id
            eos_token_id = tokenizer.eos_token_id

        # a HuggingFace tokenizers tokenizer was given with id_to_token
        elif hasattr(tokenizer, "id_to_token"):
            a_token_ids = tokenizer.encode("a").ids
            if len(a_token_ids) == 3:
                bos_token_id = a_token_ids[0]
                a_id = a_token_ids[1]
                eos_token_id = a_token_ids[2]
            else:
                raise Exception("This tokenizer does not seem to have a BOS and EOS, support for this need to be implemented still.")

            byte_tokens = [bytes(tokenizer.decode([a_id, i])[1:], encoding="utf8") for i in range(tokenizer.get_vocab_size())]
            for i,b in enumerate(byte_tokens):
                if b == b'':
                    byte_tokens[i] = bytes(tokenizer.id_to_token(i), encoding="utf8")

Apparently, the eos token is never added to the byte_tokens list. Using hasattr(tokenizer, "convert_ids_to_tokens"), I figured out which branch my Nous Hermes Mixtral tokenizer falls into, so I modified that branch in the _grammarless.py script.
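
The failure itself is just this indexing problem in miniature: byte_tokens gets exactly vocab_size entries, but the eos id points one past the end. A toy illustration (not the guidance code, just the shape of the error above):

import numpy as np

byte_tokens = [b"token"] * 32000    # one entry per base-vocab id, as built by the loop above
tokens = np.array(byte_tokens, dtype="object")
eos_token_id = 32000                # <|im_end|> sits outside the base vocabulary
tokens[eos_token_id]                # IndexError: index 32000 is out of bounds for axis 0 with size 32000

The patch below simply appends the missing entries for the added tokens: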

# changes on "elif hasattr(tokenizer, "convert_ids_to_tokens"):"
        # a transformer tokenizer was given with byte_decoder
        elif hasattr(tokenizer, "convert_ids_to_tokens"):
            byte_tokens = [bytes(tokenizer.convert_tokens_to_string(['a', tokenizer.convert_ids_to_tokens(i)])[1:], encoding="utf8") for i in range(tokenizer.vocab_size)]
            bos_token_id = tokenizer.bos_token_id
            eos_token_id = tokenizer.eos_token_id
            vocab_size = tokenizer.vocab_size
            # also cover added special tokens (e.g. <|im_end|> = 32000) whose ids sit beyond vocab_size
            for v in tokenizer.get_added_vocab().values():
                if v >= vocab_size:
                    byte_tokens.append(bytes(tokenizer.convert_tokens_to_string(['a', tokenizer.convert_ids_to_tokens(v)])[1:], encoding="utf8"))

After reinstalling the package with this change, it seems to work. I suppose similar logic could be implemented for the other elif branches that handle transformers tokenizers, along the lines of the sketch below.
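
The same idea can be pulled into a small helper that each branch calls after building its base byte_tokens list. This is only a sketch, not the upstream implementation, and extend_byte_tokens_with_added_vocab is a made-up name; it mirrors the convert_ids_to_tokens branch, so the byte_decoder branch would need its own conversion.

def extend_byte_tokens_with_added_vocab(tokenizer, byte_tokens):
    # Append byte strings for added special tokens whose ids fall outside
    # tokenizer.vocab_size, in ascending id order so that list indices keep
    # matching token ids (assumes the added ids sit right after the base
    # vocab, as with <|im_end|> = 32000 and <|im_start|> = 32001 here).
    vocab_size = tokenizer.vocab_size
    extra_ids = sorted(v for v in tokenizer.get_added_vocab().values() if v >= vocab_size)
    for token_id in extra_ids:
        token = tokenizer.convert_ids_to_tokens(token_id)
        byte_tokens.append(bytes(tokenizer.convert_tokens_to_string(['a', token])[1:], encoding="utf8"))
    return byte_tokens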

System info (please complete the following information):

oflakne26 commented 2 weeks ago

It's September of 2024 and I'm still running into this issue with the official release, attempting to load a GGUF with a GrammarlessTokenizer built from Transformers' AutoTokenizer.

I'll try using the recommended solution.

Edit: Works like a charm! Thank you so much! :)