Open nath1295 opened 7 months ago
It's September of 2024 and I'm still running into this issue through the official release, attempting to load a GGUF with a GrammarlessTokenizer built from Transformers' AutoTokenizer.
I'll try using the recommended solution.
Edit: Works like a charm! Thank you so much! :)
The bug
Some special tokens have ids that are beyond the vocab size reported by `transformers`; this can happen with fine-tuned models that add extra special tokens to the original tokenizer. It causes the Tokenizer object to fail to initialise, because those ids are used as indices and fall out of range in `self.tokens`.
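In other words, something like this (hypothetical numbers, just to illustrate the failure mode):

```python
# Illustration only: the token list is built over the base vocab,
# so an added special token's id indexes past the end of the list.
vocab_size = 32000                     # base vocab size reported by transformers
tokens = [b"?"] * vocab_size           # built from range(vocab_size)
eos_token_id = 32000                   # an added special token, id >= vocab_size
tokens[eos_token_id]                   # IndexError: list index out of range
```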
To Reproduce
I was using the GGUF model from "TheBloke/Nous-Hermes-2-Mixtral-8x7B-DPO-GGUF" and hosted it behind an OpenAI-compatible API. The tokenizer I was using was from the original repository "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO", loaded with
`transformers.AutoTokenizer.from_pretrained`.
This is the code I'm running; the model name I used in the API is not important, so let's just call it "hosted_model".
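Roughly like the snippet below. This is a sketch rather than the exact script, and it assumes `models.OpenAI` accepts a `tokenizer` argument plus the usual OpenAI-client keyword arguments (`base_url`, `api_key`); the endpoint details are placeholders:

```python
from guidance import models
from transformers import AutoTokenizer

# Tokenizer from the original (non-GGUF) repository.
tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

# The GGUF model is served behind an OpenAI-compatible API under the name
# "hosted_model". The error below is raised while this model object is built,
# because the GrammarlessTokenizer is constructed from the passed-in tokenizer.
lm = models.OpenAI(
    "hosted_model",
    tokenizer=tokenizer,
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="sk-placeholder",             # placeholder key
)
```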
This is the error I got:
To dig deeper into the problem, I checked the vocab size and the special tokens.
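Something along these lines (output paraphrased from what I remember):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO")

print(tokenizer.vocab_size)            # base vocab size only (32000 for this Mistral-based tokenizer)
print(len(tokenizer))                  # total size including added special tokens
print(tokenizer.added_tokens_decoder)  # the extra ChatML tokens, e.g. <|im_start|> / <|im_end|>
print(tokenizer.eos_token, tokenizer.eos_token_id)  # the eos token id sits at or above vocab_size
```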
So apparently Nous added these new special tokens to comply with the ChatML format, but they are not counted in the vocab size reported by `transformers`.
I looked further into the code of the class `GrammarlessTokenizer`, which is responsible for dealing with tokenizers from `transformers`, and I found the following in the class `__init__` code:
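Paraphrasing the relevant branch (this is my reading of it, not a verbatim copy of the guidance source): `byte_tokens` is only filled in for ids below `tokenizer.vocab_size`, so anything added after the base vocab never gets an entry:

```python
# Paraphrase of the branch in GrammarlessTokenizer.__init__ that handles
# transformers-style tokenizers: byte_tokens is built over range(vocab_size),
# so added special tokens (ids >= vocab_size) are never included.
if hasattr(tokenizer, "convert_ids_to_tokens"):
    byte_tokens = [
        bytes(
            tokenizer.convert_tokens_to_string(["a", tokenizer.convert_ids_to_tokens(i)])[1:],
            encoding="utf8",
        )
        for i in range(tokenizer.vocab_size)
    ]
```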
Apparently, the eos token is not added to the `byte_tokens` list. With `hasattr(tokenizer, "convert_ids_to_tokens")`, I figured out the type of tokenizer I have with the Nous-Hermes Mixtral model. So I modified the code in the `_grammarless.py` script:
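A sketch of the change (my local patch, written from memory; the idea is just to extend `byte_tokens` past the base vocab so the added special token ids resolve):

```python
# Modified branch in _grammarless.py (sketch): after covering the base vocab,
# also append entries for the tokenizer's added special tokens so that ids
# >= vocab_size (e.g. the ChatML eos token) no longer index out of range.
if hasattr(tokenizer, "convert_ids_to_tokens"):
    byte_tokens = [
        bytes(
            tokenizer.convert_tokens_to_string(["a", tokenizer.convert_ids_to_tokens(i)])[1:],
            encoding="utf8",
        )
        for i in range(tokenizer.vocab_size)
    ]
    # Added: cover the special tokens appended after the base vocab.
    for i in range(tokenizer.vocab_size, len(tokenizer)):
        byte_tokens.append(bytes(tokenizer.convert_ids_to_tokens(i), encoding="utf8"))
```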
After reinstalling the package with this change, it seems to work. I suppose similar logic could be implemented for the other `elif` conditions for `transformers` tokenizers.

System info (please complete the following information):
`guidance.__version__`: 0.1.11 (installed from source, as PyPI does not have this version).