meta-llama / llama3

The official Meta Llama 3 GitHub site

getting issues with tokenizer #116

Open Anushagudipati opened 2 months ago

Anushagudipati commented 2 months ago

Unable to load the tokenizer using AutoTokenizer.from_pretrained().

Errors:

```
tokenizer = AutoTokenizer.from_pretrained(model_id)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 120, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
```

+++++++++++++++++++++++++++++++++++

```
config.json: 100%|██████████| 654/654 [00:00<00:00, 6.03MB/s]
special_tokens_map.json: 100%|██████████| 73.0/73.0 [00:00<00:00, 797kB/s]
tokenizer_config.json: 100%|██████████| 51.0k/51.0k [00:00<00:00, 55.3MB/s]
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'PreTrainedTokenizerFast'.
The class this function is called from is 'LlamaTokenizer'.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Traceback (most recent call last):
  File "/home/ubuntu/llama3-8b-base.py", line 28, in <module>
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 843, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2082, in _from_pretrained
    slow_tokenizer = (cls.slow_tokenizer_class)._from_pretrained(
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/ubuntu/venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 209, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/home/ubuntu/venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
```
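For context on the second traceback: the `TypeError: not a string` appears to come from sentencepiece being handed `None` as the model path. Llama 3 checkpoints ship a fast-tokenizer `tokenizer.json` but no sentencepiece model file, so the slow `LlamaTokenizer` has no `vocab_file` to load. A minimal stdlib sketch of that precondition (`find_sentencepiece_model` is a hypothetical helper, not a transformers API):

```python
from pathlib import Path

def find_sentencepiece_model(repo_dir: str):
    """Return the path of a sentencepiece *.model file in a checkpoint
    directory, or None if there is none (the Llama 3 case).

    The slow LlamaTokenizer passes such a value straight into
    sentencepiece's LoadFromFile, which raises "TypeError: not a string"
    when it receives None instead of a path.
    """
    matches = sorted(Path(repo_dir).glob("*.model"))
    return str(matches[0]) if matches else None
```

A Llama 2 checkpoint directory (which contains `tokenizer.model`) would yield a path here; a Llama 3 directory yields `None`, which is exactly the value the slow-tokenizer path then crashes on.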

liu904-61 commented 2 months ago

Has this problem been solved? I ran into it as well.

HamidShojanazeri commented 2 months ago

@liu904-61 @Anushagudipati can you please upgrade to the latest transformers release, 4.40.1? It should include the latest fixes.
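Before concluding the upgrade didn't help, it's worth confirming which transformers build the failing script actually sees; a stale virtualenv is a common cause of "upgraded but still broken". A minimal stdlib check (run inside the same venv as the script):

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string for a package, or None if it
    is not installed in the current environment."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Should print 4.40.1 (or newer) after a successful upgrade:
print(installed_version("transformers"))
```

If this prints an older version or `None`, pip installed into a different environment than the one `/home/ubuntu/llama3-8b-base.py` runs in.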

EmilyInTheUS commented 2 months ago

Hey @HamidShojanazeri, I am still having the same issue after upgrading transformers to 4.40.1.

Xiaoyinggit commented 2 months ago

I also encountered this problem. Has it been solved?

xieziyi881 commented 1 month ago

You need to change the function you call. My .py script was using LlamaTokenizer.from_pretrained(); changing it to AutoTokenizer.from_pretrained() fixed the issue.
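A simplified sketch of why the switch works (not the actual transformers implementation): AutoTokenizer reads the `tokenizer_class` field from the checkpoint's tokenizer_config.json and instantiates that class, whereas calling LlamaTokenizer.from_pretrained() directly forces the slow sentencepiece path that Llama 3 checkpoints don't support. The warning in the log above confirms the checkpoint declares 'PreTrainedTokenizerFast'.

```python
import json

def pick_tokenizer_class(tokenizer_config_json: str) -> str:
    """Very simplified model of AutoTokenizer's dispatch: honor the
    tokenizer_class declared by the checkpoint instead of hard-coding one."""
    config = json.loads(tokenizer_config_json)
    return config.get("tokenizer_class", "PreTrainedTokenizerFast")

# Llama 3's config declares the fast class (value taken from the warning above):
print(pick_tokenizer_class('{"tokenizer_class": "PreTrainedTokenizerFast"}'))
# → PreTrainedTokenizerFast
```

Hard-coding `LlamaTokenizer` bypasses this dispatch entirely, which is why it fails only for Llama 3 checkpoints and not for Llama 2 ones.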