huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RuntimeError: Internal: could not parse ModelProto from /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct/tokenizer.model #34017

Open · Itime-ren opened 1 month ago

Itime-ren commented 1 month ago

System Info

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.

Traceback (most recent call last):
  File "/Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 479, in <module>
    main()
  File "/Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 457, in main
    write_tokenizer(
  File "/Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py", line 367, in write_tokenizer
    tokenizer = tokenizer_class(input_tokenizer_path)
  File "/home/transformers/src/transformers/models/llama/tokenization_llama_fast.py", line 157, in __init__
    super().__init__(
  File "/home/transformers/src/transformers/tokenization_utils_fast.py", line 132, in __init__
    slow_tokenizer = self.slow_tokenizer_class(*args, **kwargs)
  File "/home/transformers/src/transformers/models/llama/tokenization_llama.py", line 171, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/transformers/src/transformers/models/llama/tokenization_llama.py", line 198, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: could not parse ModelProto from /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct/tokenizer.model
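The error is raised inside sentencepiece itself (`SentencePieceProcessor_LoadFromFile`), before transformers touches the vocabulary, so the file at that path is not a SentencePiece `ModelProto`. Llama 3.x checkpoints ship a tiktoken BPE ranks file under the name `tokenizer.model`, which sentencepiece cannot parse. A minimal sketch to confirm this outside of transformers (the path is the one from the traceback):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
try:
    # Same call that fails inside the converter.
    sp.Load("/Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct/tokenizer.model")
except RuntimeError as err:
    # For a Llama 3.x checkpoint this reproduces:
    # "Internal: could not parse ModelProto from .../tokenizer.model"
    print(err)
```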

Who can help?

@ArthurZucker @itazap

Reproduction

python3 /Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct \
    --model_size 1B \
    --output_dir /Data_disk/meta_llama/meta_llama3.2/out
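Note that this command never tells the converter which Llama generation it is converting, so it presumably falls back to the SentencePiece-based tokenizer path. A sketch of a possible fix, assuming the script on your checkout exposes a `--llama_version` flag that accepts `3.2` (the flag name is an assumption here, not confirmed in this thread; verify with `--help`):

```bash
# Assumption: --llama_version makes the converter treat tokenizer.model as a
# tiktoken BPE file instead of a SentencePiece model. Check
# `python3 convert_llama_weights_to_hf.py --help` before relying on it.
python3 /Data_disk/transformers/src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct \
    --model_size 1B \
    --llama_version 3.2 \
    --output_dir /Data_disk/meta_llama/meta_llama3.2/out
```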

Expected behavior

The conversion should complete and write the converted weights as safetensors files to the output directory.

LysandreJik commented 1 month ago

Hey @Itime-ren, what's the content of /Data_disk/meta_llama/meta_llama3.2/Llama3.2-1B-Instruct?

If you're trying to use Llama 3.2 1B Instruct, why don't you use this repo, which is already transformers-compatible?
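For reference, a minimal sketch of loading the pre-converted checkpoint directly, assuming the Hub repo id is `meta-llama/Llama-3.2-1B-Instruct` (an assumption on my part; it also requires accepting the license on the Hub and authenticating, e.g. via `huggingface-cli login`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hub repo id for the already-converted checkpoint; loading from the
# Hub sidesteps the local weight conversion entirely.
model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```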

github-actions[bot] commented 1 day ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.