Open N3RDIUM opened 4 months ago
Thanks for reporting it, we will check the issue
@N3RDIUM Hi, according to the error output:
Loadding the model from HF.
Loading checkpoint shards: 25%|█████████████████████████████████████████████████▊ | 1/4 [00:02<00:07, 2.67s/it]
Traceback (most recent call last):
File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
It seems you didn't download the model successfully. Please download the model from HF to the local disk and try again.
Then just set model_id to the local path:
model_id = "/home/model/llama3_8b_instruct-chat"
Another issue is that model.device is not defined in your code.
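Since the first traceback stops at shard 1/4, it may help to confirm that the local folder is complete before pointing model_id at it. The following is a minimal sketch, assuming the checkpoint is sharded in safetensors format with a model.safetensors.index.json file next to the shards; the helper name missing_shards is hypothetical and not part of ITREX or Neural Speed.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return files named by the shard index that are absent from model_dir."""
    root = Path(model_dir)
    index_file = root / "model.safetensors.index.json"
    if not index_file.is_file():
        # No index: a single-file checkpoint or an incomplete download.
        return [index_file.name]
    # "weight_map" maps each tensor name to the shard file that stores it.
    weight_map = json.loads(index_file.read_text())["weight_map"]
    shards = sorted(set(weight_map.values()))
    # is_file() follows symlinks, so broken cache links also count as missing.
    return [name for name in shards if not (root / name).is_file()]
```

Calling missing_shards("/home/model/llama3_8b_instruct-chat") should return an empty list when every shard listed in the index is present.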
I tried downloading the model again and using the local path as the model ID, but it gives me this error now:
2024-05-17 11:29:11 [INFO] cpu device is used.
2024-05-17 11:29:11 [INFO] Applying Weight Only Quantization.
2024-05-17 11:29:11 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', '/home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/']
Loadding the model from the local path.
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Traceback (most recent call last):
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1489, in <module>
main()
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main
vocab = load_vocab(vocab_dir, params.n_vocab)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1380, in load_vocab
raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298 or its parent; if it's in another directory, pass the directory as --vocab-dir
Traceback (most recent call last):
File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
model.init( # pylint: disable=E1123
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
Does this lib support *.pth models? I could go for the original/ dir: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main/original
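The FileNotFoundError above says the converter looked for tokenizer.model in the snapshot folder or its parent. A small sketch like the one below can show which tokenizer files a local folder actually contains. The candidate filenames and the extra original/ lookup are assumptions for illustration, and find_tokenizer_files is a hypothetical helper, not the converter's own load_vocab logic.

```python
from pathlib import Path

# Assumed filenames: "tokenizer.model" is what the error message asks for,
# "tokenizer.json" is the file commonly used by fast tokenizers.
CANDIDATES = ("tokenizer.model", "tokenizer.json")


def find_tokenizer_files(model_dir: str) -> dict[str, Path]:
    """Map each candidate filename to the first place it is found."""
    root = Path(model_dir)
    # Folder itself, its parent (as in the error message), and the
    # original/ subfolder mentioned in this thread (a guess).
    search_dirs = (root, root.parent, root / "original")
    found: dict[str, Path] = {}
    for name in CANDIDATES:
        for folder in search_dirs:
            candidate = folder / name
            if candidate.is_file():
                found[name] = candidate
                break
    return found
```

If the result has no "tokenizer.model" entry, that would be consistent with the error above; if it is found under original/, that folder could be tried as the --vocab-dir value the error message suggests.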
@N3RDIUM
Hi,
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main vocab = load_vocab(vocab_dir, params.n_vocab)
The code in your traceback does not match the current converter, which means your ITREX or Neural Speed version is a little old. Compare with https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py
I ran the code successfully the last time I replied to you. Please try reinstalling ITREX and Neural Speed from source, using the latest main branch.
Okay, will try. Thanks for the quick reply!
It's running out of memory on: python -m neural_speed.convert.convert_llama --outfile runtime_outs/ne_llama_f16.bin --outtype f16 --model_hub huggingface meta-llama/Meta-Llama-3-8B-Instruct
Whoops! Closed it by mistake. Anyway, is there any way to reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(
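For a rough sense of scale, the size of the raw weights at the two --outtype values seen in this thread can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, assuming "8B" means roughly 8 billion parameters; it counts only the weights and ignores any temporary copies made while converting.

```python
def weights_size_gib(n_params: float, bytes_per_param: float) -> float:
    """Size of the raw weights in GiB for a given storage width."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 8e9  # assumption: "8B" taken as about 8 billion parameters

for label, width in (("f32", 4), ("f16", 2)):
    print(f"{label}: ~{weights_size_gib(N_PARAMS, width):.1f} GiB")
# f32: ~29.8 GiB
# f16: ~14.9 GiB
```

Under that assumption, an f32 conversion needs about twice the memory of an f16 one for the weights alone, so a machine that handles plain half-precision loading may still run out of memory during conversion.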
Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
Hi, @N3RDIUM
reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(
Everyone uses the same function to load the model from HF: model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, load_in_4bit=True, attn_implementation="flash_attention_2", device_map="cpu")
One possible difference is this line in the converter: https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py#L1485
Please set low_cpu_mem_usage=False there before installing. In my earlier tests, this sometimes reduced virtual memory usage.
Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
No worries. Just set up a new conda env, then reinstall the requirements from requirements.txt and install ITREX + NS from source. I think these issues will disappear. I checked the installation pipeline again using the latest ITREX and NS branches, and it works.
[Screenshots not preserved: convert, quant, and inference runs; successful installation of ITREX and NS (to check whether your installation succeeded); installed versions.]
I have the same versions as you, yet it gives me the same error: AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
Oops, did it again, extremely sorry
I'm not using conda, just python venv. Does that have something to do with this?
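One way to check whether the environment matters: the logged cmd line shows the converter being started as a subprocess through a plain 'python' command, so it is worth confirming that 'python' on PATH is the same interpreter as the one running the script. This is a minimal sketch using only the standard library; describe_environment is a hypothetical helper name.

```python
import os
import shutil
import sys


def describe_environment() -> dict[str, object]:
    """Collect basic facts about the interpreter and environment in use."""
    return {
        # Interpreter running this script.
        "interpreter": sys.executable,
        # True inside a venv: prefix differs from the base installation.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by conda when an environment is activated, otherwise None.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        # What a bare "python" command would resolve to, if anything.
        "python_on_path": shutil.which("python"),
    }
```

If "python_on_path" points somewhere other than the venv's bin folder, the converter subprocess may be running against different installed packages than the main script.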
Here is the error now:
(.venv) .venv ❯ /mnt/code/Code/jarvis/.venv/bin/python /mnt/code/Code/jarvis/llama3.py
_zsh_autosuggest_highlight_reset:3: maximum nested function level reached; increase FUNCNEST?
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-17 13:58:42 [INFO] cpu device is used.
2024-05-17 13:58:42 [INFO] Applying Weight Only Quantization.
2024-05-17 13:58:42 [INFO] Quantize model by Neural Speed with RTN Algorithm.
The model_type: Llama3.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 19.01it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1526, in <module>
main()
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1490, in main
cache_path = Path(tokenizer.vocab_file).parent
^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
Traceback (most recent call last):
File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 633, in from_pretrained
model.init( # pylint: disable=E1123
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 205, in init
convert_model(model_name, fp32_bin, "f32", model_hub=model_hub)
File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/__init__.py", line 55, in convert_model
subprocess.run(cmd, check=True)
File "/usr/lib/python3.12/subprocess.py", line 571, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']' returned non-zero exit status 1.
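The failing line reads tokenizer.vocab_file directly, and the tokenizer object in this run has no such attribute. The sketch below only illustrates that failure mode and a defensive lookup with getattr. It is not the fix used in Neural Speed, and both stub classes are hypothetical stand-ins for real tokenizer objects.

```python
from pathlib import Path
from typing import Optional


class FastTokenizerStub:
    """Hypothetical stand-in: has vocab_size but no vocab_file attribute."""
    vocab_size = 32  # arbitrary value for illustration


class SlowTokenizerStub:
    """Hypothetical stand-in: exposes a vocab_file path."""
    vocab_file = "/tmp/example/tokenizer.model"  # made-up path


def vocab_dir(tokenizer: object) -> Optional[Path]:
    """Return the folder holding vocab_file, or None if there is none."""
    vocab_file = getattr(tokenizer, "vocab_file", None)
    return Path(vocab_file).parent if vocab_file else None
```

With this lookup, vocab_dir(FastTokenizerStub()) returns None instead of raising AttributeError, so the caller can fall back to another way of locating the tokenizer files.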
Which version of transformers and pytorch are you on?
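A quick way to collect those versions from inside the same environment is importlib.metadata from the standard library. This is a small sketch; the distribution names for Neural Speed and ITREX are assumptions based on the package names in this thread.

```python
from importlib import metadata

# Assumed distribution names; adjust if the packages are installed differently.
PACKAGES = (
    "transformers",
    "torch",
    "neural-speed",
    "intel-extension-for-transformers",
)


def installed_versions(names=PACKAGES) -> dict[str, str]:
    """Return the installed version of each package, or a marker if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version}")
```

Running it with the venv's interpreter reports what that interpreter can actually import, which is useful when comparing against a working setup.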
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'? This error probably comes from the transformers version.
try this
Facing the same issue for the given Dockerfile.
Hey there! I'm trying to run llama3-8b-instruct with Intel Extension for Transformers.
Here's my code:
Here's the error: