intel / intel-extension-for-transformers


Cannot run llama3 8b instruct: `AssertionError: Fail to convert pytorch model` #1522

Open · N3RDIUM opened this issue 4 months ago

N3RDIUM commented 4 months ago

Hey there! I'm trying to run llama3-8b-instruct with Intel Extension for Transformers.

Here's my code:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, 
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
    device_map="cpu"
)

messages = [
    {"role": "system", "content": "You are a JSON chatbot who always responds with JSON in the following format: {'message': 'your message here!'}"},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

Here's the error:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-04-30 11:46:44 [INFO] cpu device is used.
2024-04-30 11:46:44 [INFO] Applying Weight Only Quantization.
2024-04-30 11:46:44 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 1/4 [00:02<00:07,  2.67s/it]Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
kevinintel commented 4 months ago

Thanks for reporting it, we will check the issue

Zhenzhong1 commented 4 months ago

@N3RDIUM Hi, according to the errors:

Loadding the model from HF.
Loading checkpoint shards:  25%|█████████████████████████████████████████████████▊                                                                                                                                                     | 1/4 [00:02<00:07,  2.67s/it]Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llm.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It seems you didn't download the model successfully. Please download the model from HF to local disk and try again.

Just set model_id to the local path:

model_id = "/home/model/llama3_8b_instruct-chat"

Another issue is that model.device is not defined on this model object. [screenshot]
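If .device is indeed missing on the quantized model object, a sketch of a workaround (assuming CPU execution, as device_map="cpu" was requested) is to keep the inputs on CPU and skip the move:

# Sketch: apply_chat_template returns CPU tensors by default, so the
# .to(model.device) call (and the missing .device attribute) can be dropped.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
)  # already on CPU; no device move needed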

N3RDIUM commented 3 months ago

I tried downloading the model again and using the local path as the model ID, but now it gives me this error:


2024-05-17 11:29:11 [INFO] cpu device is used.
2024-05-17 11:29:11 [INFO] Applying Weight Only Quantization.
2024-05-17 11:29:11 [INFO] Quantize model by Neural Speed with RTN Algorithm.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', '/home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/']
Loadding the model from the local path.
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00001-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00002-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00003-of-00004.safetensors
Loading model file /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298/model-00004-of-00004.safetensors
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1489, in <module>
    main()
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main
    vocab = load_vocab(vocab_dir, params.n_vocab)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1380, in load_vocab
    raise FileNotFoundError(
FileNotFoundError: Could not find tokenizer.model in /home/n3rdium/llama3-8b/models--meta-llama--Meta-Llama-3-8B-Instruct/snapshots/e5e23bbe8e749ef0efcf16cad411a7d23bd23298 or its parent; if it's in another directory,                 pass the directory as --vocab-dir
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 604, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
           ^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Fail to convert pytorch model
N3RDIUM commented 3 months ago

Does this lib support *.pth models? I could go for the original/ dir: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main/original

Zhenzhong1 commented 3 months ago

@N3RDIUM

Hi,

File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1474, in main vocab = load_vocab(vocab_dir, params.n_vocab)

The code shown in your traceback may be incompatible, which means your ITREX or Neural Speed version is a little old. Compare with the current script: https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py

I ran the code successfully the last time I replied to you~. Please reinstall the latest main-branch ITREX and Neural Speed from source~

N3RDIUM commented 3 months ago

Okay, will try. Thanks for the quick reply!

N3RDIUM commented 3 months ago

It's running out of memory on `python -m neural_speed.convert.convert_llama --outfile runtime_outs/ne_llama_f16.bin --outtype f16 --model_hub huggingface meta-llama/Meta-Llama-3-8B-Instruct`

N3RDIUM commented 3 months ago

Whoops! Closed it by mistake. Anyway, is there any way to reduce memory usage when loading the model from HF? I tried without ITREX and it runs just fine :(

N3RDIUM commented 3 months ago

Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

Zhenzhong1 commented 3 months ago

Hi, @N3RDIUM

reduce memory usage when loading the model from HF? I tried without itrex and it runs just fine :(

Everyone uses the same function to load the model from HF:

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    load_in_4bit=True,
    attn_implementation="flash_attention_2",
    device_map="cpu"
)

The possible difference is at https://github.com/intel/neural-speed/blob/main/neural_speed/convert/convert_llama.py#L1485

Please set low_cpu_mem_usage=False before installation. According to my previous tests, it can sometimes reduce virtual memory usage.
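For context, low_cpu_mem_usage is a standard transformers from_pretrained flag; the snippet below is only a sketch of the kind of loading call the convert script performs, not the exact neural_speed code. With True, weights are loaded lazily/memory-mapped, which can inflate virtual memory; False forces a regular eager load.

# Sketch under the assumptions above; the checkpoint path is hypothetical.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "/path/to/local/llama3-8b-instruct",  # hypothetical local checkpoint
    low_cpu_mem_usage=False,  # the flag suggested above
)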

Great, now I get AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

No worries. Just set up a new conda env, reinstall requirements.txt, and rebuild ITREX + NS from source. These issues should disappear, I think. I have checked the installation pipeline again using the latest ITREX and NS branches; it works.

Convert: [screenshot]

Quant: [screenshot]

Inference: [screenshot]

Successful installation screenshots (to check whether your install worked):

ITREX: [screenshot]

NS: [screenshot]

Version: [screenshot]

N3RDIUM commented 3 months ago

I have the same versions as you, yet it gives me the same error: AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

N3RDIUM commented 3 months ago

Oops, closed it by mistake again, extremely sorry

N3RDIUM commented 3 months ago

I'm not using conda, just a Python venv. Does that have something to do with this?

N3RDIUM commented 3 months ago

Here is the error now:

(.venv) .venv ❯ /mnt/code/Code/jarvis/.venv/bin/python /mnt/code/Code/jarvis/llama3.py
_zsh_autosuggest_highlight_reset:3: maximum nested function level reached; increase FUNCNEST?
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-17 13:58:42 [INFO] cpu device is used.
2024-05-17 13:58:42 [INFO] Applying Weight Only Quantization.
2024-05-17 13:58:42 [INFO] Quantize model by Neural Speed with RTN Algorithm.
The model_type: Llama3.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
cmd: ['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']
Loadding the model from HF.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 19.01it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1526, in <module>
    main()
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py", line 1490, in main
    cache_path = Path(tokenizer.vocab_file).parent
                      ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?
Traceback (most recent call last):
  File "/mnt/code/Code/jarvis/llama3.py", line 8, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 633, in from_pretrained
    model.init( # pylint: disable=E1123
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/__init__.py", line 205, in init
    convert_model(model_name, fp32_bin, "f32", model_hub=model_hub)
  File "/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/__init__.py", line 55, in convert_model
    subprocess.run(cmd, check=True)
  File "/usr/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', PosixPath('/mnt/code/Code/jarvis/.venv/lib/python3.12/site-packages/neural_speed/convert/convert_llama.py'), '--outfile', 'runtime_outs/ne_llama_f32.bin', '--outtype', 'f32', '--model_hub', 'huggingface', 'meta-llama/Meta-Llama-3-8B-Instruct']' returned non-zero exit status 1.
N3RDIUM commented 3 months ago

Which versions of transformers and PyTorch are you on?

Zhenzhong1 commented 3 months ago

AttributeError: 'PreTrainedTokenizerFast' object has no attribute 'vocab_file'. Did you mean: 'vocab_size'?

This error is probably related to the transformers version.

Try this: [screenshot]
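To compare environments, a quick sketch that prints the versions in question (both packages expose a standard __version__ attribute):

import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)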

Ujjawal-K-Panchal commented 3 months ago

Facing the same issue for the given Dockerfile.