intel / intel-extension-for-transformers

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
Apache License 2.0

No Transformer-based API LLM Examples Working #831

Closed: cphoward closed this issue 7 months ago

cphoward commented 9 months ago

The Transformer-based Python API examples are not working. I've tried Python 3.7, 3.8, 3.10, and 3.11.

I am running this with Ubuntu on Intel Sapphire Rapids CPUs.

The code:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

results in:

2023-11-30 20:50:43 [INFO] CPU device is used.
2023-11-30 20:51:33 [INFO] Applying Weight Only Quantization.
2023-11-30 20:51:33 [INFO] Using LLM runtime.
Traceback (most recent call last):
  File "scratch/step2.py", line 10, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/llm-examples/lib/python3.8/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 136, in from_pretrained
    quantization_config.post_init_runtime()
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/llm-examples/lib/python3.8/site-packages/intel_extension_for_transformers/transformers/utils/quantization_config.py", line 127, in post_init_runtime
    raise ValueError(f"weight_dtype must be 'int4', 'int8'.")
ValueError: weight_dtype must be 'int4', 'int8'.

I have been unable to get any Transformer-based API examples working.

I have been able to make the standalone scripts work as described, so I know my system hardware and kernel configuration are fine.

I had originally been trying to get python_api_example.py working, but this too is broken:

Traceback (most recent call last):
  File "scripts/python_api_example.py", line 29, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/try2/lib/python3.8/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 128, in from_pretrained
    model = cls.ORIG_MODEL.from_pretrained(
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/try2/lib/python3.8/site-packages/transformers/models/auto/auto_factory.py", line 553, in from_pretrained
    model_class = get_class_from_dynamic_module(
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/try2/lib/python3.8/site-packages/transformers/dynamic_module_utils.py", line 499, in get_class_from_dynamic_module
    return get_class_in_module(class_name, final_module.replace(".py", ""))
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/try2/lib/python3.8/site-packages/transformers/dynamic_module_utils.py", line 199, in get_class_in_module
    module = importlib.import_module(module_path)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/home/caseyhoward/.cache/huggingface/modules/transformers_modules/Intel/neural-chat-7b-v1-1/b80b062deb3220b71dc9dfd4e093b298541961f1/modeling_mpt.py", line 18, in <module>
    from .hf_prefixlm_converter import add_bidirectional_mask_if_missing, convert_hf_causal_lm_to_prefix_lm
  File "/home/caseyhoward/.cache/huggingface/modules/transformers_modules/Intel/neural-chat-7b-v1-1/b80b062deb3220b71dc9dfd4e093b298541961f1/hf_prefixlm_converter.py", line 15, in <module>
    from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
ImportError: cannot import name '_expand_mask' from 'transformers.models.bloom.modeling_bloom' (/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/try2/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py)

Some other models work, but it's hit and miss. For example, the model meta-llama/Llama-2-7b-chat-hf also fails:

python scripts/python_api_example.py 
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.39it/s]
2023-11-30 22:36:41 [INFO] Applying Weight Only Quantization.
2023-11-30 22:36:41 [INFO] Using LLM runtime.
cmd: ['python', PosixPath('/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_llama.py'), '--outfile', 'ne_llama_f32.bin', '--outtype', 'f32', 'meta-llama/Llama-2-7b-chat-hf']
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Loading model file meta-llama/Llama-2-7b-chat-hf
Traceback (most recent call last):
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_llama.py", line 1271, in <module>
    main()
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_llama.py", line 1251, in main
    model_plus = load_some_model(args.model)
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_llama.py", line 1177, in load_some_model
    models_plus.append(lazy_load_file(path))
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/llm/runtime/graph/scripts/convert_llama.py", line 945, in lazy_load_file
    fp = open(path, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'meta-llama/Llama-2-7b-chat-hf'
Traceback (most recent call last):
  File "scripts/python_api_example.py", line 29, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config, trust_remote_code=True)
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 140, in from_pretrained
    model.init(
  File "/home/caseyhoward/intel-extension-for-transformers/intel_extension_for_transformers/llm/runtime/graph/venv/lib/python3.8/site-packages/intel_extension_for_transformers/llm/runtime/graph/__init__.py", line 74, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
AssertionError: Fail to convert pytorch model
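
From the traceback it looks like convert_llama.py treats the model id as a local file path and tries to open() it directly. A possible workaround (untested on my side; snapshot_download below is the standard huggingface_hub helper, not something from the ITREX examples) would be to materialize the checkpoint locally and pass that directory instead:

# Hypothetical workaround: download the gated checkpoint to a local directory
# first, then hand that path to the example so convert_llama.py opens real files.
from huggingface_hub import snapshot_download
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

local_dir = snapshot_download("meta-llama/Llama-2-7b-chat-hf")  # needs an HF token with access to this gated repo
model = AutoModelForCausalLM.from_pretrained(local_dir, load_in_4bit=True)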
Zhenzhong1 commented 9 months ago

Hi, for this error:

raise ValueError(f"weight_dtype must be 'int4', 'int8'.") ValueError: weight_dtype must be 'int4', 'int8'.

The reason is that 'pip install intel-extension-for-transformers' installs an older release that does not support the latest int4/int8 quantization feature. That is also one of the reasons you can't get any of the Transformer-based API examples working.

Please install ITREX from source and try again. I have run through the whole installation process, and it works if you follow these commands:

git clone https://github.com/intel/intel-extension-for-transformers
cd intel-extension-for-transformers
pip install -r requirements.txt
pip install transformers==4.33.1
python setup.py install

Please put your script outside of the intel-extension-for-transformers root directory, so that it imports the API from the environment you installed into rather than from the local ITREX source tree.

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Inference screenshot: (image omitted)
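
To make sure the script really picks up the installed package and not the local source tree, a quick standard-library check like this can help (a minimal sketch; only the distribution name is specific to ITREX):

# Sanity check: where is intel_extension_for_transformers imported from,
# and which version is installed in this environment?
import importlib.metadata
import intel_extension_for_transformers

print("imported from:", intel_extension_for_transformers.__file__)
print("version:", importlib.metadata.version("intel-extension-for-transformers"))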

cphoward commented 9 months ago

Following those steps, this is what I get:

python3.8 -m venv venv
. venv/bin/activate
git clone https://github.com/intel/intel-extension-for-transformers
cd intel-extension-for-transformers
pip install -r requirements.txt
pip install transformers==4.33.1
python setup.py install
error: huggingface-hub 0.19.4 is installed but huggingface_hub<0.18,>=0.16.4 is required by {'tokenizers'}
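
A quick way to see which package pins huggingface_hub below 0.18 is to dump the declared requirements of the packages pip mentions (a minimal standard-library sketch; the package list is taken from the error messages):

# Show installed versions and declared requirements of the packages pip
# complained about, to find the one that pins huggingface_hub < 0.18.
import importlib.metadata as md

for pkg in ("tokenizers", "transformers", "datasets", "huggingface-hub"):
    try:
        print(pkg, md.version(pkg), md.requires(pkg))
    except md.PackageNotFoundError:
        print(pkg, "is not installed")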

Attempting to resolve that error with

pip install huggingface-hub==0.17.3

#Output
Collecting huggingface-hub==0.17.3
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.0/295.0 kB 6.1 MB/s eta 0:00:00
Requirement already satisfied: packaging>=20.9 in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (23.2)
Requirement already satisfied: requests in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (2.31.0)
Requirement already satisfied: fsspec in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (2023.10.0)
Requirement already satisfied: filelock in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (3.13.1)
Requirement already satisfied: tqdm>=4.42.1 in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (4.66.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (4.8.0)
Requirement already satisfied: pyyaml>=5.1 in ./venv/lib/python3.8/site-packages (from huggingface-hub==0.17.3) (6.0.1)
Requirement already satisfied: certifi>=2017.4.17 in ./venv/lib/python3.8/site-packages (from requests->huggingface-hub==0.17.3) (2023.11.17)
Requirement already satisfied: idna<4,>=2.5 in ./venv/lib/python3.8/site-packages (from requests->huggingface-hub==0.17.3) (3.6)
Requirement already satisfied: charset-normalizer<4,>=2 in ./venv/lib/python3.8/site-packages (from requests->huggingface-hub==0.17.3) (3.3.2)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./venv/lib/python3.8/site-packages (from requests->huggingface-hub==0.17.3) (2.1.0)
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.19.4
    Uninstalling huggingface-hub-0.19.4:
      Successfully uninstalled huggingface-hub-0.19.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.15.0 requires huggingface-hub>=0.18.0, but you have huggingface-hub 0.17.3 which is incompatible.
intel-extension-for-transformers 1.3rc2.dev30+g318e5cbf22 requires transformers==4.34.1, but you have transformers 4.33.1 which is incompatible.
Successfully installed huggingface-hub-0.17.3

Rerunning:

python setup.py install

results in:

Beginning with Matplotlib 3.8, Python 3.9 or above is required.
You are using Python 3.8.18.

This may be due to an out of date pip.

Make sure you have pip >= 9.0.1.

So retrying with Python 3.9:

sudo apt install python3.9 python3.9-venv python3.9-dev python3-pip;
python3.9 -m venv python3.9-venv
. python3.9-venv/bin/activate
pip install -U pip
pip install -r requirements.txt
pip install transformers==4.33.1
python setup.py install

Complains about:

error: huggingface-hub 0.19.4 is installed but huggingface_hub<0.18,>=0.16.4 is required by {'tokenizers'}

So I install the preferred huggingface-hub version and rerun setup.py install:

pip install huggingface-hub==0.17.3
python setup.py install

# output
Collecting huggingface-hub==0.17.3
  Downloading huggingface_hub-0.17.3-py3-none-any.whl.metadata (13 kB)
Requirement already satisfied: filelock in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (3.13.1)
Requirement already satisfied: fsspec in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (2023.10.0)
Requirement already satisfied: requests in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (2.31.0)
Requirement already satisfied: tqdm>=4.42.1 in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (4.66.1)
Requirement already satisfied: pyyaml>=5.1 in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (6.0.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (4.8.0)
Requirement already satisfied: packaging>=20.9 in ./python3.9-venv/lib/python3.9/site-packages (from huggingface-hub==0.17.3) (23.2)
Requirement already satisfied: charset-normalizer<4,>=2 in ./python3.9-venv/lib/python3.9/site-packages (from requests->huggingface-hub==0.17.3) (3.3.2)
Requirement already satisfied: idna<4,>=2.5 in ./python3.9-venv/lib/python3.9/site-packages (from requests->huggingface-hub==0.17.3) (3.6)
Requirement already satisfied: urllib3<3,>=1.21.1 in ./python3.9-venv/lib/python3.9/site-packages (from requests->huggingface-hub==0.17.3) (2.1.0)
Requirement already satisfied: certifi>=2017.4.17 in ./python3.9-venv/lib/python3.9/site-packages (from requests->huggingface-hub==0.17.3) (2023.11.17)
Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 295.0/295.0 kB 7.0 MB/s eta 0:00:00
Installing collected packages: huggingface-hub
  Attempting uninstall: huggingface-hub
    Found existing installation: huggingface-hub 0.19.4
    Uninstalling huggingface-hub-0.19.4:
      Successfully uninstalled huggingface-hub-0.19.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.15.0 requires huggingface-hub>=0.18.0, but you have huggingface-hub 0.17.3 which is incompatible.
intel-extension-for-transformers 1.3rc2.dev30+g318e5cbf22 requires transformers==4.34.1, but you have transformers 4.33.1 which is incompatible.
Successfully installed huggingface-hub-0.17.3

# Outputs redacted of copious quantities of log lines
Using /home/caseyhoward/intel-extension-for-transformers/python3.9-venv/lib/python3.9/site-packages
Finished processing dependencies for intel-extension-for-transformers==1.3rc2.dev30+g318e5cbf22

Running:

# example.py
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Results in:

Traceback (most recent call last):
# ...
# Stacktrace removed
# ...
  File "/home/caseyhoward/intel-extension-for-transformers/python3.9-venv/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 179, in check_imports
    raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: einops. Run `pip install einops`
Traceback (most recent call last):
  File "/home/caseyhoward/ex.py", line 10, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
  File "/home/caseyhoward/intel-extension-for-transformers/python3.9-venv/lib/python3.9/site-packages/intel_extension_for_transformers-1.3rc2.dev30+g318e5cbf22-py3.9-linux-x86_64.egg/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 179, in from_pretrained
    model.init(
  File "/home/caseyhoward/intel-extension-for-transformers/python3.9-venv/lib/python3.9/site-packages/intel_extension_for_transformers-1.3rc2.dev30+g318e5cbf22-py3.9-linux-x86_64.egg/intel_extension_for_transformers/llm/runtime/graph/__init__.py", line 122, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
AssertionError: Fail to convert pytorch model
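
The first traceback already names the actual problem (missing einops); a small pre-flight check like the following (standard library only, the package list is just what check_imports reported) would surface it before the conversion step hides it behind the generic "Fail to convert pytorch model" assertion:

# Pre-flight check for optional packages required by trust_remote_code models;
# "einops" is the one transformers' check_imports reported above.
import importlib.util

missing = [name for name in ("einops",) if importlib.util.find_spec(name) is None]
if missing:
    raise SystemExit(f"missing optional dependencies: {missing} (install them with pip)")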

Running

pip install einops
python example.py

Results in:

model_quantize_internal: model size  = 25362.62 MB
model_quantize_internal: quant size  =  4737.50 MB
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950000
model.cpp: loading model from runtime_outs/ne_mpt_q_int4_jblas_cint8_g32.bin
init: n_vocab    = 50279
init: n_embd     = 4096
init: n_mult     = 4096
init: n_head     = 32
init: n_layer    = 32
init: n_rot      = 32
init: n_ff       = 16384
init: n_parts    = 1
load: ne ctx size = 4737.55 MB
load: mem required  = 12929.55 MB (+ memory per state)
..................................................................................................
model_init_from_file: support_jblas_kv = 1
model_init_from_file: kv self size =  276.00 MB
Once upon a time, there existed a little girl, who was born in the year of the dragon. She was born in the year of the dragon because her mother was born in the year of the dragon. Her mother was born in the year of the dragon because her grandmother was born in the year of the dragon. Her grandmother was born in the year of the dragon because her grandfather was born in the year of the dragon. Her grandfather was born in the year of the dragon because his father was born in the year of the dragon. His father was born in the year of the dragon because his mother was born in the year of the dragon. Her mother was born in the year of the dragon because her father was born in the year of the dragon. Her father was born in the year of the dragon because his mother was born in the year of the dragon. Her mother was born in the year of the dragon because her father was born in the year of the dragon. Her father was born in the year of the dragon because his mother was born in the year of the dragon. Her mother was born in the year of the dragon because her father was born in the year of the dragon. Her father was born in the year of the dragon because his mother was born in the year of the dragon. Her mother was born in the year of the dragon because her father was born in the year of the dragon. Her father was born in the year of the dragon because his mother was born in the year of the dragon. Her mother was born in

It works.

Based on my experience, I have some thoughts:

Might there be a way to update the package in the PyPI and Conda repos so that pip install intel-extension-for-transformers works? I think it would significantly help adoption of this software and of Intel's AMX capabilities.

a32543254 commented 9 months ago

Thanks very much for your valuable suggestion! We will update the README and consider fixing the dependency issues. Regards, Bo

CaoHaiNam commented 9 months ago

I followed exactly what @cphoward did, but I am still encountering a similar error.

2023-12-15 16:13:10 [INFO] Applying Weight Only Quantization.
2023-12-15 16:13:10 [INFO] Using LLM runtime.
Traceback (most recent call last):
  File "/hdd4/namch/example.py", line 21, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, cache_dir='/hdd4/namch/.cache')
  File "/hdd4/namch/miniconda3/envs/python3.9/lib/python3.9/site-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 136, in from_pretrained
    quantization_config.post_init_runtime()
  File "/hdd4/namch/miniconda3/envs/python3.9/lib/python3.9/site-packages/intel_extension_for_transformers/transformers/utils/quantization_config.py", line 127, in post_init_runtime
    raise ValueError(f"weight_dtype must be 'int4', 'int8'.")
ValueError: weight_dtype must be 'int4', 'int8'.

I am uncertain whether my server hardware is unsupported. Here is the hardware information:

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             20
On-line CPU(s) list:                0-19
Thread(s) per core:                 2
Core(s) per socket:                 10
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              165
Model name:                         Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz
Stepping:                           5
CPU MHz:                            4900.000
CPU max MHz:                        5300.0000
CPU min MHz:                        800.0000
BogoMIPS:                           7399.70
Virtualization:                     VT-x
L1d cache:                          320 KiB
L1i cache:                          320 KiB
L2 cache:                           2.5 MiB
L3 cache:                           20 MiB
NUMA node0 CPU(s):                  0-19
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_capabilities

Thank you so much!

Zhenzhong1 commented 9 months ago

@CaoHaiNam Hi, the current problem is most likely not caused by your hardware. Please check my earlier comment in this issue to fix this error.

If your hardware does not support it, the error will only show up when you run inference with the model, so no worries. Please feel free to comment on this issue if you encounter any problems.
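
If you want to check up front which instruction sets your CPU exposes, the feature flags the runtime prints at load time (see the AVX/AMX line in the log above) can be read from /proc/cpuinfo on Linux; a rough sketch (the flag names are taken from that log line, and missing flags generally mean slower kernels rather than a hard failure):

# Rough Linux-only feature check; flag names follow the runtime's log line
# (AVX, AVX2, AVX512F, AVX_VNNI, AVX512_VNNI, AMX_INT8, AMX_BF16).
wanted = ("avx", "avx2", "avx512f", "avx_vnni", "avx512_vnni", "amx_int8", "amx_bf16")
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break
print({name: name in flags for name in wanted})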

rajivmehtaflex commented 8 months ago

(Quotes @Zhenzhong1's earlier comment with the build-from-source instructions and the example script in full.)

I followed these instructions.

Now I get a different error:

{ "name": "KeyError", "message": "'mistral'", "stack": "--------------------------------------------------------------------------- KeyError Traceback (most recent call last) Cell In[3], line 10 7 inputs = tokenizer(prompt, return_tensors=\"pt\").input_ids 8 streamer = TextStreamer(tokenizer) ---> 10 model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True) 11 outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

File ~/.conda/envs/gptexp/lib/python3.10/site-packages/intel_extension_for_transformers-1.4.dev20+ge6ecb21ce5-py3.10-linux-x86_64.egg/intel_extension_for_transformers/transformers/modeling/modeling_auto.py:265, in _BaseQBitsAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs) 262 from intel_extension_for_transformers.llm.runtime.graph import Model 264 model = Model() --> 265 model.init( 266 pretrained_model_name_or_path, 267 weight_dtype=quantization_config.weight_dtype, 268 alg=quantization_config.scheme, 269 group_size=quantization_config.group_size, 270 scale_dtype=quantization_config.scale_dtype, 271 compute_dtype=quantization_config.compute_dtype, 272 use_ggml=quantization_config.use_ggml, 273 use_quant=quantization_config.use_quant, 274 use_gptq=quantization_config.use_gptq, 275 ) 276 return model 277 else:

File ~/.conda/envs/gptexp/lib/python3.10/site-packages/intel_extension_for_transformers-1.4.dev20+ge6ecb21ce5-py3.10-linux-x86_64.egg/intel_extension_for_transformers/llm/runtime/graph/init.py:79, in Model.init(self, model_name, use_quant, use_gptq, quant_kwargs) 78 def init(self, model_name, use_quant=True, use_gptq=False, quant_kwargs): ---> 79 self.config = AutoConfig.from_pretrained(model_name, trust_remote_code=True) 80 self.tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True) 81 self.model_type = Model.get_model_type(self.config)

File ~/.conda/envs/gptexp/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:1039, in AutoConfig.from_pretrained(cls, pretrained_model_name_or_path, kwargs) 1037 return config_class.from_pretrained(pretrained_model_name_or_path, kwargs) 1038 elif \"model_type\" in config_dict: -> 1039 config_class = CONFIG_MAPPING[config_dict[\"model_type\"]] 1040 return config_class.from_dict(config_dict, **unused_kwargs) 1041 else: 1042 # Fallback: use pattern matching on the string. 1043 # We go from longer names to shorter names to catch roberta before bert (for instance)

File ~/.conda/envs/gptexp/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:734, in _LazyConfigMapping.getitem(self, key) 732 return self._extra_content[key] 733 if key not in self._mapping: --> 734 raise KeyError(key) 735 value = self._mapping[key] 736 module_name = model_type_to_module_name(key)

KeyError: 'mistral'" }

Note: I'm using Intel Developer Cloud.

Zhenzhong1 commented 8 months ago

Hi, sorry for the late reply; I didn't receive any email notification :(

I can't see the screenshot you shared. Please upload it again.

Please provide more details so that I can reproduce your error: the script, the Hugging Face model name / card id, and the commands.

Zhenzhong1 commented 7 months ago

@rajivmehtaflex

Hi, the cause of this error is probably your transformers version. Please update to the latest transformers release; older versions do not recognize the 'mistral' model type, which is why you see KeyError: 'mistral'.
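
A quick way to confirm this before reinstalling is to check whether the installed transformers already knows the 'mistral' model type (CONFIG_MAPPING is the same mapping that raised the KeyError in your traceback; Mistral support landed around transformers 4.34):

# Does the installed transformers know the "mistral" model type?
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print(transformers.__version__, "mistral" in CONFIG_MAPPING)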

Zhenzhong1 commented 7 months ago

@rajivmehtaflex Hi, this issue has been fixed, so I'll close it for now. If you have more questions, please feel free to ask and @ me.