intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library
Apache License 2.0

Does this library support Qwen/Qwen2-7B-Instruct? #85

Open qwebug opened 4 days ago

qwebug commented 4 days ago

When I tested Qwen2-7B-Instruct with this library, it reported the following error.

from transformers import AutoTokenizer
from intel_npu_acceleration_library import NPUModelForCausalLM, int4

model_id = "Qwen/Qwen2-7B-Instruct"

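# Compile the model for the NPU with int4 weight quantization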
model = NPUModelForCausalLM.from_pretrained(model_id, dtype=int4, use_cache=True, force_download=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
>python qwen2.py
Compiling model Qwen/Qwen2-7B-Instruct int4 for the NPU
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 663/663 [00:00<?, ?B/s]
D:\Anaconda\envs\intel-npu\lib\site-packages\huggingface_hub\file_download.py:157: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\xxx\.cache\huggingface\hub\models--Qwen--Qwen2-7B-Instruct. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
D:\Anaconda\envs\intel-npu\lib\site-packages\huggingface_hub\file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 663/663 [00:00<?, ?B/s]
model.safetensors.index.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 27.8k/27.8k [00:00<00:00, 1.45MB/s]
model-00001-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3.95G/3.95G [34:31<00:00, 1.90MB/s]
model-00002-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3.86G/3.86G [06:10<00:00, 10.4MB/s]
model-00003-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3.86G/3.86G [07:02<00:00, 9.16MB/s]
model-00004-of-00004.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 3.56G/3.56G [05:26<00:00, 10.9MB/s]
Downloading shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [53:16<00:00, 799.16s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:08<00:00, 17.20s/it]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 243/243 [00:00<00:00, 243kB/s]
Exporting model Qwen/Qwen2-7B-Instruct to cache\models\Qwen_Qwen2-7B-Instruct_d64faeef4d03155cf9c03fdc8c2870328572c61dfefd15113ae308178b986f27_v1.3.0
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.29k/1.29k [00:00<?, ?B/s]
vocab.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.78M/2.78M [00:01<00:00, 2.51MB/s]
merges.txt: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.67M/1.67M [00:01<00:00, 1.55MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.03M/7.03M [00:01<00:00, 4.40MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loc(fused<{name = "MatMul_48", type = "MatMul"}>["MatMul_48", "as_convolution"]): error: Got wrong shape for NCE Convolution 'filter' '[3584, 6320, 1, 1]', expected '[3584, 1, 1, 6336]'
Traceback (most recent call last):
  File "D:\Desktop\workspace\NPU\intel-npu\intel-npu-acceleration-library\examples\qwen2.py", line 27, in <module>
    generated_ids = model.generate(
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\transformers\generation\utils.py", line 1736, in generate
    result = self._sample(
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\transformers\generation\utils.py", line 2375, in _sample
    outputs = self(
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 1149, in forward
    outputs = self.model(
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 1034, in forward
    layer_outputs = decoder_layer(
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 761, in forward
    hidden_states = self.mlp(hidden_states)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\transformers\models\qwen2\modeling_qwen2.py", line 179, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\intel_npu_acceleration_library\nn\linear.py", line 158, in forward
    out = run_matmul(x, self.weight, self.scale, self.op_id)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\intel_npu_acceleration_library\backend\runtime.py", line 105, in run_matmul
    _model_cache[key] = deque([create_op(inC, outC, batch)])
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\intel_npu_acceleration_library\backend\qlinear.py", line 39, in __init__
    self.compile()
  File "D:\Anaconda\envs\intel-npu\lib\site-packages\intel_npu_acceleration_library\backend\factory.py", line 755, in compile
    backend_lib.compile(self._mm)
OSError: [WinError -529697949] Windows Error 0xe06d7363
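
One possible reading of the shape mismatch, offered as a hedged sketch rather than a confirmed diagnosis: the NCE compiler expects the filter's reduced dimension rounded up to the next multiple of 64 (6336 = 64 × 99), while the int4 lowering of Qwen2-7B's MLP produces 6320, which is not 64-aligned. The arithmetic, as an illustration only (not library code):

def round_up(value: int, multiple: int) -> int:
    # Round value up to the nearest multiple (ceiling division).
    return -(-value // multiple) * multiple

in_c = 6320                # filter inC reported in the error message
print(round_up(in_c, 64))  # 6336, the shape the NCE compiler expected
print(in_c % 64)           # 48 -> not 64-aligned, hence the compile failure
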
alessandropalla commented 2 days ago

Very interesting. What driver version do you have?

qwebug commented 2 days ago

npu_win_32.0.100.2540
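
As a side note for reports like this, the Python package versions can be captured alongside the driver version (the driver version itself comes from the Windows driver package, as above). A minimal sketch, assuming the library is installed under its PyPI distribution name:

import torch
import transformers
from importlib.metadata import version

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("intel_npu_acceleration_library:", version("intel_npu_acceleration_library"))
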

alessandropalla commented 2 days ago

I can replicate the error; I'll take a look

qwebug commented 2 days ago

I also found another error when testing MiniCPM-Llama3-V-2_5 with this library.

# miniCPM.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
import intel_npu_acceleration_library
from intel_npu_acceleration_library import NPUModelForCausalLM, int4

model_id = 'openbmb/MiniCPM-Llama3-V-2_5'
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.float16)
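# Lower the model to the NPU, quantizing linear weights to int4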
model = intel_npu_acceleration_library.compile(model, dtype=int4)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open('australia.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': question}]

res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True, # if sampling=False, beam_search will be used by default
    temperature=0.7,
    # system_prompt='' # pass system_prompt if needed
)
print(res)

## To use streaming, make sure sampling=True and stream=True;
## model.chat will then return a generator
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')
>python miniCPM.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 7/7 [00:14<00:00,  2.09s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "D:\Desktop\workspace\NPU\intel-npu\intel-npu-acceleration-library\examples\miniCPM.py", line 19, in <module>
    res = model.chat(
  File "C:\Users\xxx\.cache\huggingface\modules\transformers_modules\openbmb\MiniCPM-Llama3-V-2_5\45387f99a455e11801b78a0b24811856688e0c8b\modeling_minicpmv.py", line 454, in chat
    res, vision_hidden_states = self.generate(
  File "C:\Users\xxx\.cache\huggingface\modules\transformers_modules\openbmb\MiniCPM-Llama3-V-2_5\45387f99a455e11801b78a0b24811856688e0c8b\modeling_minicpmv.py", line 354, in generate
    ) = self.get_vllm_embedding(model_inputs)
  File "C:\Users\xxx\.cache\huggingface\modules\transformers_modules\openbmb\MiniCPM-Llama3-V-2_5\45387f99a455e11801b78a0b24811856688e0c8b\modeling_minicpmv.py", line 99, in get_vllm_embedding
    vision_embedding = self.vpm(all_pixel_values.type(dtype), patch_attention_mask=patch_attn_mask).last_hidden_state
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\transformers\models\idefics2\modeling_idefics2.py", line 715, in forward
    hidden_states = self.embeddings(pixel_values=pixel_values, patch_attention_mask=patch_attention_mask)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\transformers\models\idefics2\modeling_idefics2.py", line 167, in forward
    patch_embeds = self.patch_embedding(pixel_values)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\modules\module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\modules\module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\intel_npu_acceleration_library\nn\conv.py", line 112, in forward
    inp_unf = torch.nn.functional.unfold(
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\functional.py", line 4814, in unfold
    return handle_torch_function(
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\overrides.py", line 1619, in handle_torch_function
    result = mode.__torch_function__(public_api, types, args, kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\intel_npu_acceleration_library\device.py", line 66, in __torch_function__
    return super_fn(*args, **kwargs or {})
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\intel_npu_acceleration_library\device.py", line 60, in super_fn
    return func(*args, **kwargs)
  File "D:\Anaconda\envs\intel-npu-pure\lib\site-packages\torch\nn\functional.py", line 4817, in unfold
    return torch._C._nn.im2col(input, _pair(kernel_size), _pair(dilation), _pair(padding), _pair(stride))
TypeError: im2col(): argument 'padding' (position 4) must be tuple of ints, but found element of type str at pos 0
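
The root cause here looks different from the Qwen2 failure. Idefics2's vision tower creates its patch embedding as nn.Conv2d(..., padding="valid"), so the layer's padding attribute is the string "valid", and the library's Conv2d replacement passes it straight through to torch.nn.functional.unfold, which accepts only ints. A hedged sketch of a possible normalization inside the replacement's forward (illustration only, not the library's actual code; attribute names follow torch.nn.Conv2d conventions):

# Normalize torch's string padding modes before calling unfold, which
# rejects strings. Assumes dilation 1 for the "same" case.
padding = self.padding
if isinstance(padding, str):
    if padding == "valid":
        # "valid" means no padding at all
        padding = 0
    elif padding == "same":
        # "same" preserves spatial size; torch enforces stride 1 here
        padding = tuple((k - 1) // 2 for k in self.kernel_size)

inp_unf = torch.nn.functional.unfold(
    x, self.kernel_size, dilation=self.dilation, padding=padding, stride=self.stride
)
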