intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

ipex-llm with vllm failed to run on Core ultra 7 165H iGPU #11843

Open wluo1007 opened 3 weeks ago

wluo1007 commented 3 weeks ago

Platform: Core ultra 7 165H iGPU

Model: Qwen/Qwen2-7B-Instruct

Following the steps on https://testbigdldocshane.readthedocs.io/en/perf-docs/doc/LLM/Quickstart/vLLM_quickstart.html#
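For context, the offline_inference.py from that quickstart is essentially the stock vLLM offline example with the ipex-llm engine class swapped in. A minimal sketch of what I'm running (the import path and the device/load_in_low_bit arguments follow the ipex-llm vLLM examples, so treat the exact parameter names as approximate rather than a verbatim copy of the script):

```python
from vllm import SamplingParams
from ipex_llm.vllm.xpu.engine import IPEXLLMClass as LLM  # ipex-llm wrapper around vLLM's LLM

# Sample prompts and sampling settings, as in the stock vLLM offline example.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model on the Intel GPU; ipex-llm converts the weights to sym_int4 on the fly.
llm = LLM(model="/home/user/Qwen2-7B-Instruct",
          device="xpu",
          dtype="float16",
          enforce_eager=True,
          load_in_low_bit="sym_int4",
          tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```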

When running python offline_inference.py, the following error occurs:

(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/vllm$ python offline_inference.py
/home/user/vllm_ipex_env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/user/vllm_ipex_env/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''
If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-19 13:40:32,185 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-19 13:40:32 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-19 13:40:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/user/Qwen2-7B-Instruct', tokenizer='/home/user/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-19 13:40:32 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-19 13:40:34,255 - INFO - Converting the current model to sym_int4 format......
2024-08-19 13:40:34,255 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-08-19 13:40:38,071 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-19 13:40:40 model_convert.py:249] Loading model weights took 4.5222 GB
LLVM ERROR: Diag: aborted

LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 165H]
Registry and code: 13 MB
Command: python offline_inference.py
Uptime: 19.974422 s
Aborted (core dumped)

I've also tried the whole process on a data center dGPU (Flex), which works fine, so I'm wondering if this issue only occurs on the iGPU.

hzjane commented 3 weeks ago

We haven't tested it on the MTL iGPU before. I tried to reproduce it but encountered a different error. Maybe you can try it in Docker according to this docker guide.
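Roughly, launching the serving container looks like the following (this is a sketch based on the guide: the image tag, container name, and model mount path here are placeholders, so please follow the guide for the exact command):

```bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest   # image name per the guide; pick the documented tag
export CONTAINER_NAME=ipex-llm-serving-xpu-container

sudo docker run -itd \
        --net=host \
        --device=/dev/dri \
        --memory="32G" \
        --shm-size="16g" \
        -v /path/to/models:/llm/models \
        --name=$CONTAINER_NAME \
        $DOCKER_IMAGE

# Then enter the container and run the vLLM examples under /llm
sudo docker exec -it $CONTAINER_NAME bash
```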

wluo1007 commented 3 weeks ago

Tried Docker, still got an error. Do you have plans for iGPU support?

root@user-Meteor-Lake-Client-Platform:/llm# python vllm_offline_inference.py
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''
If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-20 12:50:38,388 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-20 12:50:38 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-20 12:50:38 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/llm/Qwen2-7B-Instruct', tokenizer='/llm/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-20 12:50:38 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-20 12:50:46,567 - INFO - Converting the current model to sym_int4 format......
2024-08-20 12:50:46,568 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
[2024-08-20 12:50:46,802] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
2024-08-20 12:50:51,297 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-20 12:50:53 model_convert.py:249] Loading model weights took 4.5222 GB
error:
Traceback (most recent call last):
  File "/llm/vllm_offline_inference.py", line 48, in <module>
    llm = LLM(model="/llm/Qwen2-7B-Instruct",
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 109, in __init__
    self.llm_engine = IPEXLLMLLMEngine.from_engine_args(engine_args,
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 144, in from_engine_args
    engine = cls(engine_configs,
  File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 116, in __init__
    super().__init__(*args, **kwargs)
  File "/llm/vllm/vllm/engine/llm_engine.py", line 106, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/llm/vllm/vllm/executor/gpu_executor.py", line 46, in __init__
    self._init_cache()
  File "/llm/vllm/vllm/executor/gpu_executor.py", line 92, in _init_cache
    self.driver_worker.profile_num_available_blocks(
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/llm/vllm/vllm/worker/worker.py", line 136, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/llm/vllm/vllm/worker/model_runner.py", line 645, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/llm/vllm/vllm/worker/model_runner.py", line 581, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 316, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 257, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 210, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/models/qwen2.py", line 156, in forward
    attn_output = self.attn(q, k, v, k_cache, v_cache, input_metadata)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/llm/vllm/vllm/model_executor/layers/attention/attention.py", line 62, in forward
    return self.backend.forward(query, key, value, key_cache, value_cache,
  File "/llm/vllm/vllm/model_executor/layers/attention/backends/torch_sdpa.py", line 97, in forward
    out = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The program was built for 1 devices
Build program log for 'Intel(R) Graphics [0x7d55]':
warning: module got recompiled from IR because provided native binary is incompatible with underlying device and/or driver [-Wrecompiled-from-ir]
IGC: LLVM Error: VISA builder API call failed: CisaBuilder->Compile( BC->isaDumpsEnabled() && BC->hasShaderDumper() ? BC->getShaderDumper().composeDumpPath("final.isaasm").c_str() : "", BC->emitVisaOnly()) -11 (PI_ERROR_BUILD_PROGRAM_FAILURE)

gc-fu commented 3 weeks ago

Hi, I have verified that vLLM works on the iGPU with the chatglm3-6b model on Linux, and it does not hit the problem you mentioned in this thread.

The vLLM we provide does have a problem related to Qwen2-7B-Instruct, but it should not report the error in your first post.

Can you provide the result of https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/scripts/env-check.sh?

Besides, we will check whether we can fix the problem related to the Qwen2 series models.
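For reference, the script can be run from a checkout of the repo, e.g. (paths assume a fresh clone):

```bash
git clone https://github.com/intel-analytics/ipex-llm.git
cd ipex-llm/python/llm/scripts
# Prints Python/transformers/torch/ipex-llm versions plus GPU driver and xpu-smi info
bash env-check.sh
```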

wluo1007 commented 3 weeks ago

Hi, thanks for the reply. The previous environment is no longer available, so I've set up a recent one and tried both chatglm3-6b and qwen2-7b-instruct; both give the same error message, shown below.

(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/vllm$ python offline_inference.py
/home/user/vllm_ipex_env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/user/vllm_ipex_env/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''
If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-22 16:28:32,793 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-22 16:28:32 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-22 16:28:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/user/Qwen2-7B-Instruct', tokenizer='/home/user/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-22 16:28:33 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-22 16:28:34,887 - INFO - Converting the current model to sym_int4 format......
2024-08-22 16:28:34,888 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-08-22 16:28:38,493 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-22 16:28:41 model_convert.py:257] Loading model weights took 4.5222 GB
Traceback (most recent call last):
  File "/home/user/vllm/offline_inference.py", line 48, in <module>
    llm = LLM(model="/home/user/Qwen2-7B-Instruct",
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/vllm/xpu/engine/engine.py", line 109, in __init__
    self.llm_engine = IPEXLLMLLMEngine.from_engine_args(engine_args,
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/vllm/xpu/engine/engine.py", line 144, in from_engine_args
    engine = cls(engine_configs,
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/vllm/xpu/engine/engine.py", line 116, in __init__
    super().__init__(*args, **kwargs)
  File "/home/user/vllm/vllm/engine/llm_engine.py", line 106, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/user/vllm/vllm/executor/gpu_executor.py", line 46, in __init__
    self._init_cache()
  File "/home/user/vllm/vllm/executor/gpu_executor.py", line 92, in _init_cache
    self.driver_worker.profile_num_available_blocks(
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/vllm/vllm/worker/worker.py", line 136, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/vllm/vllm/worker/model_runner.py", line 645, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/vllm/vllm/worker/model_runner.py", line 589, in execute_model
    output = self.model.sample(
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/vllm/xpu/model_convert.py", line 59, in _Qwen2_sample
    next_tokens = self.sampler(lm_head_weight, hidden_states,
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/vllm/vllm/model_executor/layers/sampler.py", line 70, in forward
    logits = self._get_logits(hidden_states, embedding, embedding_bias)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/vllm/xpu/model_convert.py", line 83, in _sample_get_logits
    logits = embedding(hidden_states)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/transformers/low_bit_linear.py", line 718, in forward
    x = reshape_lm_head_input(x)
  File "/home/user/vllm_ipex_env/lib/python3.10/site-packages/ipex_llm/transformers/low_bit_linear.py", line 356, in reshape_lm_head_input
    x = x[:, -1, :].view(shape)
IndexError: too many indices for tensor of dimension 2

Below is the result of running env-check.sh:

(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/ipex-llm/python/llm/scripts$ ./env-check.sh

PYTHON_VERSION=3.10.12

transformers=4.37.0

torch=2.1.0a0+cxx11.abi

ipex-llm Version: 2.1.0b20240821

ipex=2.1.10+xpu

CPU Information:
  Architecture:          x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
  CPU(s):                22
  On-line CPU(s) list:   0-21
  Vendor ID:             GenuineIntel
  Model name:            Intel(R) Core(TM) Ultra 7 165H
  CPU family:            6
  Model:                 170
  Thread(s) per core:    2
  Core(s) per socket:    16
  Socket(s):             1
  Stepping:              4
  CPU max MHz:           5000.0000
  CPU min MHz:           400.0000
  BogoMIPS:              6144.00

Total CPU Memory: 62.4902 GB

Operating System: Ubuntu 22.04.4 LTS


Linux user-Meteor-Lake-Client-Platform 6.7.1-060701-generic #202401201133 SMP PREEMPT_DYNAMIC Sat Jan 20 11:43:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

CLI: Version: 1.2.38.20240718 Build ID: 0db09695

Service: Version: 1.2.38.20240718 Build ID: 0db09695 Level Zero Version: 1.16.0

Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver UUID 32342e32-362e-3330-3034-392e36000000
Driver Version 24.26.30049.6

Driver related package version: ii intel-level-zero-gpu 1.3.30049.6 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.

igpu detected
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.26.30049.6]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics 1.3 [1.3.30049]

xpu-smi is properly installed.

+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) Graphics                                               |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0200-0000-00087d558086                                       |
|           | PCI BDF Address: 0000:00:02.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16M

00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
        DeviceName: To Be Filled by O.E.M.
        Subsystem: Intel Corporation Device 2212
        Flags: bus master, fast devsel, latency 0, IRQ 214
        Memory at 601a000000 (64-bit, prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=256M]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
        Capabilities:
        Kernel driver in use: i915

gc-fu commented 3 weeks ago

Hi, please try ipex-llm[xpu]==2.1.0.

There is a new feature in version 2.1.0b20240821 that breaks vLLM.

Also, the 7b model might be too big. Qwen2-1.5b-Instruct might be better.
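For reference, pinning the release usually looks like this (a sketch; the extra index URL follows the ipex-llm XPU install docs, so double-check it against the current documentation):

```bash
# Pin the suggested release of ipex-llm with XPU support
pip install --upgrade "ipex-llm[xpu]==2.1.0" \
    --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```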

wluo1007 commented 3 weeks ago

Hi, I tried Qwen2-1.5B-Instruct and chatglm3-6b; both worked. Qwen2-7B-Instruct got stuck when loading the model. I've tried Qwen2-7B before with ipex-llm (not with vLLM) and it worked fine. Does the size limit only occur in vLLM?

gc-fu commented 3 weeks ago

Hi, I tried Qwen2-1.5B-Instruct and chatglm3-6b; both worked. Qwen2-7B-Instruct got stuck when loading the model. I've tried Qwen2-7B before with ipex-llm (not with vLLM) and it worked fine. Does the size limit only occur in vLLM?

There is no size limit in vLLM. Currently, I am not very sure why Qwen2-7B-Instruct gets stuck. My guess is that it gets stuck while moving the model from CPU to GPU.