wluo1007 opened this issue 3 weeks ago:
We haven't tested it on an MTL iGPU before. I tried to reproduce it but encountered a different error. You could try running it in Docker following this Docker guide.
Tried Docker, still got an error. Do you have plans for iGPU support?
root@user-Meteor-Lake-Client-Platform:/llm# python vllm_offline_inference.py
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-20 12:50:38,388 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-20 12:50:38 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-20 12:50:38 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/llm/Qwen2-7B-Instruct', tokenizer='/llm/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-20 12:50:38 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-20 12:50:46,567 - INFO - Converting the current model to sym_int4 format......
2024-08-20 12:50:46,568 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
[2024-08-20 12:50:46,802] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to xpu (auto detect)
2024-08-20 12:50:51,297 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-20 12:50:53 model_convert.py:249] Loading model weights took 4.5222 GB
error: Traceback (most recent call last):
File "/llm/vllm_offline_inference.py", line 48, in <module>
Hi, I have verified that vLLM works on an iGPU with the chatglm3-6b model on Linux, and it does not hit the problem you mentioned in the first thread. The vLLM build we provided does have a known issue related to Qwen2-7B-Instruct, but it should not produce the error shown in the first thread.
Can you provide the result of https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/scripts/env-check.sh?
Besides, we will check whether we can fix the issue related to the Qwen2 series models.
Hi, thanks for the reply. The previous environment is no longer available, so I installed a fresh one and tried both chatglm3-6b and qwen2-7b-instruct; I got the same error message as below.
(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/vllm$ python offline_inference.py
/home/user/vllm_ipex_env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/user/vllm_ipex_env/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-22 16:28:32,793 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-22 16:28:32 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-22 16:28:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/user/Qwen2-7B-Instruct', tokenizer='/home/user/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-22 16:28:33 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-22 16:28:34,887 - INFO - Converting the current model to sym_int4 format......
2024-08-22 16:28:34,888 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-08-22 16:28:38,493 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-22 16:28:41 model_convert.py:257] Loading model weights took 4.5222 GB
Traceback (most recent call last):
File "/home/user/vllm/offline_inference.py", line 48, in <module>
Below is the result of running env-check.sh:
Operating System: Ubuntu 22.04.4 LTS
CLI: Version: 1.2.38.20240718 Build ID: 0db09695
00:02.0 VGA compatible controller: Intel Corporation Device 7d55 (rev 08) (prog-if 00 [VGA controller])
DeviceName: To Be Filled by O.E.M.
Subsystem: Intel Corporation Device 2212
Flags: bus master, fast devsel, latency 0, IRQ 214
Memory at 601a000000 (64-bit, prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
Capabilities:
Hi, please try ipex-llm[xpu]==2.1.0. A new feature in version 2.1.0b20240821 breaks vLLM. Also, the 7B model might be too big; Qwen2-1.5B-Instruct might work better.
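Before re-running, it may be worth double-checking which ipex-llm build pip actually resolved. Below is a minimal stdlib sketch for that check; the PyPI distribution name `ipex-llm` is an assumption on my part, so adjust it if your install differs.

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(dist: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None

# Distribution name `ipex-llm` is assumed; prints None if it is not installed.
print("ipex-llm:", installed_version("ipex-llm"))
```

If this prints a 2.1.0b2024* nightly rather than 2.1.0, pip picked up a pre-release and the pin did not take effect.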
Hi, I tried Qwen2-1.5B-Instruct and chatglm3-6b, and both worked. Qwen2-7B-Instruct got stuck when loading the model. I've run Qwen2-7B before with ipex-llm (without vLLM) and it worked fine. Does the size limit only occur in vLLM?
There is no size limit in vLLM. Currently, I am not sure why Qwen2-7B-Instruct gets stuck; my guess is that it stalls while moving the model from CPU to GPU.
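For a rough sense of scale (my own back-of-envelope figures, not from the thread): sym_int4 stores about 4 bits per weight, so the weights alone for a roughly 7.6B-parameter model come to about 3.8 GB, which is consistent with the 4.52 GB reported at load time once unquantized layers and overhead are included. On an iGPU that shares system memory, that plus the KV cache can be tight. A quick sketch, with the parameter counts assumed:

```python
# Back-of-envelope sizing for sym_int4 weights (4 bits per parameter).
# Parameter counts below are approximate assumptions, not from the thread.
def sym_int4_weight_gb(n_params: float) -> float:
    bits_per_param = 4
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

qwen2_7b = sym_int4_weight_gb(7.6e9)    # Qwen2-7B-Instruct, ~7.6B params
qwen2_1_5b = sym_int4_weight_gb(1.5e9)  # Qwen2-1.5B-Instruct, ~1.5B params

print(f"Qwen2-7B   sym_int4 weights: ~{qwen2_7b:.1f} GB")   # ~3.8 GB
print(f"Qwen2-1.5B sym_int4 weights: ~{qwen2_1_5b:.1f} GB")
```

This would explain why the 1.5B model loads comfortably while the 7B model sits right at the edge of what a shared-memory iGPU can hold alongside a 32768-token KV cache.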
Platform: Core ultra 7 165H iGPU
Model: Qwen/Qwen2-7B-Instruct
Following the steps in https://testbigdldocshane.readthedocs.io/en/perf-docs/doc/LLM/Quickstart/vLLM_quickstart.html#
when running python offline_inference.py, the following error occurs:
(vllm_ipex_env) user@user-Meteor-Lake-Client-Platform:~/vllm$ python offline_inference.py
/home/user/vllm_ipex_env/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/home/user/vllm_ipex_env/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '' If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
  warn(
2024-08-19 13:40:32,185 - INFO - intel_extension_for_pytorch auto imported
WARNING 08-19 13:40:32 config.py:710] Casting torch.bfloat16 to torch.float16.
INFO 08-19 13:40:32 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='/home/user/Qwen2-7B-Instruct', tokenizer='/home/user/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=xpu, seed=0, max_num_batched_tokens=32768, max_num_seqs=256, max_model_len=32768)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 08-19 13:40:32 attention.py:71] flash_attn is not found. Using xformers backend.
2024-08-19 13:40:34,255 - INFO - Converting the current model to sym_int4 format......
2024-08-19 13:40:34,255 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
2024-08-19 13:40:38,071 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
INFO 08-19 13:40:40 model_convert.py:249] Loading model weights took 4.5222 GB
LLVM ERROR: Diag: aborted
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [Intel(R) Core(TM) Ultra 7 165H]
Registry and code: 13 MB
Command: python offline_inference.py
Uptime: 19.974422 s
Aborted (core dumped)
I've also tried the whole process on a Data Center GPU Flex dGPU, which works fine, so I wonder whether this issue only occurs on iGPUs.