raffenet opened this issue 4 months ago
Also, if I hack the bad return value to be what I think is expected, I run into this backtrace later in the execution.
Traceback (most recent call last):
File "/home/raffenet/proj/vllm/examples/offline_inference.py", line 17, in <module>
outputs = llm.generate(prompts, sampling_params)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/utils.py", line 838, in inner
return fn(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/entrypoints/llm.py", line 316, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/entrypoints/llm.py", line 569, in _run_engine
step_outputs = self.llm_engine.step()
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 911, in step
output = self.model_executor.execute_model(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/executor/distributed_gpu_executor.py", line 70, in execute_model
self.parallel_worker_tasks = self._run_workers(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/executor/ray_xpu_executor.py", line 312, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/worker/worker_base.py", line 383, in execute_method
raise e
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/worker/worker_base.py", line 374, in execute_method
return executor(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
TypeError: WorkerBase.start_worker_execution_loop() got an unexpected keyword argument 'async_run_tensor_parallel_workers_only'
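(For context, this kind of TypeError simply means the driver-side call forwards a keyword argument that the installed worker method's signature does not declare, i.e. the executor and worker code are out of sync. A minimal, hypothetical sketch of the failure mode, not the actual vLLM classes:)

```python
# Hypothetical sketch of the failure mode (simplified names, not the real vLLM classes):
# the executor forwards a keyword argument that the installed worker method does not
# declare, so Python raises TypeError before any work happens.
class Worker:
    def start_worker_execution_loop(self):   # older signature: no extra kwargs
        return "looping"

def execute_method(worker, method, *args, **kwargs):
    executor = getattr(worker, method)
    return executor(*args, **kwargs)          # kwargs are passed through unchanged

try:
    # A newer caller adds a keyword the older signature does not know about:
    execute_method(Worker(), "start_worker_execution_loop",
                   async_run_tensor_parallel_workers_only=True)
except TypeError as e:
    print(e)  # ... got an unexpected keyword argument 'async_run_tensor_parallel_workers_only'
```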
@jikunshang are these issues addressed in https://github.com/vllm-project/vllm/pull/5685?
Yes, I have fixed the tensor parallel support issue, please try this PR.
I have tested it on my system and it does indeed work with tp>1. Thanks! I hope it can be merged and made available in a future release.
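(For anyone reproducing this: a minimal offline-inference sketch along the lines of the examples/offline_inference.py script referenced above, with tensor_parallel_size=2. The model name and prompts are placeholders.)

```python
from vllm import LLM, SamplingParams

# Placeholder prompts and model; tensor_parallel_size=2 shards the model
# across two devices, which is the configuration verified above.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```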
@jikunshang another bit of info. Running llama-2-7b with tensor parallel 2 and 4 works on my system, but on the same system trying to run llama-3-8b with tp=2 results in this error. Is there anything I should try?
Traceback (most recent call last):
File "/home/raffenet/proj/ipex-vllm/benchmark-scripts/offline_inference.py", line 87, in <module>
llm = LLM(model=args.model,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/entrypoints/llm.py", line 156, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 444, in from_engine_args
engine = cls(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 264, in __init__
self._initialize_kv_caches()
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 363, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/executor/ray_gpu_executor.py", line 371, in _run_workers
self.driver_worker.execute_method(method, *driver_args,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/worker_base.py", line 382, in execute_method
raise e
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/worker_base.py", line 373, in execute_method
return executor(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/xpu_worker.py", line 129, in determine_num_available_blocks
self.model_runner.profile_run()
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 223, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 375, in execute_model
hidden_or_intermediate_states = model_executable(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 420, in forward
model_output = self.model(input_ids, positions, kv_caches,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 320, in forward
hidden_states, residual = layer(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 243, in forward
hidden_states = self.self_attn(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 172, in forward
q, k = self.rotary_emb(positions, q, k)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/custom_op.py", line 13, in forward
return self._forward_method(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/layers/rotary_embedding.py", line 243, in forward_xpu
ops.rotary_embedding(positions, query, key, self.head_size,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/_ipex_ops.py", line 158, in rotary_embedding
ipex.llm.functional.rotary_embedding(query_rot, key_rot, sin, cos,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 47, in rotary_embedding
return RotaryEmbedding.apply_function(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 119, in apply_function
query, key = runtime_module.rotary_embedding(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 79, in rotary_embedding
torch.ops.torch_ipex.apply_rotary_embedding_half_qk(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 2
@raffenet Thanks for your evaluation. The current IPEX kernel does not support GQA yet, which is widely used in recent models like llama-3-8b and llama-2-70b (I am not certain about llama-2-13b). We have verified GQA functionality with an internal version of IPEX. The next IPEX release will fix this issue and should be out by the end of this month.
CC @jgong5 @rogerxfeng8
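(Background on the shape error above: with grouped-query attention the key/value tensors carry fewer heads than the query tensor, so a fused kernel that assumes equal head counts hits a broadcast mismatch. A rough, self-contained illustration follows; the head counts are made up and this is not the IPEX kernel.)

```python
import torch

# Illustrative only: in grouped-query attention (GQA) the key/value tensors have
# fewer heads than the query tensor. An op that combines them head-wise without
# expanding the KV heads fails with a size-mismatch RuntimeError.
num_tokens, head_dim = 4, 128
num_q_heads, num_kv_heads = 32, 8        # e.g. 4 query heads share each KV head

q = torch.randn(num_tokens, num_q_heads, head_dim)
k = torch.randn(num_tokens, num_kv_heads, head_dim)

try:
    _ = q + k                            # (4, 32, 128) vs (4, 8, 128): head counts differ
except RuntimeError as e:
    print(e)                             # "The size of tensor a (32) must match the size of tensor b (8) ..."

# What a GQA-aware kernel effectively does: expand the KV heads to match the queries.
k_expanded = k.repeat_interleave(num_q_heads // num_kv_heads, dim=1)
print((q + k_expanded).shape)            # torch.Size([4, 32, 128])
```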
GQA support for vLLM will be available in the upcoming 2.3.110 IPEX release.
Thanks! I look forward to trying it.
Hi, when I run vllm-xpu with Qwen2, I hit the same GQA error.
File "/workspace/vllm/vllm/_ipex_ops.py", line 158, in rotary_embedding
ipex.llm.functional.rotary_embedding(query_rot, key_rot, sin, cos,
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 47, in rotary_embedding
return RotaryEmbedding.apply_function(
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 119, in apply_function
query, key = runtime_module.rotary_embedding(
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 79, in rotary_embedding
torch.ops.torch_ipex.apply_rotary_embedding_half_qk(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (2) must match the size of tensor b (14) at non-singleton dimension 2
I am wondering when the 2.3.110 IPEX release will be available.
IPEX 2.3 has been released and the related code change is merged into the vLLM main branch, so the issues in this thread should be resolved. Please give it a try, thanks!
I don't want to hijack this issue, but I'm facing the same issue as in the title. Running this on tag v0.6.2 but using the current main branch Dockerfile.xpu to build (to make sure IPEX 2.3 is being used) is still failing on my end. The build works, but the following command fails in a docker compose:
services:
  vllm-server:
    build:
      context: ./vllm
      dockerfile: Dockerfile.xpu
    container_name: vllm-server
    ports:
      - "8000:8000"
    restart: unless-stopped
    devices:
      - /dev/dri:/dev/dri
    volumes:
      - /dev/dri/by-path:/dev/dri/by-path
    network_mode: host
    ipc: host
    shm_size: 17179869184 # 16 GB
    command: --model Qwen/Qwen2.5-72B-Instruct --device xpu --tensor-parallel-size 2
The error I'm getting:
INFO 10-02 23:12:43 api_server.py:164] Multiprocessing frontend to use ipc:///tmp/e1bb2060-ef4c-401c-b46f-661022722fe5 for IPC Path.
INFO 10-02 23:12:43 api_server.py:177] Started engine process with PID 76
INFO 10-02 23:12:44 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-02 23:12:44 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
WARNING 10-02 23:12:44 _custom_ops.py:18] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
Process SpawnProcess-1:
INFO 10-02 23:12:45 config.py:899] Defaulting to use mp for distributed inference
WARNING 10-02 23:12:45 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
ERROR 10-02 23:12:45 llm_engine.py:530] Both start methods (spawn and fork) have issue on XPU if you use mp backend, Please try ray instead.
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/vllm/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
File "/workspace/vllm/vllm/engine/multiprocessing/engine.py", line 136, in from_engine_args
executor_class = LLMEngine._get_executor_cls(engine_config)
File "/workspace/vllm/vllm/engine/llm_engine.py", line 550, in _get_executor_cls
return executor_class
UnboundLocalError: local variable 'executor_class' referenced before assignment
Snippet of pip freeze from within the image:
pip3 freeze | grep intel
intel-cmplr-lib-rt==2024.2.1
intel-cmplr-lib-ur==2024.2.1
intel-cmplr-lic-rt==2024.2.1
intel-opencl-rt==2024.2.1
intel-openmp==2024.2.1
intel-sycl-rt==2024.2.1
intel_extension_for_pytorch==2.3.110+xpu
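(A quick sanity check inside the container, assuming the /dev/dri devices are exposed correctly; torch.xpu here is provided by the IPEX XPU build.)

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the XPU backend with torch

# Sanity-check sketch: confirm the IPEX version and that the GPUs are visible,
# since --tensor-parallel-size 2 needs two XPU devices inside the container.
print(ipex.__version__)           # expected: 2.3.110+xpu
print(torch.xpu.is_available())   # should be True
print(torch.xpu.device_count())   # should be >= 2 for tp=2
```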
Thanks for reporting this, please try this PR: https://github.com/vllm-project/vllm/pull/8884. Or just set distributed_executor_backend to ray; the default value is mp, which is not supported.
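(For reference, a sketch of that workaround via the Python API; the equivalent server flag is --distributed-executor-backend ray. The model name is taken from the compose file above, and the keyword names come from vLLM's engine arguments.)

```python
from vllm import LLM

# Sketch of the suggested workaround: force the Ray executor instead of the
# default multiprocessing backend, which is not supported on XPU.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    device="xpu",
    tensor_parallel_size=2,
    distributed_executor_backend="ray",
)
```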
Your current environment
Intel GPU info from sycl-ls
🐛 Describe the bug
The offline_inference.py example crashes with tensor_parallel_size=2. (example output)
Manually printing the values being concatenated when the error occurs: