raffenet opened 1 month ago
Also, if I hack the bad return value to be what I think it expected, I run into this backtrace later in the execution.
Traceback (most recent call last):
File "/home/raffenet/proj/vllm/examples/offline_inference.py", line 17, in <module>
outputs = llm.generate(prompts, sampling_params)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/utils.py", line 838, in inner
return fn(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/entrypoints/llm.py", line 316, in generate
outputs = self._run_engine(use_tqdm=use_tqdm)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/entrypoints/llm.py", line 569, in _run_engine
step_outputs = self.llm_engine.step()
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 911, in step
output = self.model_executor.execute_model(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/executor/distributed_gpu_executor.py", line 70, in execute_model
self.parallel_worker_tasks = self._run_workers(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/executor/ray_xpu_executor.py", line 312, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/worker/worker_base.py", line 383, in execute_method
raise e
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.3+xpu-py3.10.egg/vllm/worker/worker_base.py", line 374, in execute_method
return executor(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
TypeError: WorkerBase.start_worker_execution_loop() got an unexpected keyword argument 'async_run_tensor_parallel_workers_only'
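This TypeError is the classic shape of signature drift between caller and callee: a newer executor forwards a keyword argument that an older worker method does not accept. A minimal reduction of the failure mode (all class and method names here are hypothetical stand-ins, not vLLM's actual implementation):

```python
# Hypothetical reduction of the failure: the executor forwards a new
# keyword argument, but the worker method's signature predates it.

class OldWorker:
    def start_worker_execution_loop(self):  # accepts no keyword arguments
        return "looping"

class Executor:
    def __init__(self, worker):
        self.worker = worker

    def _run_workers(self, method, **kwargs):
        # Forwards kwargs blindly; an unknown keyword raises TypeError
        # inside the worker call, just like the backtrace above.
        return getattr(self.worker, method)(**kwargs)

def call_with_new_kwarg(executor):
    try:
        executor._run_workers(
            "start_worker_execution_loop",
            async_run_tensor_parallel_workers_only=True,
        )
        return None
    except TypeError as exc:
        return type(exc).__name__
```

The fix on the vLLM side is to keep the worker and executor signatures in sync (or have the worker accept and ignore unknown keyword arguments).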
@jikunshang are these issues addressed in https://github.com/vllm-project/vllm/pull/5685?
Yes, I have fixed the tensor parallel support issue; please try this PR.
I have tested it on my system and it does indeed work with tp>1. Thanks! I hope it can be merged and made available in a future release.
@jikunshang another bit of info: running llama-2-7b with tensor parallel 2 and 4 works on my system, but on the same system, running llama-3-8b with tp=2 results in this error. Is there anything I should try?
Traceback (most recent call last):
File "/home/raffenet/proj/ipex-vllm/benchmark-scripts/offline_inference.py", line 87, in <module>
llm = LLM(model=args.model,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/entrypoints/llm.py", line 156, in __init__
self.llm_engine = LLMEngine.from_engine_args(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 444, in from_engine_args
engine = cls(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 264, in __init__
self._initialize_kv_caches()
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/engine/llm_engine.py", line 363, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/executor/ray_gpu_executor.py", line 371, in _run_workers
self.driver_worker.execute_method(method, *driver_args,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/worker_base.py", line 382, in execute_method
raise e
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/worker_base.py", line 373, in execute_method
return executor(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/xpu_worker.py", line 129, in determine_num_available_blocks
self.model_runner.profile_run()
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 223, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/worker/xpu_model_runner.py", line 375, in execute_model
hidden_or_intermediate_states = model_executable(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 420, in forward
model_output = self.model(input_ids, positions, kv_caches,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 320, in forward
hidden_states, residual = layer(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 243, in forward
hidden_states = self.self_attn(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/models/llama.py", line 172, in forward
q, k = self.rotary_emb(positions, q, k)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/custom_op.py", line 13, in forward
return self._forward_method(*args, **kwargs)
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/model_executor/layers/rotary_embedding.py", line 243, in forward_xpu
ops.rotary_embedding(positions, query, key, self.head_size,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/vllm-0.5.2+xpu-py3.10.egg/vllm/_ipex_ops.py", line 158, in rotary_embedding
ipex.llm.functional.rotary_embedding(query_rot, key_rot, sin, cos,
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 47, in rotary_embedding
return RotaryEmbedding.apply_function(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 119, in apply_function
query, key = runtime_module.rotary_embedding(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 79, in rotary_embedding
torch.ops.torch_ipex.apply_rotary_embedding_half_qk(
File "/home/raffenet/.conda/envs/vllm/lib/python3.10/site-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (2) must match the size of tensor b (8) at non-singleton dimension 2
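The mismatched sizes line up with GQA head counts: llama-2-7b uses full multi-head attention (equal query and key/value head counts), while llama-3-8b uses grouped-query attention with fewer KV heads, which a rotary kernel written for MHA cannot broadcast against. A back-of-the-envelope check (the head counts are the models' published configs; the even per-rank split is an assumption about how tensor parallelism divides heads):

```python
# Per-rank head counts under tensor parallelism, assuming heads are split
# evenly across ranks. An MHA-only fused kernel effectively requires the
# query and key tensors to agree on the head dimension.

def heads_per_rank(num_q_heads, num_kv_heads, tp):
    return num_q_heads // tp, num_kv_heads // tp

def mha_kernel_compatible(num_q_heads, num_kv_heads, tp):
    q, kv = heads_per_rank(num_q_heads, num_kv_heads, tp)
    return q == kv

# llama-2-7b: 32 query / 32 KV heads (MHA) -> compatible at tp=2
# llama-3-8b: 32 query /  8 KV heads (GQA) -> head-dim mismatch at tp=2
```

This is why llama-2-7b survives tp=2 and tp=4 on the same hardware while GQA models fail in the rotary embedding call.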
@raffenet Thanks for your evaluation. The current ipex kernel does not support GQA yet, which is widely used in the latest models like llama-3-8b and llama-2-70b (I am not certain about llama-2-13b). We have verified GQA functionality with an internal version of ipex. The next ipex release will fix this issue and should be out by the end of this month.
CC @jgong5 @rogerxfeng8
GQA support for vLLM will be available in coming 2.3.110 IPEX release.
Thanks! I look forward to trying it.
Hi, when I run vllm-xpu with Qwen2, I hit the same GQA error.
File "/workspace/vllm/vllm/_ipex_ops.py", line 158, in rotary_embedding
ipex.llm.functional.rotary_embedding(query_rot, key_rot, sin, cos,
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/llm/functional/fusions.py", line 47, in rotary_embedding
return RotaryEmbedding.apply_function(
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/llm/modules/mha_fusion.py", line 119, in apply_function
query, key = runtime_module.rotary_embedding(
File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_pytorch/transformers/models/xpu/fusions/mha_fusion.py", line 79, in rotary_embedding
torch.ops.torch_ipex.apply_rotary_embedding_half_qk(
File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (2) must match the size of tensor b (14) at non-singleton dimension 2
I am wondering when IPEX 2.3.110 will be released.
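Until then, one way to tell whether an installed IPEX is new enough is to compare its version against 2.3.110, the release the maintainers say will carry GQA support. The sketch below is stdlib-only; the distribution name used for the lookup is an assumption about how the package is registered:

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = (2, 3, 110)  # release expected to add GQA support for vLLM

def parse_version(v):
    # Keep the leading numeric components only,
    # e.g. "2.3.110+xpu" -> (2, 3, 110)
    nums = []
    for part in v.split("+")[0].split("."):
        if part.isdigit():
            nums.append(int(part))
        else:
            break
    return tuple(nums)

def has_gqa_support(installed_version):
    return parse_version(installed_version) >= REQUIRED

try:
    print(has_gqa_support(version("intel-extension-for-pytorch")))
except PackageNotFoundError:
    print("IPEX is not installed")
```

Tuple comparison handles the common local-version suffixes (e.g. "+xpu") without pulling in a third-party version parser.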
Your current environment

Intel GPU info from sycl-ls: (output omitted)

🐛 Describe the bug

The offline_inference.py example crashes with tensor_parallel_size=2. (example output omitted)

Manually printing the values being concatenated when the error occurs: (values omitted)