vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: benchmark_throughput gets TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt' with CPU #6225

Open LGLG42 opened 2 months ago

LGLG42 commented 2 months ago

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.0
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-91-generic-x86_64-with-glibc2.35

...

[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.42.3                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

πŸ› Describe the bug

Running benchmark_throughput.py ... with --device cpu throws an exception; the same run works on GPU.

VLLM_CPU_KVCACHE_SPACE=16 time python benchmarks/benchmark_throughput.py --model mosaicml/mpt-7b --input-len 128 --output-len 512 --trust-remote-code --backend=vllm  --device cpu --dtype bfloat16
...
WARNING 07-08 21:21:47 cpu_executor.py:119] CUDA graph is not supported on CPU, fallback to the eager mode.
INFO 07-08 21:21:48 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 21:21:48 selector.py:53] Using XFormers backend.
INFO 07-08 21:21:49 selector.py:191] Cannot use FlashAttention-2 backend because the vllm_flash_attn package is not found. `pip install vllm-flash-attn` for better performance.
INFO 07-08 21:21:49 selector.py:53] Using XFormers backend.
INFO 07-08 21:21:50 weight_utils.py:218] Using model weights format ['*.bin']
pytorch_model-00002-of-00002.bin: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3.36G/3.36G [00:32<00:00, 103MB/s]
pytorch_model-00001-of-00002.bin: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 9.94G/9.94G [01:18<00:00, 126MB/s]
INFO 07-08 21:23:44 cpu_executor.py:72] # CPU blocks: 2048
Processed prompts:   0%|                                                                                                  | 0/1000 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][rank0]: Traceback (most recent call last):
[rank0]:   File "/nfs_users/users/luis.lorenzo/code/vllm/benchmarks/benchmark_throughput.py", line 439, in <module>
[rank0]:     main(args)
[rank0]:   File "/nfs_users/users/luis.lorenzo/code/vllm/benchmarks/benchmark_throughput.py", line 227, in main
[rank0]:     elapsed_time = run_vllm(
[rank0]:                    ^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/code/vllm/benchmarks/benchmark_throughput.py", line 127, in run_vllm
[rank0]:     llm.generate(prompts, sampling_params, use_tqdm=True)
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/utils.py", line 795, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 309, in generate
[rank0]:     outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/entrypoints/llm.py", line 561, in _run_engine
[rank0]:     step_outputs = self.llm_engine.step()
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 861, in step
[rank0]:     output = self.model_executor.execute_model(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/executor/cpu_executor.py", line 78, in execute_model
[rank0]:     output = self.driver_worker.execute_model(execute_model_req)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 235, in execute_model
[rank0]:     self.model_runner.prepare_model_input(
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/worker/cpu_model_runner.py", line 327, in prepare_model_input
[rank0]:     ) = self._prepare_prompt(seq_group_metadata_list)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/worker/cpu_model_runner.py", line 202, in _prepare_prompt
[rank0]:     attn_metadata = self.attn_backend.make_metadata(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/nfs_users/users/luis.lorenzo/anaconda3/envs/vllm_311/lib/python3.11/site-packages/vllm/attention/backends/abstract.py", line 29, in make_metadata
[rank0]:     return cls.get_metadata_cls()(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt'
Processed prompts:   0%|                                                                                                  | 0/1000 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Command exited with non-zero status 1

Originally I got this error with FlashAttention as the attention backend. I then tried different versions of vllm and flash-attn, and eventually uninstalled flash-attn entirely, but hit the same issue. That error was: TypeError: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'

I've also tried other vllm-supported models and pulled the latest main to get an updated benchmark_throughput.py; the issue persists in both cases.

My main guess at the moment is that cpu_model_runner.py and model_runner.py have diverged in how they call attn_metadata = self.attn_backend.make_metadata(...): somewhere along the way the is_prompt kwarg was removed on the GPU path but not on the CPU path. I've looked at the code a bit, but it does not seem to be a trivial fix, so I'll leave it to someone with more experience/time to look into.
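A minimal sketch of the failure mode (the class and function below are simplified stand-ins I made up, not the actual vllm definitions): make_metadata forwards its kwargs straight into the metadata dataclass constructor, so any kwarg the dataclass no longer declares blows up with exactly this TypeError.

from dataclasses import dataclass

@dataclass
class MetadataAfterRefactor:  # stand-in for XFormersMetadata / FlashAttentionMetadata
    num_prefills: int
    num_decode_tokens: int
    # no is_prompt field here any more

def make_metadata(**kwargs):  # stand-in for AttentionBackend.make_metadata
    return MetadataAfterRefactor(**kwargs)

# What the CPU path still does (simplified): keep passing is_prompt through.
make_metadata(num_prefills=1, num_decode_tokens=0, is_prompt=True)
# -> TypeError: __init__() got an unexpected keyword argument 'is_prompt'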

husimplicity commented 1 month ago

Encountered the same issue for CPU when running VLLM_CPU_KVCACHE_SPACE=48 python offline_inference.py with the model Qwen2-72B: [rank0]: TypeError: FlashAttentionMetadata.__init__() got an unexpected keyword argument 'is_prompt'

dnesting-usa commented 1 month ago

I'm not positive but I believe this was introduced in https://github.com/vllm-project/vllm/commit/65bf2ac165734fb6339210c4b2b8ce68d2391b77#diff-f81ae2354d4f46c8591a009289e97ab465771d5593feb114539f7cdc58486663L159

It looks like some refactoring happened here that removed is_prompt but missed an occurrence in cpu_model_runner.py. I found this issue after seeing CPU inference fail in recent versions of vllm. Reverting to 0.4.2 (before this change) may fix the issue.
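For anyone who wants to try that workaround, pinning the release in a pip-managed environment would look something like:

pip install vllm==0.4.2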

cc @rkooo567 as the author of that change

anencore94 commented 2 weeks ago

I've encountered the same issue when I run the following command with a GPU-built vllm.

VLLM_CPU_KVCACHE_SPACE=16 python benchmarks/benchmark_throughput.py --model mosaicml/mpt-7b --input-len 128 --output-len 512 --trust-remote-code --backend=vllm  --device cpu --dtype bfloat16

My vllm environment was built from source from the main branch in editable mode (pip install -e .).

Here's what I've found.

Problem: Mismatch Between vllm Package and Device Type in Backend Selection
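As a rough illustration of what I mean (just a sanity check on my reading of the mismatch, not vllm's own device-detection logic): with the default GPU-targeted build installed, passing --device cpu still reaches the GPU attention-backend selection path, which matches the "Using XFormers backend" lines in the original report rather than the CPU (Torch SDPA) backend.

import torch
import vllm

# Quick check of what is actually installed before running the benchmark.
# Assumption: this is the default GPU-targeted build, so the attention
# selector can still pick a GPU backend even when --device cpu is passed.
print("vllm version:", vllm.__version__)
print("CUDA available to torch:", torch.cuda.is_available())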

Proposed Solution

WDYT @WoosukKwon ?

khayamgondal commented 21 hours ago

Facing the same issue on aarch64 GH200. @anencore94 did you find a solution? I just want to run benchmark_throughput with cpu