Following the official docs to set up vLLM for cu118 through pip leads to an error about xFormers not being compatible with cu118 when serving.
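For reference, the install path followed was roughly the cu118 section of the docs; the version numbers and wheel filename below are my assumptions for illustration, not copied verbatim from the docs.

# Install the vLLM wheel built for CUDA 11.8 (versions assumed for illustration).
export VLLM_VERSION=0.2.6
export PYTHON_VERSION=39
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl

# Re-install PyTorch built for CUDA 11.8, since the default wheel targets cu121.
pip uninstall -y torch
pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118

Starting the API server then fails with: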
INFO 12-23 02:43:27 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.2+cu121 with CUDA 1201 (you have 2.1.2+cu118)
Python 3.9.18 (you have 3.9.18)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
MegaBlocks not found. Please install it by `pip install megablocks`.
STK not found: please see https://github.com/stanford-futuredata/stk
Traceback (most recent call last):
File "/opt/conda/envs/vllm_39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/vllm_39/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
return engine_class(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 112, in __init__
self._init_cache()
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 208, in _init_cache
num_blocks = self._run_workers(
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
self._run_workers_in_batch(workers, method, *args, **kwargs))
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
output = executor(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/worker/worker.py", line 88, in profile_num_available_blocks
self.model_runner.profile_run()
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 321, in profile_run
self.execute_model(seqs, kv_caches)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 279, in execute_model
hidden_states = self.model(
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 313, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 287, in forward
return self.decoder(input_ids, positions, kv_caches, input_metadata,
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 259, in forward
hidden_states = layer(hidden_states, kv_caches[i], input_metadata,
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 164, in forward
hidden_states = self.self_attn(hidden_states=hidden_states,
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 106, in forward
attn_output = self.attn(q, k, v, key_cache, value_cache,
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/layers/attention.py", line 154, in forward
out = xops.memory_efficient_attention_forward(
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/__init__.py", line 244, in memory_efficient_attention_forward
return _memory_efficient_attention_forward(
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/__init__.py", line 337, in _memory_efficient_attention_forward
op = _dispatch_fw(inp, False)
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 120, in _dispatch_fw
return _run_priority_list(
File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 63, in _run_priority_list
raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
query : shape=(1, 2048, 12, 64) (torch.float16)
key : shape=(1, 2048, 12, 64) (torch.float16)
value : shape=(1, 2048, 12, 64) (torch.float16)
attn_bias : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
p : 0.0
`decoderF` is not supported because:
xFormers wasn't build with CUDA support
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
operator wasn't built - see `python -m xformers.info` for more info
`flshattF@0.0.0` is not supported because:
xFormers wasn't build with CUDA support
operator wasn't built - see `python -m xformers.info` for more info
`tritonflashattF` is not supported because:
xFormers wasn't build with CUDA support
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
operator wasn't built - see `python -m xformers.info` for more info
triton is not available
Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
xFormers wasn't build with CUDA support
operator wasn't built - see `python -m xformers.info` for more info
`smallkF` is not supported because:
max(query.shape[-1] != value.shape[-1]) > 32
xFormers wasn't build with CUDA support
dtype=torch.float16 (supported: {torch.float32})
attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
has custom scale
operator wasn't built - see `python -m xformers.info` for more info
unsupported embed per head: 64
Focusing on the main part of the error:
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.2+cu121 with CUDA 1201 (you have 2.1.2+cu118)
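This shows that the xFormers wheel pulled in by pip was compiled against CUDA 12.1, while the installed PyTorch is the CUDA 11.8 build. The mismatch can be confirmed with a quick check, following the hint in the error output:

# Print the PyTorch version and the CUDA version it was built with.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Show which xFormers operators were built and for which CUDA version.
python -m xformers.info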
I was able to fix this by additionally reinstalling xFormers with a build that matches cu118.
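A minimal sketch of that reinstall, assuming the standard route of pulling xFormers from the PyTorch cu118 wheel index so the wheel matches torch 2.1.2+cu118 (the exact version pin may differ):

# Remove the cu121-built xFormers wheel that pip installs by default.
pip uninstall -y xformers
# Reinstall xFormers from the CUDA 11.8 wheel index to match torch 2.1.2+cu118.
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118

After this, `python -m xformers.info` should report the CUDA-built operators as available.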
I am raising a PR to update the documentation accordingly.