vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

xFormers compatibility error while installing vLLM for cu118 through pip by following the installation docs #2245

Closed · skt7 closed 10 months ago

skt7 commented 10 months ago

Following the official docs to set up vLLM for cu118 through pip leads to the error below when serving: the xFormers wheel that pip pulls in is built for cu121 and is therefore not compatible with cu118.
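For reference, the server was launched roughly as follows (reconstructed from the traceback, so treat the exact flags as an assumption; the model name is taken from the log):

python -m vllm.entrypoints.api_server --model facebook/opt-125m

Engine initialization then fails with: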

INFO 12-23 02:43:27 llm_engine.py:73] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.2+cu121 with CUDA 1201 (you have 2.1.2+cu118)
    Python  3.9.18 (you have 3.9.18)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
MegaBlocks not found. Please install it by `pip install megablocks`.
STK not found: please see https://github.com/stanford-futuredata/stk
Traceback (most recent call last):
  File "/opt/conda/envs/vllm_39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/vllm_39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 495, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 208, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/worker/worker.py", line 88, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 321, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 279, in execute_model
    hidden_states = self.model(
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 313, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 287, in forward
    return self.decoder(input_ids, positions, kv_caches, input_metadata,
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 259, in forward
    hidden_states = layer(hidden_states, kv_caches[i], input_metadata,
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 164, in forward
    hidden_states = self.self_attn(hidden_states=hidden_states,
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/models/opt.py", line 106, in forward
    attn_output = self.attn(q, k, v, key_cache, value_cache,
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/vllm/model_executor/layers/attention.py", line 154, in forward
    out = xops.memory_efficient_attention_forward(
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/__init__.py", line 244, in memory_efficient_attention_forward
    return _memory_efficient_attention_forward(
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/__init__.py", line 337, in _memory_efficient_attention_forward
    op = _dispatch_fw(inp, False)
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 120, in _dispatch_fw
    return _run_priority_list(
  File "/opt/conda/envs/vllm_39/lib/python3.9/site-packages/xformers/ops/fmha/dispatch.py", line 63, in _run_priority_list
    raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 2048, 12, 64) (torch.float16)
     key         : shape=(1, 2048, 12, 64) (torch.float16)
     value       : shape=(1, 2048, 12, 64) (torch.float16)
     attn_bias   : <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
     p           : 0.0
`decoderF` is not supported because:
    xFormers wasn't build with CUDA support
    attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
    operator wasn't built - see `python -m xformers.info` for more info
`flshattF@0.0.0` is not supported because:
    xFormers wasn't build with CUDA support
    operator wasn't built - see `python -m xformers.info` for more info
`tritonflashattF` is not supported because:
    xFormers wasn't build with CUDA support
    attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
    operator wasn't built - see `python -m xformers.info` for more info
    triton is not available
    Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
    xFormers wasn't build with CUDA support
    operator wasn't built - see `python -m xformers.info` for more info
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    xFormers wasn't build with CUDA support
    dtype=torch.float16 (supported: {torch.float32})
    attn_bias type is <class 'xformers.ops.fmha.attn_bias.BlockDiagonalCausalMask'>
    has custom scale
    operator wasn't built - see `python -m xformers.info` for more info
    unsupported embed per head: 64

The key part of the error is the warning at the top:

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.2+cu121 with CUDA 1201 (you have 2.1.2+cu118)
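The mismatch is easy to confirm from the environment itself (a minimal check; the values in the comments are what this log reports):

import torch
print(torch.__version__)   # 2.1.2+cu118 -> PyTorch built against CUDA 11.8
print(torch.version.cuda)  # 11.8

Running python -m xformers.info, as the warning suggests, likewise shows which CUDA toolkit xFormers was compiled against (12.1 here), so its C++/CUDA extensions refuse to load.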

I was able to fix this by additionally running:

pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118
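To verify the fix (a quick check, not part of the original report):

python -m xformers.info

should now report a build matching PyTorch 2.1.2+cu118, with operators such as cutlassF and flshattF listed as available instead of unbuilt.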

I am raising a PR to update the documentation accordingly.

skt7 commented 10 months ago

Raised the PR here: #2246