vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Gemma 2 cannot be run on V100 because FlashAttention is unsupported there; please support running it via the xformers backend #6173

Open warlockedward opened 3 weeks ago

warlockedward commented 3 weeks ago

Your current environment

The output of `python collect_env.py`

πŸ› Describe the bug

```
python3 -m vllm.entrypoints.openai.api_server --model /model/models/gemma-2-27b-it/ --dtype float16 --gpu-memory-utilization 0.98 --dtype float16 --port xxxxxx --tensor-parallel-size 8 --served-model-name gemma-2-27b --disable-custom-all-reduce --disable-sliding-window

[rank0]: Traceback (most recent call last):
[rank0]:   File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:   File "<frozen runpy>", line 88, in _run_code
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/entrypoints/openai/api_server.py", line 216, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 431, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 360, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 507, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 256, in __init__
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 353, in _initialize_kv_caches
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 38, in determine_num_available_blocks
[rank0]:     num_blocks = self._run_workers("determine_num_available_blocks", )
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 130, in _run_workers
[rank0]:     driver_worker_output = driver_worker_method(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/worker.py", line 173, in determine_num_available_blocks
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 874, in profile_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/worker/model_runner.py", line 1243, in execute_model
[rank0]:     hidden_or_intermediate_states = model_executable(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 336, in forward
[rank0]:     hidden_states = self.model(input_ids, positions, kv_caches,
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 277, in forward
[rank0]:     hidden_states, residual = layer(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 221, in forward
[rank0]:     hidden_states = self.self_attn(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/model_executor/models/gemma2.py", line 162, in forward
[rank0]:     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/attention/layer.py", line 94, in forward
[rank0]:     return self.impl.forward(query, key, value, kv_cache, attn_metadata,
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm/attention/backends/flashinfer.py", line 260, in forward
[rank0]:     output = flash_attn_varlen_func(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 1099, in flash_attn_varlen_func
[rank0]:     return FlashAttnVarlenFunc.apply(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/torch/autograd/function.py", line 598, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 596, in forward
[rank0]:     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
[rank0]:   File "/model/anaconda3/envs/vllm/lib/python3.11/site-packages/vllm_flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
[rank0]:     out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
[rank0]: RuntimeError: FlashAttention only supports Ampere GPUs or newer.
ERROR 07-06 08:57:19 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 72787 died, exit code: -15
INFO 07-06 08:57:19 multiproc_worker_utils.py:123] Killing local vLLM worker processes
/model/anaconda3/envs/vllm/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
```

LiuXiaoxuanPKU commented 2 weeks ago

You need FlashInfer to run Gemma 2. Unfortunately, FlashInfer currently only supports GPUs with compute capability >= 8.0 (https://developer.nvidia.com/cuda-gpus).
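For anyone checking whether their card is affected, here is a minimal check (assuming a CUDA build of PyTorch); a V100 reports compute capability 7.0, below the 8.0 (Ampere) floor that both the vllm-flash-attn kernels and FlashInfer require.

```python
import torch

# FlashAttention and FlashInfer both require compute capability >= 8.0.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("Pre-Ampere GPU (e.g. V100): these attention kernels are unsupported.")
```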

warlockedward commented 2 weeks ago

> You need FlashInfer to run Gemma 2. Unfortunately, FlashInfer currently only supports GPUs with compute capability >= 8.0 (https://developer.nvidia.com/cuda-gpus).

I looked at the FlashInfer source code and, as you said, it only supports GPUs with compute capability >= 8.0, which is disappointing to see.

ShinoharaHare commented 2 weeks ago

My understanding is that Gemma 2's attention includes an additional soft-capping operation, which currently seems to be implemented only in FlashInfer. However, it appears that skipping soft-capping does not significantly affect inference results. Would it be possible to add an option that lets users disable soft-capping, and consequently not require FlashInfer as the backend?
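For context, a rough sketch of what the soft-capping step looks like; the cap value of 50.0 is taken from the published Gemma 2 attention config and is an assumption for this sketch, not a reference to vLLM's implementation.

```python
import torch

def soft_cap(scores: torch.Tensor, cap: float = 50.0) -> torch.Tensor:
    """Squash attention scores smoothly into (-cap, cap) with tanh."""
    return cap * torch.tanh(scores / cap)

# Example: large raw scores are compressed rather than clipped, which is why
# the attention kernel has to apply the cap before softmax (FlashInfer does).
scores = torch.randn(4, 4) * 100.0
print(soft_cap(scores))
```

If this step were skipped, outputs would differ only to the extent that raw scores exceed the cap, which may be why the impact seems small in practice.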