linzm1007 opened 4 months ago
Tesla V100 uses the Volta architecture. By compute capability the order is Volta < Turing < Ampere < Ada < Hopper. As of now, flash attention 2 supports Ampere, Ada, and Hopper (flash attention 1.x also supports Turing). Turing support is being worked on for flash attention 2, and maybe Volta after that. 🤞
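The support matrix above can be sketched in terms of compute capability (sm) numbers. This is an illustrative summary, not a flash-attn API: `ARCH_SM` and `supported_by_flash_attn` are hypothetical names, and the cutoffs (1.x needs sm75+, 2.x needs sm80+) are a simplification of the real build constraints.

```python
# Illustrative support sketch: NVIDIA architectures mapped to their
# compute capability (sm) numbers, per NVIDIA's CUDA documentation.
ARCH_SM = {
    "Volta": 70,   # e.g. Tesla V100
    "Turing": 75,  # e.g. T4, RTX 20xx
    "Ampere": 80,  # e.g. A100, RTX 30xx
    "Ada": 89,     # e.g. RTX 40xx
    "Hopper": 90,  # e.g. H100
}

def supported_by_flash_attn(arch: str, major_version: int = 2) -> bool:
    """Rough rule of thumb: flash-attn 1.x runs on sm75+, 2.x needs sm80+."""
    sm = ARCH_SM[arch]
    return sm >= (80 if major_version >= 2 else 75)

print(supported_by_flash_attn("Volta"))      # False -> hence the error below
print(supported_by_flash_attn("Turing", 1))  # True  -> flash-attn 1.x covers Turing
```

This is why a V100 (sm70) fails the check in either flash-attn release line.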
In the meantime, load the model without flash attention.
llama.cpp added an FP16 FlashAttention vector kernel, so older GPUs that lack tensor cores can run flash attention: https://github.com/ggerganov/llama.cpp/pull/7061
llama.cpp then added an FP32 FlashAttention vector kernel, so even Pascal GPUs (which lack fast FP16) can now run flash attention: https://github.com/ggerganov/llama.cpp/pull/7188
Once these changes are pulled into text-generation-webui, FlashAttention can be supported on non-NVIDIA GPUs (including Apple Silicon) and on old pre-Ampere NVIDIA GPUs.
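The reason those vector kernels work without tensor cores is that FlashAttention is fundamentally an algorithm, not a hardware feature: it computes softmax(QK^T/sqrt(d))V tile by tile with a running max and running sum, never materializing the full attention matrix. Below is a minimal NumPy sketch of that online-softmax accumulation for a single query; `streaming_attention` is an illustrative name, and this is plain FP32-style math, not the actual llama.cpp kernel.

```python
import numpy as np

def streaming_attention(q, K, V, tile=4):
    """Accumulate softmax(q @ K.T / sqrt(d)) @ V one tile of keys at a time."""
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    m = -np.inf                        # running max of the scores
    s = 0.0                            # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])         # running weighted sum of V rows
    for start in range(0, K.shape[0], tile):
        k_t, v_t = K[start:start + tile], V[start:start + tile]
        scores = k_t @ q * scale
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)  # rescale previous accumulator
        p = np.exp(scores - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ v_t
        m = m_new
    return acc / s

# Sanity check against the naive (full-matrix) computation:
rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
q = rng.standard_normal(8)
w = np.exp(K @ q / np.sqrt(8)); w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```

Every operation here is an elementwise exp, a dot product, or a scale, which maps directly onto a vector kernel on GPUs without tensor cores.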
> In the meantime, load the model without flash attention.
I can't find a way to load without flash attention:
A month or two ago I tried deleting the flash_attn directory from the venv, and even editing some code to skip the check.
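Deleting the flash_attn directory only helps if the model's loading code guards its import and falls back to standard attention when the package is absent. A minimal sketch of that guarded-import pattern is below; `pick_attention_backend` is an illustrative name, not a text-generation-webui or Qwen option, and whether a given model's remote code actually has such a fallback depends on that model's modeling file.

```python
import importlib.util

def pick_attention_backend(prefer_flash: bool = True) -> str:
    """Return "flash_attn" only when it is both requested and importable;
    otherwise fall back to the plain PyTorch attention path ("eager")."""
    if prefer_flash and importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "eager"
```

If the model code instead imports flash_attn unconditionally (or re-installs it as a dependency), removing the directory just moves the failure from a CUDA check to an ImportError, which matches the experience described above.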
Describe the bug

```
python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
```

Model: Qwen-7B-Chat
```
Traceback (most recent call last):
  File "/app/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/app/modules/text_generation.py", line 390, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
    outputs = self(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 891, in forward
    outputs = block(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 610, in forward
    attn_outputs = self.attn(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 499, in forward
    attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 191, in forward
    output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
```

Output generated in 2.02 seconds (0.00 tokens/s, 0 tokens, context 60, seed 1536745076)
06:02:32-584051 INFO Deleted "logs/chat/Assistant/20240506-04-07-12.json".
Is there an existing issue for this?
Reproduction
GPU: V100-PCIE-32GB
Python: 3.10
Model: Qwen-7B-Chat

```
docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash
python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code
```
Screenshot
No response
Logs
System Info