oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

RuntimeError: FlashAttention only supports Ampere GPUs or newer. #5985

Open linzm1007 opened 4 months ago

linzm1007 commented 4 months ago

Describe the bug


python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code

model: Qwen-7B-Chat

When generating, the request fails with a FlashAttention error; the full traceback is reproduced in the Logs section below.

Is there an existing issue for this?

Reproduction

gpu: V100-PCIE-32GB
python: 3.10
model: Qwen-7B-Chat

docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash

# inside the container (working directory /app):
python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/app/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/app/modules/text_generation.py", line 390, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
    outputs = self(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 891, in forward
    outputs = block(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 610, in forward
    attn_outputs = self.attn(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 499, in forward
    attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 191, in forward
    output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Output generated in 2.02 seconds (0.00 tokens/s, 0 tokens, context 60, seed 1536745076)
06:02:32-584051 INFO     Deleted "logs/chat/Assistant/20240506-04-07-12.json".

System Info

![image](https://github.com/oobabooga/text-generation-webui/assets/96732179/3b7f46e8-d59b-4d56-bfb3-95c7ecd73887)
![image](https://github.com/oobabooga/text-generation-webui/assets/96732179/4232979d-e104-4117-bca5-15edd7787617)
nickpotafiy commented 4 months ago

Tesla V100 uses the Volta architecture. By compute capability the order is Volta < Turing < Ampere < Ada < Hopper. As of now, FlashAttention 2 supports Ampere, Ada, and Hopper (Turing is supported by FlashAttention 1.x). Turing support for FlashAttention 2 is being worked on, and maybe Volta after that. 🤞

In the meantime, load the model without flash attention.
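A minimal sketch of what that can look like outside the webui, assuming the `use_flash_attn` flag that Qwen-7B-Chat's own remote modeling code reads from its config (the flag name comes from Qwen's repository, not from text-generation-webui; adjust if your checkpoint differs):

```python
# Sketch: load Qwen-7B-Chat without FlashAttention on a pre-Ampere GPU.
# Assumption: the checkpoint's remote code honors config.use_flash_attn.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/mlops/modelDir"  # model path from the report above

config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
config.use_flash_attn = False  # force the plain attention path instead of flash_attn

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    config=config,
    device_map="auto",
    trust_remote_code=True,
)
```

Disabling the flag trades some speed and memory for compatibility, but it avoids the `flash_attn_cuda.fwd` call that raises the error on Volta.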

ghts commented 3 months ago

llama.cpp added an FP16 FlashAttention vector kernel, so older GPUs that lack tensor cores can run flash attention: https://github.com/ggerganov/llama.cpp/pull/7061

llama.cpp also added an FP32 FlashAttention vector kernel, so even Pascal GPUs (which have poor FP16 performance) can now run flash attention: https://github.com/ggerganov/llama.cpp/pull/7188

Once these changes are picked up by text-generation-webui, FlashAttention can be supported on non-NVIDIA GPUs (including Apple Silicon) and on old pre-Ampere NVIDIA GPUs. A rough sketch of that path is shown below.
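For a GGUF build of the model, the llama.cpp route can already be exercised through llama-cpp-python. A minimal sketch, assuming a converted GGUF file and the `flash_attn` option exposed by recent llama-cpp-python releases (the file path below is hypothetical):

```python
# Sketch: run a GGUF conversion of the model with llama.cpp's FlashAttention
# kernels enabled. Assumes a llama-cpp-python version new enough to expose
# the flash_attn option (added alongside the PRs linked above).
from llama_cpp import Llama

llm = Llama(
    model_path="/data/mlops/qwen-7b-chat.Q4_K_M.gguf",  # hypothetical GGUF file
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=True,   # use llama.cpp's FlashAttention kernels
)

out = llm("Hello, how are you?", max_tokens=32)
print(out["choices"][0]["text"])
```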

the242 commented 1 month ago

> In the meantime, load the model without flash attention.

I can't find a way to load without flash attention:

screenshot

A month or two ago I tried deleting the flash_attn directory from the venv and even editing some code to skip the check.
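Since the webui's loader UI doesn't expose a toggle for trust-remote-code models, one hypothetical workaround (not an official webui option) is to turn the flag off directly in the checkpoint's config.json before loading, assuming the model's remote code reads `use_flash_attn` the way Qwen-7B-Chat does:

```python
# Sketch: patch the model's config.json so the webui loads it without
# FlashAttention. Assumption: the checkpoint honors a use_flash_attn key.
import json
from pathlib import Path

config_path = Path("/data/mlops/modelDir/config.json")  # model path from the report above

config = json.loads(config_path.read_text())
config["use_flash_attn"] = False  # hypothetical for other models; present in Qwen-7B-Chat's config
config_path.write_text(json.dumps(config, indent=2))
```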