oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

RuntimeError: FlashAttention only supports Ampere GPUs or newer. #5985

Open linzm1007 opened 6 months ago

linzm1007 commented 6 months ago

Describe the bug


python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code

model: Qwen-7B-Chat

Generation fails with `RuntimeError: FlashAttention only supports Ampere GPUs or newer.` (full traceback in the Logs section below).

Is there an existing issue for this?

Reproduction

gpu: V100-PCIE-32GB
python: 3.10
model: Qwen-7B-Chat
docker:
docker run -it --rm --gpus='"device=0,3"' -v /root/wangbing/model/Qwen-7B-Chat/V1/:/data/mlops/modelDir -v /root/wangbing/sftmodel/qwen/V1:/data/mlops/adapterDir/ -p30901:5000 -p7901:7860 dggecr01.huawei.com:80/tbox/text-generation-webui:at-0.0.1 bash
Then, inside the container (from /app):
python server.py --auto-devices --gpu-memory 80 80 --listen --api --model-dir /data/mlops/ --model modelDir --trust-remote-code

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/app/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/app/modules/text_generation.py", line 390, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1259, in generate
    return super().generate(
  File "/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
    return self.sample(
  File "/venv/lib/python3.10/site-packages/transformers/generation/utils.py", line 2696, in sample
    outputs = self(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 891, in forward
    outputs = block(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 610, in forward
    attn_outputs = self.attn(
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 499, in forward
    attn_output = self.core_attention_flash(q, k, v, attention_mask=attention_mask)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/modelDir/modeling_qwen.py", line 191, in forward
    output = flash_attn_func(q, k, v, dropout_p, softmax_scale=self.softmax_scale, causal=self.causal)
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 825, in flash_attn_func
    return FlashAttnFunc.apply(
  File "/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 507, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_forward(
  File "/venv/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 51, in _flash_attn_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.
Output generated in 2.02 seconds (0.00 tokens/s, 0 tokens, context 60, seed 1536745076)
06:02:32-584051 INFO     Deleted "logs/chat/Assistant/20240506-04-07-12.json".

System Info

![image](https://github.com/oobabooga/text-generation-webui/assets/96732179/3b7f46e8-d59b-4d56-bfb3-95c7ecd73887)
![image](https://github.com/oobabooga/text-generation-webui/assets/96732179/4232979d-e104-4117-bca5-15edd7787617)
nickpotafiy commented 6 months ago

Tesla V100 uses the Volta architecture. In order, it goes Volta < Turing < Ampere < Ada < Hopper. FlashAttention 2 currently supports Ampere, Ada, and Hopper (FlashAttention 1.x also covers Turing). Turing support for FlashAttention 2 is being worked on, and maybe Volta after that. 🤞
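
For reference, a quick way to confirm what flash-attn will see is to check the CUDA compute capability; flash-attn 2 wants SM 8.0 or newer, and a V100 reports SM 7.0. A minimal check with PyTorch (already present in the webui environment):

```python
import torch

# flash-attn 2 requires compute capability (SM) >= 8.0, i.e. Ampere or newer.
# A V100 (Volta) reports SM 7.0, which is why the forward kernel refuses to run.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    ok = (major, minor) >= (8, 0)
    print(f"GPU {i}: {name} (SM {major}.{minor}) -> "
          f"{'supported' if ok else 'NOT supported'} by flash-attn 2")
```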

In the meantime, load the model without flash attention.
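
For a remote-code model like Qwen-7B-Chat, one way to do that outside the UI is to flip the model's own `use_flash_attn` config flag before loading. A rough sketch with Transformers, assuming your checkpoint's config exposes that key (the official Qwen-7B-Chat release does, but double-check your copy):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_dir = "/data/mlops/modelDir"  # path from the report above

# Qwen's custom modeling code reads `use_flash_attn` from its config;
# forcing it to False should make it fall back to plain PyTorch attention.
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)
config.use_flash_attn = False

tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    config=config,
    device_map="auto",
    trust_remote_code=True,
)
```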

ghts commented 5 months ago

llama.cpp added an FP16 FlashAttention vector kernel, so older GPUs that lack tensor cores can run flash attention. https://github.com/ggerganov/llama.cpp/pull/7061

llama.cpp then added an FP32 FlashAttention vector kernel, so even Pascal GPUs (which lack fast FP16) can now run flash attention. https://github.com/ggerganov/llama.cpp/pull/7188

Once these changes are pulled into text-generation-webui, FlashAttention can be supported on non-NVIDIA GPUs (including Apple Silicon) and older pre-Ampere NVIDIA GPUs.
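
As a rough illustration of what that path looks like once the updated llama.cpp is available, here is a sketch with llama-cpp-python, assuming a build recent enough to expose the `flash_attn` flag and a hypothetical GGUF conversion of the same model:

```python
from llama_cpp import Llama

# Hypothetical GGUF conversion of the same model; any recent quant would do.
llm = Llama(
    model_path="/data/mlops/qwen-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=True,   # llama.cpp's own FA kernels, not the flash-attn wheel
)
out = llm("Hello, how are you?", max_tokens=32)
print(out["choices"][0]["text"])
```

This goes through llama.cpp's kernels rather than the flash-attn CUDA package, so the Ampere-only check in the traceback above never runs.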

the242 commented 4 months ago

In the meantime, load the model without flash attention.

I can't find a way to load without flash attention:

(screenshot)

A month or two ago I tried deleting the flash_attn directory from the venv and even editing some code to skip the check.

NatRusso commented 4 days ago

I'm having the same issue right now with an error message calling out FlashAttention, and ChatGPT recommended the same thing: run the server without FlashAttention. But for the life of me, I can't find a way to actually do that.
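
One workaround that should apply to Qwen-style checkpoints like the one in this thread (not verified against every loader) is to edit the model's config.json so the custom modeling code never takes the flash-attn path; the webui's Transformers loader reads that file from the model directory. A small sketch, assuming the `use_flash_attn` key is present (it is in the official Qwen-7B-Chat files):

```python
import json
from pathlib import Path

config_path = Path("/data/mlops/modelDir/config.json")  # adjust to your model dir
config = json.loads(config_path.read_text())

# Qwen's remote code checks this key; "auto"/true tries flash-attn, false skips it.
config["use_flash_attn"] = False
config_path.write_text(json.dumps(config, indent=2))
print("use_flash_attn set to false; reload the model in the web UI.")
```

After the edit, reload the model from the Model tab; the custom attention module should fall back to the standard PyTorch path.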