johnsmith0031 / alpaca_lora_4bit


using flash attention, RuntimeError: Expected is_sm80 to be true, but got false. #62

Open ehartford opened 1 year ago

ehartford commented 1 year ago

I am fine-tuning LLaMA 30B 4-bit with my custom dataset (alpaca_clean + leet10k). I tried to enable flash attention using this command line:

python finetune.py --grad_chckpt --flash_attention True --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ ./leet10k-alpaca-merged.json

I saw this error:

Traceback (most recent call last):
  File "/home/eric/git/alpaca_lora_4bit/finetune.py", line 156, in <module>
    trainer.train()
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/transformers/trainer.py", line 2709, in training_step
    self.scaler.scale(loss).backward()
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 77, in backward
    _flash_attn_backward(
  File "/home/eric/miniconda3/envs/newft/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 42, in _flash_attn_backward
    _, _, _, softmax_d = flash_attn_cuda.bwd(
RuntimeError: Expected is_sm80 to be true, but got false. 

Any idea what I'm doing wrong? @yamashi

ehartford commented 1 year ago

Do I need a newer version of CUDA, maybe? I'm on 11.7 per the guide: https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md

ehartford commented 1 year ago

I'm using a 4090 with the latest NVIDIA drivers installed, and I'm running under WSL2.
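
For reference, a quick way to check what compute capability PyTorch reports for the card (just a sketch):

import torch

# An RTX 4090 reports (8, 9) -> sm_89; an A100 reports (8, 0) -> sm_80,
# which is what the flash-attn backward kernel is asserting on here.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")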

maximegmd commented 1 year ago

I would assume it's the WSL2 integration that isn't set up correctly; you probably need to run on Windows directly or on a native Linux install.

I had this error when running on a T4, and moving to an A100 solved the issue, but I assume the RTX 4090 supports sm80, so I would blame drivers.

ehartford commented 1 year ago

It's related to this: https://github.com/HazyResearch/flash-attention/issues/138

ehartford commented 1 year ago

https://github.com/HazyResearch/flash-attention/issues/138#issuecomment-1466354480

ehartford commented 1 year ago

If the problem is head size > 64, I wonder if there's a way to set the head size to 64.

ehartford commented 1 year ago

https://github.com/johnsmith0031/alpaca_lora_4bit/blob/f91d4cbb593b097f5dfb60866a04e90044414da6/monkeypatch/llama_flash_attn_monkey_patch.py#L23

But I don't see an argument for the number of heads. Do you think it would hurt to force it to 64?

maximegmd commented 1 year ago

It's hardcoded in the model; if you want fewer attention heads you will need to use the 13B or 7B versions...

ehartford commented 1 year ago

Or wait for the upstream fix. May, they say. I guess I'm stuck with 13B for now. Thank you.

turboderp commented 1 year ago

I'm also on a 4090 and can confirm that it won't do flash attention.

It's not due to the number of heads, though; it's the head size, which is 128 for all three models. So you won't be using flash attention for backpropagation with LLaMA on your current GPU, at least with the current version.
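
You can sanity-check that from the published model dimensions; head_dim is just hidden_size / num_attention_heads (a rough sketch, config numbers taken from the released LLaMA checkpoints):

# Published LLaMA dimensions; head_dim = hidden_size / num_attention_heads.
llama_configs = {
    "7b":  {"hidden_size": 4096, "num_attention_heads": 32},
    "13b": {"hidden_size": 5120, "num_attention_heads": 40},
    "30b": {"hidden_size": 6656, "num_attention_heads": 52},
}

for name, cfg in llama_configs.items():
    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    print(f"llama-{name}: head_dim = {head_dim}")  # prints 128 for every size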

It should still help for inference, though I haven't managed to get that working yet either. And the only implementation I've found doesn't support caching keys/values, which slows down generation far more than the potential speedup from flash attention.

ehartford commented 1 year ago

Back of the bus 😭 Thanks for confirmation

ehartford commented 1 year ago

https://github.com/pytorch/pytorch/issues/94883 It looks like it's fixed in PyTorch 2.0, so maybe if I update my CUDA and PyTorch it will work.

juanps90 commented 1 year ago

> pytorch/pytorch#94883 It looks like it's fixed in PyTorch 2.0, so maybe if I update my CUDA and PyTorch it will work.

Same error with a fresh environment (new user): torch==2.0.0, CUDA 11.7, Ubuntu.

Running the 30B safetensors (non 128g) from elinas.

Now I'm installing the latest PyTorch from GitHub to see if it makes a difference.

juanps90 commented 1 year ago

Took forever to install PyTorch; now I'm getting:

import flash_attn_cuda
ImportError: /home/user/.local/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs

This was built with CUDA 11.7. Now I am attempting to reinstall PyTorch, again from GitHub, but this time with CUDA 12.

ehartford commented 1 year ago

Nice, let me know if that works and I'll follow in your footsteps.

ehartford commented 1 year ago

GPT-4 says it's likely the extension (flash attention) that needs to be compiled against your installed versions of PyTorch and CUDA.
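
Something like this should at least show which PyTorch and CUDA build is actually installed, i.e. what the extension would need to be compiled against (just a sketch):

import torch

# The flash_attn extension has to be built against the same PyTorch/CUDA
# combination that is present at runtime, otherwise you get undefined-symbol
# errors like the one above.
print(torch.__version__)         # e.g. 2.0.0
print(torch.version.cuda)        # e.g. 11.7
print(torch.cuda.is_available())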


juanps90 commented 1 year ago

Everything updated to the latest commit:

RuntimeError: Expected is_sm80 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)

ehartford commented 1 year ago

> Yep, no longer failing here.

Wow! This means we should be able to train Vicuna on a 4090. Can you please share your env? Versions of everything?

ehartford commented 1 year ago

> Everything updated to the latest commit:
>
> RuntimeError: Expected is_sm80 to be true, but got false.

OK, then it looks like it's still impossible due to the hardcoded check upstream, unless we patch flash attention to stop checking for sm_80.

Ph0rk0z commented 1 year ago

It says this PR will be for PyTorch 2.0.1. And I have no idea where to patch the check; I looked for the error and can't find it. All LLaMA models have a head size of 128, as stated.

Instead of flash attention, why not try xformers or SDP?

https://github.com/oobabooga/text-generation-webui/pull/950/commits
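
For the SDP route, PyTorch 2.0's built-in scaled_dot_product_attention picks whatever fused kernel the GPU supports and falls back to the math path otherwise; a minimal sketch with LLaMA-30B-shaped tensors (assumed shapes, not the actual patch from that PR):

import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim): 52 heads of size 128, as in LLaMA 30B.
q = torch.randn(1, 52, 2048, 128, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Dispatches to flash / memory-efficient / math kernels depending on the GPU.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 52, 2048, 128])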

johnsmith0031 commented 1 year ago

Added xformers support using the code from the PR above. Tested it, and the results show that it can slightly reduce VRAM usage.
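
Roughly, the xformers path looks like this (a sketch only, with assumed LLaMA-30B shapes; the real code is the patch from the PR above):

import torch
from xformers.ops import memory_efficient_attention

# xformers expects (batch, seq_len, heads, head_dim) and can handle
# head_dim 128 on consumer GPUs, trading some speed for lower peak VRAM.
q = torch.randn(1, 2048, 52, 128, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = memory_efficient_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2048, 52, 128])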

ehartford commented 1 year ago

They say flash attention should work on the 4090: https://github.com/HazyResearch/flash-attention/issues/138#issuecomment-1507942577. I'll redouble my efforts.