ehartford opened this issue 1 year ago
Do I need a newer version of CUDA, maybe? I'm on 11.7 per the guide: https://github.com/s4rduk4r/alpaca_lora_4bit_readme/blob/main/README.md
I'm using a 4090 with the latest NVIDIA drivers installed, and I'm using WSL2.
I would assume it's the WSL2 integration that isn't set up correctly; you probably need to run on Windows directly or on a Linux install.
I had this error when running on a T4; moving to an A100 solved the issue. But I assume the RTX 4090 supports sm_80, so I would blame the drivers.
It's related to this: https://github.com/HazyResearch/flash-attention/issues/138
If the problem is head size > 64, I wonder if there's a way to set the head size to 64.
But I don't see an argument for the number of heads. Do you think it would hurt to force it to 64?
It's hardcoded in the model; if you want fewer attention heads you will need to use the 13B or 7B versions...
Or wait for the upstream fix. May, they say. I guess I'm stuck with 13B for now. Thank you.
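For reference, the head size isn't a flag that finetune.py exposes; it falls out of the model config (hidden_size divided by num_attention_heads). A minimal sketch of checking it, assuming the Hugging Face transformers LlamaConfig and a local config.json in the quantized model directory (both assumptions on my part):

```python
# Sketch: compute the attention head size from a local LLaMA config.
# Assumes transformers is installed and ./llama-30b-4bit/config.json exists.
from transformers import LlamaConfig

config = LlamaConfig.from_pretrained("./llama-30b-4bit")
head_dim = config.hidden_size // config.num_attention_heads
print(f"num heads: {config.num_attention_heads}, head size: {head_dim}")
# For the LLaMA checkpoints this should report a head size of 128,
# which is above the 64 limit discussed in this thread for the
# flash-attention backward pass on non-A100 GPUs.
```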
I'm also on a 4090 and can confirm that it won't do flash attention.
It's not due to the number of heads, though; it's the head size, which is 128 for all three models. So you won't be using flash attention for backpropagation with LLaMA on your current GPU, at least with the current version.
It should still help for inference, though I haven't managed to get that working yet either. And the only implementation I've found doesn't support caching keys/values, which slows down generation far more than the potential speedup from flash attention.
Back of the bus 😭 Thanks for confirmation
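If it helps anyone else triage this, here's a rough check I'd use to see whether the backward pass is expected to work on a given card. It only reads the compute capability from torch and compares the head size against the limits discussed above, so treat it as a sketch rather than an authoritative test:

```python
# Rough sanity check: should flash attention's backward pass work on this GPU?
# The head-size limits here reflect what's discussed in this thread: backward
# for head dim > 64 currently requires sm_80 (A100-class) hardware.
import torch

major, minor = torch.cuda.get_device_capability(0)
head_dim = 128  # LLaMA 7B/13B/30B all use a head size of 128

print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
if head_dim <= 64:
    print("head dim <= 64: backward should work on recent GPUs")
elif (major, minor) == (8, 0):
    print("sm_80: backward supported up to head dim 128")
else:
    print("head dim > 64 on a non-sm_80 GPU: expect the is_sm80 RuntimeError on backward")
```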
https://github.com/pytorch/pytorch/issues/94883 It looks like it's fixed in PyTorch 2.0, so maybe if I update my CUDA and PyTorch it will work.
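In case it's useful for testing that, PyTorch 2.0's built-in scaled_dot_product_attention lets you force the flash kernel and see directly whether your GPU/head-size combination is accepted. A minimal probe, with placeholder tensor shapes (head_dim=128 just mirrors LLaMA), not the way this repo actually calls attention:

```python
# Probe PyTorch 2.0's fused SDPA: allow only the flash kernel and see if it runs.
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim); half precision on CUDA, placeholder sizes.
q = torch.randn(1, 32, 256, 128, device="cuda", dtype=torch.float16, requires_grad=True)
k = torch.randn_like(q)
v = torch.randn_like(q)

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    try:
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out.sum().backward()  # the backward pass is where the sm_80 check bites
        print("flash kernel ran for both forward and backward")
    except RuntimeError as e:
        print("flash kernel rejected:", e)
```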
Same error with a fresh environment (new user): torch==2.0.0, CUDA 11.7, Ubuntu.
Running the 30B safetensors (non-128g) from elinas.
Now I'm installing the latest PyTorch from GitHub to see if it makes a difference.
Took forever to install PyTorch, and now I'm getting:

import flash_attn_cuda
ImportError: /home/user/.local/lib/python3.9/site-packages/flash_attn_cuda.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN3c105ErrorC2ENS_14SourceLocationESs

This was built with CUDA 11.7. Now I am attempting to reinstall PyTorch, again from GitHub, but this time with CUDA 12.
Nice, let me know if that works and I'll follow in your footsteps.
GPT-4 says it's likely the extension (flash attention) that needs to be compiled against your installed version of PyTorch and CUDA.
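That would match the undefined-symbol error above: the flash_attn_cuda extension was compiled against a different libtorch ABI than the PyTorch now installed, so it needs to be rebuilt after changing PyTorch/CUDA. A quick diagnostic along those lines (just a sketch; it assumes flash_attn_cuda is importable once the rebuild matches your torch):

```python
# Diagnostic for the undefined-symbol ImportError: print what the environment
# actually has, then try importing the compiled extension. If the import fails,
# the usual fix is to uninstall flash-attn and rebuild it against the current torch.
import torch

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("C++11 ABI:", torch.compiled_with_cxx11_abi())

try:
    import flash_attn_cuda  # the compiled extension from flash-attention
    print("flash_attn_cuda imported OK")
except ImportError as e:
    print("extension/torch ABI mismatch; rebuild flash-attention:", e)
```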
Yep, no longer failing here.

Wow! This means we should be able to train Vicuna on a 4090. Can you please share your env? Versions of everything?

Everything updated to the last commit:

RuntimeError: Expected is_sm80 to be true, but got false. (Could this error message be improved? If so, please report an enhancement request to PyTorch.)
OK, then it still looks impossible due to the hardcoded check upstream, unless we patch flash attention to stop checking for sm_80.
Says this PR will be for PT 2.0.1. And I have no idea where to patch the check. I looked for the error and can't find it. All LLaMA models have a head size of 128, as stated above.
Instead of flash attention, why not try xformers or SDP?
https://github.com/oobabooga/text-generation-webui/pull/950/commits
Added xformers support using the code from the PR above. Tested it, and the results show it can slightly reduce VRAM usage.
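For anyone curious what that swap amounts to, the core of the xformers path is roughly the following. This is a sketch under my own assumptions, not the exact code in the PR, and the tensor shapes are purely illustrative:

```python
# Sketch of the xformers call used in place of flash attention.
# memory_efficient_attention expects (batch, seq_len, num_heads, head_dim) tensors.
import torch
import xformers.ops as xops

bsz, seq_len, num_heads, head_dim = 1, 256, 52, 128  # illustrative 30B-like shapes
q = torch.randn(bsz, seq_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal mask for autoregressive LM training; the memory-efficient kernels keep
# peak VRAM lower than the naive softmax(QK^T)V implementation.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
print(out.shape)  # (bsz, seq_len, num_heads, head_dim)
```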
They say flash attention should work on the 4090: https://github.com/HazyResearch/flash-attention/issues/138#issuecomment-1507942577 I'll redouble my efforts.
I am fine-tuning LLaMA 30B 4-bit with my custom dataset (alpaca_clean + leet10k). Then I tried to enable flash attention, using this command line:
python finetune.py --grad_chckpt --flash_attention True --groupsize 128 --cutoff_len 2048 --llama_q4_model ./llama-30b-4bit-128g.safetensors --llama_q4_config_dir ./llama-30b-4bit/ ./leet10k-alpaca-merged.json
I saw this error:
Any idea what I'm doing wrong? @yamashi