turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Add flash attention to requirements.txt #367

Closed Anthonyg5005 closed 6 months ago

Anthonyg5005 commented 6 months ago

I'd send in a PR, but I wouldn't want to break anything. I was wondering why nothing was happening after a reinstall of Windows until I realized flash attention wasn't in the requirements. After installing flash attention from text-generation-webui/requirements.txt I was able to get it working and start converting. The only problem I see is having to update it on every flash attention release. Another solution could be just adding it to the README instructions.
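Something like this, presumably (the exact version is just an example, and it's the part that would need bumping on each release):

```text
# hypothetical addition to requirements.txt; version pin is only an example
flash-attn==2.5.6
```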

Love your work turbo.

turboderp commented 6 months ago

It's supposed to be able to run without flash-attn, which is why it isn't in the requirements. Could it be that something broke and I didn't notice?
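Roughly, flash-attn is meant to be an optional dependency, along these lines (a simplified sketch of the usual pattern, not the actual exllamav2 code):

```python
# Simplified sketch of the optional-dependency pattern, not the actual
# exllamav2 code: use flash-attn if it can be imported, otherwise fall
# back to the default attention path.
try:
    import flash_attn  # noqa: F401
    has_flash_attn = True
except ImportError:
    has_flash_attn = False

def attention_backend() -> str:
    # Prefer the flash-attn kernels when installed, otherwise use the
    # regular PyTorch attention implementation.
    return "flash-attn" if has_flash_attn else "torch"
```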

Anthonyg5005 commented 6 months ago

I'm not really sure. I might not have waited long enough; I gave it about 5 minutes before looking around for potential issues. To be more specific, I was quantizing a model fine-tuned on Mistral 7B v0.1. My setup: RTX 3060 12GB, driver version 551.61, CUDA version 12.4, Windows 11 build 22631, Python 3.11, Torch 2.2.1+cu121. It only started converting after installing flash attention. I also just installed Windows temporarily while I try to fix my main drive, and I installed VS 2019 and added the bin directory containing cl.exe to PATH.
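For what it's worth, here's a quick way to check what that environment actually exposes to Python (just a diagnostic snippet, not part of exllamav2):

```python
# Diagnostic snippet (not part of exllamav2): print the versions Python
# actually sees and whether flash-attn is importable at all.
import importlib.util

import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)
```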

Anthonyg5005 commented 6 months ago

I removed flash attention and it still works. The real issue was that I forgot to run `pip install .` after installing the requirements, so it had to build for a while before the conversion actually started.
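In case anyone else hits this, here's a quick way to confirm the package itself was actually installed with `pip install .` rather than only its requirements (a hypothetical sanity check; the exllamav2_ext name is my guess at how the compiled extension is packaged):

```python
# Hypothetical sanity check, not part of exllamav2: if `pip install .` was
# run, the package and its compiled extension should both be importable,
# so there is no lengthy build on first use.
import importlib.util

print("exllamav2 package:", importlib.util.find_spec("exllamav2") is not None)
# "exllamav2_ext" is an assumption about the compiled extension's module name.
print("compiled extension:", importlib.util.find_spec("exllamav2_ext") is not None)
```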