turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

ROCm Flash-Attention 2 #397

Open nktice opened 2 months ago

nktice commented 2 months ago

I have been informed that while Flash Attention is installed, it's not actually being used: https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-2031180332. That post links to a change that has helped some people, so I'll link it here as well: https://github.com/oobabooga/text-generation-webui/issues/3759#issuecomment-1889069311. Essentially they were adjusting version checks to get it to work. I've tried the same change and cannot get it working, so I thought I'd write and raise the issue here, in hopes that it may help others with the same problem.

AMD's version of Flash Attention 2 is 2.0.4 - do you have any insights into what needs to happen to get it to work?
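For anyone following along, a first sanity check is confirming which flash-attn build Python actually sees, since the ROCm builds lag behind the CUDA releases. A minimal sketch, assuming the upstream package layout (module name `flash_attn`); nothing here is exllamav2-specific:

```python
# Print the flash-attn version visible in the current environment.
try:
    import flash_attn
    ver = getattr(flash_attn, "__version__", None)
    if ver is None:
        # Fall back to package metadata if the module doesn't expose __version__.
        from importlib.metadata import version
        ver = version("flash-attn")
    print("flash-attn version:", ver)
except ImportError:
    print("flash-attn is not importable in this environment")
```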

turboderp commented 2 months ago

flash-attn introduced a crucial change in 2.1.0, without which it's really kind of useless for generating text. Before that, it only worked with k_len == q_len or q_len == 1, which rules out features like cache reuse, speculative decoding and chunked prefill. ExLlama used to have some workarounds, but they were problematic and mostly just ended up disabling flash-attn anyway.
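To illustrate that constraint (a sketch of the shape rule described above, not flash-attn's actual internals): pre-2.1.0 kernels effectively only covered the two cases below, which is exactly what cache reuse, speculative decoding and chunked prefill break.

```python
def supported_pre_2_1_0(q_len: int, k_len: int) -> bool:
    # Pre-2.1.0 flash-attn only handled full self-attention (q_len == k_len)
    # or single-token decoding against an existing cache (q_len == 1).
    return q_len == k_len or q_len == 1

print(supported_pre_2_1_0(512, 512))  # True  - plain prefill, no cache
print(supported_pre_2_1_0(1, 513))    # True  - one-token-at-a-time decoding
print(supported_pre_2_1_0(256, 768))  # False - chunked prefill over an existing cache
print(supported_pre_2_1_0(4, 516))    # False - speculative decoding (several draft tokens at once)
```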

So I would say supporting 2.0.4 is hard. 2.1.0 should be possible (although the check currently requires version 2.2.1).
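To make that last point concrete, the gate is just a version comparison. A sketch of what relaxing the floor would look like (the tuple parsing and the `MIN_FLASH_ATTN` name are illustrative, not exllamav2's actual code):

```python
import flash_attn

# Illustrative only: exllamav2's real check lives in its own source and may be
# structured differently. Lowering the floor from (2, 2, 1) to (2, 1, 0) is the
# kind of change being discussed above.
MIN_FLASH_ATTN = (2, 1, 0)

installed = tuple(int(p) for p in flash_attn.__version__.split(".")[:3])
has_flash_attn = installed >= MIN_FLASH_ATTN
print("flash-attn usable:", has_flash_attn)
```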