turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

expose xformers? #364

Closed Ph0rk0z closed 2 weeks ago

Ph0rk0z commented 3 months ago

Tested out your xformers attention implementation. On a 2080 Ti 22G I'm fitting 1000+ more tokens on Nous-Capybara. xformers supports the P100 when compiled from source, so it would probably help those with non-flash-attention cards versus having nothing. Haven't tried SDP yet, but I'm guessing it does worse? I only tried Q8 cache; maybe more will fit with Q4.

edit:

I was able to test both SDP and xformers, but I was only watching for OOM, not the outputs. For some reason I can't get the model to produce coherent text; it's probably due to having to reshape the tensors. xformers was faster.
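
For reference, a minimal sketch of the layout mismatch that makes the reshape easy to get wrong, assuming projected q/k/v already in the [batch, heads, seq, head_dim] layout that torch SDPA uses (the shapes are illustrative, not exllamav2's internal layout):

```python
import torch
import torch.nn.functional as F
import xformers.ops as xops

# Illustrative shapes only
batch, n_heads, seq_len, head_dim = 1, 32, 128, 128
q = torch.randn(batch, n_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# torch SDPA consumes [batch, heads, seq, head_dim] directly
out_sdp = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# xformers expects [batch, seq, heads, head_dim], hence the transposes in and out
out_xf = xops.memory_efficient_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
    attn_bias=xops.LowerTriangularMask(),
).transpose(1, 2)
```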

brucethemoose commented 3 months ago

Are you referencing this block here?

https://github.com/turboderp/exllamav2/blob/0fbd108e519047f33a91a7daddb46dee6be5e2c9/exllamav2/attn.py#L480

Ph0rk0z commented 3 months ago

Yeah. I transposed K and V to get it to run inference, but no go, just gibberish. Same for SDP. I must be missing something. xformers also seemed to make inference faster; with SDP it was the same. These functions expect q, k and v to all be the same size, whereas flash attention takes care of that itself.
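
A hedged sketch of what "the same size" means here: with grouped-query attention the K/V heads have to be repeated to match the query head count before calling SDP or xformers, while flash attention accepts the smaller K/V head count as-is. The head counts below are examples, not the model's real configuration:

```python
import torch
import torch.nn.functional as F

# Example GQA configuration (hypothetical numbers)
n_q_heads, n_kv_heads, seq_len, head_dim = 32, 8, 256, 128
q = torch.randn(1, n_q_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(1, n_kv_heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

# flash-attn handles n_kv_heads < n_q_heads natively; SDP/xformers (without
# native GQA support) need each K/V head repeated to match the query heads.
repeats = n_q_heads // n_kv_heads
k_exp = k.repeat_interleave(repeats, dim=1)
v_exp = v.repeat_interleave(repeats, dim=1)

out = F.scaled_dot_product_attention(q, k_exp, v_exp, is_causal=True)
```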

brucethemoose commented 3 months ago

Any thoughts on flash infer?

https://github.com/flashinfer-ai/flashinfer

Looks like it has the same expectation, judging by the example.

Ph0rk0z commented 3 months ago

Doesn't it have its own kernels? I don't know if it can be shoehorned in, but I didn't look too hard. The benefit of xformers is that it works on non-Ampere cards.

turboderp commented 2 weeks ago

xformers support has been added. For what it's worth. (:

Ph0rk0z commented 2 weeks ago

The latest git xformers changed where you import the mask from, so that will be coming up.

It's xformers.ops.fmha.attn_bias now. Found out when I rebuilt it for torch 2.3.1.
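
A version-tolerant import sketch for that change; the newer path is the one named above, and the plain xformers.ops re-export is assumed here as the fallback for older builds:

```python
try:
    # Newer xformers: masks live under xformers.ops.fmha.attn_bias
    from xformers.ops.fmha.attn_bias import LowerTriangularMask
except ImportError:
    # Assumed fallback for older xformers builds
    from xformers.ops import LowerTriangularMask
```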