turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

integrate xformers #452

Closed · laoda513 closed this 1 month ago

laoda513 commented 1 month ago

Integrate xformers memory_efficient_attention. This could be beneficial if your device's architecture is below sm_80. On sm_80 and above, xformers.memory_efficient_attention and flash_attn are almost equally efficient, but xformers does not expand the KV heads automatically; we have to do that manually, and the extra matrix operation makes this implementation much slower.
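
For reference, a minimal sketch of the manual KV expansion described above, assuming xformers' [batch, seq_len, heads, head_dim] layout and equal q/k lengths (prefill); the names `xformers_attn` and `n_rep` are illustrative, not the PR's actual code:

```python
import torch
import xformers.ops as xops

def xformers_attn(q, k, v, causal=True):
    # q: [batch, seq, n_q_heads, head_dim]; k, v: [batch, seq, n_kv_heads, head_dim]
    n_rep = q.shape[2] // k.shape[2]
    if n_rep > 1:
        # xformers does not broadcast grouped KV heads, so repeat them to match
        # the query head count; this copy is the extra matrix work mentioned above.
        k = k.repeat_interleave(n_rep, dim=2)
        v = v.repeat_interleave(n_rep, dim=2)
    attn_bias = xops.LowerTriangularMask() if causal else None
    return xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
```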

So the selection logic is: try flash_attn first, then fall back to xformers, and finally to Torch matmul attention.
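
A rough sketch of that fallback order (not the actual exllamav2 code), probing each backend at import time; `attn_forward` is a hypothetical name, and the tensors are assumed to already have matching head counts and equal q/k lengths:

```python
import torch

try:
    from flash_attn import flash_attn_func
    HAS_FLASH_ATTN = True
except ImportError:
    HAS_FLASH_ATTN = False

try:
    import xformers.ops as xops
    HAS_XFORMERS = True
except ImportError:
    HAS_XFORMERS = False

def attn_forward(q, k, v):
    # q, k, v: [batch, seq, heads, head_dim]
    if HAS_FLASH_ATTN:
        return flash_attn_func(q, k, v, causal=True)
    if HAS_XFORMERS:
        return xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())
    # Plain Torch matmul attention as the last resort
    q_, k_, v_ = (x.transpose(1, 2) for x in (q, k, v))    # -> [batch, heads, seq, head_dim]
    scores = q_ @ k_.transpose(-1, -2) / q_.shape[-1] ** 0.5
    mask = torch.triu(torch.full(scores.shape[-2:], float("-inf"), device=scores.device), diagonal=1)
    out = torch.softmax(scores + mask, dim=-1) @ v_
    return out.transpose(1, 2)
```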

laoda513 commented 1 month ago

For SM >= 80, we continue using flash_attn. If flash_attn is not available, the xformers implementation is 30-50% slower than flash_attn but 30-50% faster than Torch's matmul attention at very long contexts (~100K tokens). For SM < 80, the xformers implementation reduces memory cost similarly to SM >= 80 (it actually uses even less memory than on SM >= 80, but I'm not sure by how much, so I didn't modify the code related to model loading) and is 30-50% faster than Torch's matmul attention.
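
For what it's worth, a hedged sketch of how the SM gate above could be checked (illustrative only, not necessarily what the PR does; `flash_attn_supported` is a made-up name):

```python
import torch

def flash_attn_supported(device_index: int = 0) -> bool:
    # flash_attn kernels require Ampere (SM 8.0) or newer; older parts such as
    # Turing (SM 7.5) or Pascal (SM 6.0) fall through to xformers or matmul.
    major, minor = torch.cuda.get_device_capability(device_index)
    return (major, minor) >= (8, 0)
```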

Ph0rk0z commented 1 month ago

I had a slight clash with the latest dev branch that I fixed, but it seems to be working on the P100 and 2080s. It's a good fallback when you run inference on mixed cards that don't support FA. Can cram a bigger Wizard now. Yes, processing longer contexts takes more time, but that's better than being unable to load or use the model.