turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

FlashAttention-2, 2x faster than FlashAttention #161

Closed: nikshepsvn closed this issue 1 year ago

nikshepsvn commented 1 year ago

https://twitter.com/tri_dao/status/1680987580228308992
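For context, a minimal sketch of what calling the FlashAttention-2 kernel through the flash-attn 2.x Python package looks like; the shapes and values below are illustrative only and are not exllama's actual attention code path:

```python
# Minimal sketch: invoking the FlashAttention-2 kernel via the flash-attn 2.x package.
# Illustrative only -- not exllama's integration.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 1, 2048, 32, 128

# flash_attn_func expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on CUDA.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
v = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")

# causal=True applies the autoregressive mask used for Llama-style decoding.
out = flash_attn_func(q, k, v, causal=True)  # -> (batch, seqlen, nheads, headdim)
```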