innat opened 5 months ago
Describe
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.
Paper: https://arxiv.org/abs/2205.14135 (cited by 671)
Implementation
Hugging Face: https://huggingface.co/docs/text-generation-inference/en/conceptual/flash_attention
Others
A second version has since been released:
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Paper: https://arxiv.org/abs/2307.08691
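For context, the core of the paper is tiled attention computed with an online softmax, so the full (seq_len, seq_len) score matrix is never materialized in HBM. A minimal NumPy sketch of that idea (single head, no masking or dropout; `block_size` is just an illustrative tile size, not a value from the paper):

```python
import numpy as np

def flash_attention(q, k, v, block_size=64):
    # Tiled attention with an online softmax (Algorithm 1 in the paper).
    # Numerically equivalent to softmax(q @ k.T / sqrt(d)) @ v, but it
    # never materializes the full (seq_len, seq_len) score matrix.
    seq_len, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((seq_len, d))
    row_max = np.full(seq_len, -np.inf)  # running max of each score row
    row_sum = np.zeros(seq_len)          # running softmax denominator

    for start in range(0, seq_len, block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = (q @ kb.T) * scale      # (seq_len, block) tile only
        new_max = np.maximum(row_max, scores.max(axis=1))
        # Rescale earlier partial sums to the new running max, then add
        # this block's contribution.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ vb
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]
```

The result matches plain softmax attention up to floating-point error; the actual speedup comes from doing this tiling in on-chip SRAM inside a single fused GPU kernel rather than in NumPy.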
For JAX, we may want to rely on Pallas. For TF, since we can't rely on custom ops, we may have to skip support.
Presumably we should add it in the form of a new backend op, ops.nn.flash_attention.
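To make that concrete, here is a hedged sketch of what a torch-backend body for such an op could look like. The op name and signature are assumptions, not an existing Keras API; the one real API used is PyTorch's fused scaled-dot-product attention, which already dispatches to a FlashAttention kernel when dtype, head dimension, and hardware permit:

```python
import torch.nn.functional as F

def flash_attention(query, key, value, is_causal=False):
    # Hypothetical torch-backend implementation of ops.nn.flash_attention.
    # F.scaled_dot_product_attention selects a fused FlashAttention kernel
    # when the inputs and hardware support it, and otherwise falls back to
    # the unfused math implementation.
    return F.scaled_dot_product_attention(
        query, key, value, is_causal=is_causal
    )
```

A JAX backend could similarly wrap the attention kernels shipped with Pallas, with the public Keras op dispatching per backend.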