ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

Flash attention and flash decoding principles #129

Open RonanKMcGovern opened 6 months ago

RonanKMcGovern commented 6 months ago

Are there plans to add flash attention and also flash decoding to allow for improved performance for long context?
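For context on what is being requested: flash attention avoids materializing the full attention score matrix by streaming keys/values in tiles and maintaining an online softmax. Below is a minimal NumPy sketch of that tiling idea (not MLX code, and not the actual FlashAttention kernel, which fuses this into a single GPU kernel); all function names here are illustrative.

```python
import numpy as np

def naive_attention(q, k, v):
    # Standard attention: materializes the full (n, n) score matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=64):
    # Flash-attention-style blockwise computation with an online softmax:
    # keys/values are processed one tile at a time, so memory use is
    # O(n * block) instead of O(n^2).
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax denominator
    for start in range(0, k.shape[0], block):
        kb = k[start:start + block]
        vb = v[start:start + block]
        s = (q @ kb.T) * scale               # (n, block) tile of scores
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)       # rescale previous accumulators
        p = np.exp(s - m_new[:, None])
        l = l * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ vb
        m = m_new
    return out / l[:, None]
```

Flash decoding applies the same tiling along the key/value axis at inference time, parallelizing across the (long) cached context for a single query step.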

awni commented 6 months ago

We'd love to have this. Our first priority is quantization but when we have the bandwidth we can look into adding Flash attention. (Note PRs are welcome)

BuildBackBuehler commented 6 months ago

> We'd love to have this. Our first priority is quantization but when we have the bandwidth we can look into adding Flash attention. (Note PRs are welcome)

I messaged the maintainer of that project a few days ago because he seemed dedicated to it, and I saw he wanted to implement it in one or two other projects. But in case you can't get ahold of him, you have the link ¯\\\_(ツ)\_/¯

ivanfioravanti commented 2 months ago

This would be amazing! It would enable integration with axolotl!