WIP: Implementation of FlashAttention that works for MHA
Currently works only on machines where the subgroup size equals the tile size (e.g. Intel GPUs).
Works only when the new sequence length is 1.
The other scenarios require more debugging. The algorithm also needs optimization for the sequence-length-1 case, because workgroups are left unused in the way ComputeDotProduct is invoked; a sketch of the issue is below.
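A minimal sketch of the underutilization, assuming ComputeDotProduct is dispatched with one workgroup per tile of the new sequence length per head. The helper and its parameter names (`new_seq_len`, `num_heads`, `tile_size`) are hypothetical and only illustrate the dispatch math, not the actual implementation:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical dispatch-size calculation for illustration only.
struct DispatchDims {
  uint32_t x;  // tiles along the new sequence length
  uint32_t y;  // one workgroup per attention head
};

DispatchDims ComputeDotProductDispatch(uint32_t new_seq_len, uint32_t num_heads,
                                       uint32_t tile_size) {
  // When new_seq_len == 1, x collapses to a single tile, so only num_heads
  // workgroups are launched in total and the rest of the GPU sits idle.
  uint32_t seq_tiles = (new_seq_len + tile_size - 1) / tile_size;
  return {seq_tiles, num_heads};
}

int main() {
  DispatchDims d = ComputeDotProductDispatch(/*new_seq_len=*/1, /*num_heads=*/32,
                                             /*tile_size=*/16);
  std::cout << "workgroups launched: " << d.x * d.y << "\n";  // 32, underutilized
}
```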