intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[Attention Performance] Flash Attention performance gets to 80%~90% of XeTLA #773

Open · Dewei-Wang-sh opened this issue 5 months ago

Dewei-Wang-sh commented 5 months ago

We aim to reach 80%+ of XeTLA performance, using python/tutorials/06-fused-attention.py as the test case.

Shapes are (batch, head, n_ctx, d_head, causal), measured on GPU Max 1100:

- fwd_1x2x1024x32_true: XeTLA median is 4.7 TFLOPS
- fwd_1x2x1024x32_false: XeTLA median is 4.6 TFLOPS
- fwd_4x48x1024x64_true: XeTLA median is 110 TFLOPS
- fwd_4x48x1024x64_false: XeTLA median is 65 TFLOPS
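For reference, the TFLOPS figures above follow from the standard FLOP accounting for the attention forward pass (two matmuls, halved under a causal mask). The helper below is a hypothetical sketch of that arithmetic, not code from the tutorial:

```python
def attention_fwd_tflops(batch, n_head, n_ctx, d_head, causal, time_ms):
    # Two matmuls dominate the forward pass: Q @ K^T and P @ V,
    # each costing 2 * batch * n_head * n_ctx * n_ctx * d_head FLOPs.
    flops_per_matmul = 2.0 * batch * n_head * n_ctx * n_ctx * d_head
    total_flops = 2 * flops_per_matmul
    if causal:
        # A causal mask skips roughly half of each score matrix.
        total_flops *= 0.5
    return total_flops * 1e-12 / (time_ms * 1e-3)

# Example: fwd_4x48x1024x64_false at ~65 TFLOPS corresponds to ~0.8 ms per forward pass.
print(attention_fwd_tflops(4, 48, 1024, 64, causal=False, time_ms=0.79))
```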

Dewei-Wang-sh commented 4 months ago

For the case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 60% of XeTLA on GPU Max 1550. This is an end-to-end run with some hack code, but the result data mismatches.
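As a rough illustration of the kind of check that surfaces such a mismatch, the sketch below compares the kernel under test against PyTorch's scaled_dot_product_attention as a reference. The triton_attention wrapper and the my_triton_flash module are hypothetical names, not part of this repository:

```python
import torch

# Hypothetical wrapper around the Triton flash-attention kernel under test.
from my_triton_flash import triton_attention  # assumption: illustrative module name

batch, n_head, n_ctx, d_head, causal = 4, 48, 1024, 64, False
shape = (batch, n_head, n_ctx, d_head)
q, k, v = (torch.randn(shape, dtype=torch.float16, device="xpu") for _ in range(3))

out_triton = triton_attention(q, k, v, causal=causal)
out_ref = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=causal)

# Loose fp16 tolerances; a failure here is the "result data mismatch" referred to above.
torch.testing.assert_close(out_triton, out_ref, atol=2e-2, rtol=0)
```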

Dewei-Wang-sh commented 4 months ago

Fixed the data mismatch; we now only collect perf data from runs that flush the cache. For the case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 66% of XeTLA on GPU Max 1550.
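A minimal sketch of what "only collect the perf data that flushes cache" means in practice, assuming a PyTorch build with XPU device support (torch.xpu events) and a 256 MB buffer being enough to evict L2 on the target GPU:

```python
import torch

def bench_with_cache_flush(fn, n_iters=100):
    # Dummy buffer assumed large enough to evict the GPU's L2 cache between runs.
    cache = torch.empty(256 * 1024 * 1024, dtype=torch.int8, device="xpu")
    times_ms = []
    for _ in range(n_iters):
        cache.zero_()                      # overwrite cached lines so fn() starts cold
        torch.xpu.synchronize()
        start = torch.xpu.Event(enable_timing=True)
        end = torch.xpu.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.xpu.synchronize()
        times_ms.append(start.elapsed_time(end))
    times_ms.sort()
    return times_ms[len(times_ms) // 2]    # report the median, as in the numbers above
```

Triton's triton.testing.do_bench does something similar internally, clearing a large buffer before each timed run.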