Dewei-Wang-sh opened 5 months ago
For the case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 60% of XeTLA on GPU Max 1550. The end-to-end run works with some hack code, but the result data mismatches.
Fixed the data mismatch; we now only collect perf data from runs that flush the cache. For the case fwd_4x48x1024x64_false (batch, num_head, n_ctx, dim_head, causal), it reaches 66% of XeTLA on GPU Max 1550.
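The flush-then-median methodology can be sketched roughly like this (a minimal illustrative sketch, not the actual harness: `bench_tflops`, `flush`, and the FLOP formula are my assumptions; the real run would flush the GPU cache, e.g. by writing a buffer larger than L2, before each timed iteration):

```python
import statistics
import time

def flops_attention_fwd(batch, heads, n_ctx, d_head, causal=False):
    # Forward attention does two matmuls per head, Q@K^T and P@V,
    # each costing 2 * n_ctx * n_ctx * d_head FLOPs.
    f = 4.0 * batch * heads * n_ctx * n_ctx * d_head
    return f / 2 if causal else f  # causal masking skips ~half the work

def bench_tflops(fn, flops, warmup=3, reps=10, flush=None):
    """Median TFLOPS over `reps` runs; optionally flush the cache first.

    `flush` is a hypothetical callback; on GPU it would touch a buffer
    larger than the last-level cache so every timed run starts cold.
    """
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(reps):
        if flush is not None:
            flush()
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    # Median is robust to outlier runs (clock ramps, scheduling noise).
    return flops / statistics.median(times) / 1e12
```

Reporting the median only over cache-flushed runs avoids the inflated numbers you get when a later iteration reuses data a previous iteration left in cache.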
We aim to reach 80%+ of XeTLA performance, using python/tutorials/06-fused-attention.py as the test case.
XeTLA median throughput on Max 1100 (batch, head, n_ctx, d_head, causal):

- fwd_1x2x1024x32_true: 4.7 TFLOPS
- fwd_1x2x1024x32_false: 4.6 TFLOPS
- fwd_4x48x1024x64_true: 110 TFLOPS
- fwd_4x48x1024x64_false: 65 TFLOPS
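With these baselines, the "% of XeTLA" figures quoted above are just a ratio against the matching case. A small sketch (the dict keys mirror the case names in this thread; the helper name is hypothetical):

```python
# XeTLA median TFLOPS on Max 1100, as reported in this thread.
XETLA_MEDIAN_TFLOPS = {
    "fwd_1x2x1024x32_true": 4.7,
    "fwd_1x2x1024x32_false": 4.6,
    "fwd_4x48x1024x64_true": 110.0,
    "fwd_4x48x1024x64_false": 65.0,
}

def percent_of_xetla(case: str, triton_tflops: float) -> float:
    """Percentage of the XeTLA baseline that a measured Triton run reaches."""
    return 100.0 * triton_tflops / XETLA_MEDIAN_TFLOPS[case]
```

For example, a Triton run measuring 52 TFLOPS on fwd_4x48x1024x64_false would be 80% of the XeTLA baseline, i.e. the stated target.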