Closed — whitneywhtsang closed this 1 week ago
Huge SBIDStalls in the Triton kernel. Also more DistStalls in the Triton kernel (the red line). WIP to investigate and fine-tune.
Status: Ran many fine-tuning configs and experiments. Most fine-tune configs are harmful to performance. Modified the LLIR by hand and executed it; there were no big performance gains either (e.g., while DPAS insts are atomic, we get more spills, which offsets the benefit of atomic DPAS). The reason is higher SBIDStall and DistStall on load/sync insts than XeTLA, which shows we are short of registers. We may need help from the XeTLA team or others.
Cache hit of XeTLA:
Cache hit of Triton:
Triton with PR:
Overall status:
According to the new overall status:
extra N_CTX=512 shapes:
Still need some work to bring 92.99% up to 95%+.
What did we change to make the performance improvement?
I changed the grid info (global_range in the SYCL kernel submit). The main idea is to keep the split-M axis aligned with the num_warps * threads_per_warp axis. For the 32, 32, 512, 64, False
case, the range changes from {{32, 32, 4}, {128, 1, 1}}
to {{4, 32, 32}, {128, 1, 1}}
(in CUDA style), where 4 = N_CTX / BLOCK_M = 512 / 128. This change greatly improves our L3 cache hit rate (about 10x better) for N_CTX=512 cases, but is harmful to some other cases (especially causal=True, small-batch cases, as observed). The detailed mechanism is under investigation.
There are currently 3 shapes (causal=false, d_head=64) that have performance <95% of XeTLA.