intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License
143 stars 44 forks source link

[FA] Improve performance of shapes <95% on advanced path - 32x32x512, 4x32x4096, 2x32x8192 #2442

Closed whitneywhtsang closed 1 week ago

whitneywhtsang commented 1 month ago

There are currently 3 shapes (causal=false, d_head=64) that have performance <95% of XeTLA.

quintinwang5 commented 3 weeks ago

Huge SBIDStalls in Triton kernel. image image Also more DistStalls in Triton kernel(the red line) image image WIP to investigate and finetune.

quintinwang5 commented 2 weeks ago

Status: Did lots of finetunes and experiments. Most finetune configs are harmful to the performance. Modified llir by hand and execute it, there is also no big performance gains(ex. While dpas insts are atomic, we'll have more spills. It will offset the benefit from atomic dpas). The reason is higher SBIDStall and DistStall for load/sync insts than XETLA. This shows we are in short of registers. We may need helps from XETLA team or others.

quintinwang5 commented 2 weeks ago

Cache hit of XETLA: image Cache hit of Triton: image Triton with PR: image

quintinwang5 commented 2 weeks ago

Overall status: image

quintinwang5 commented 1 week ago

According to the new overall status:

extra N_CTX=512 shapes:

Still need some works to bring 92.99% up to 95%+.

tdeng5 commented 1 week ago

What did we change to make the performance improvement?

quintinwang5 commented 1 week ago

What did we change to make the performance improvement?

I change the grid info (global_range in SYCL kernel submit). The main idea is to keep the splited-M axis align with the num_warps * threads_per_warp axis. For 32, 32, 512, 64, False case, the range will change from {{32, 32, 4}, {128, 1, 1}} to {{4, 32, 32}, {128, 1, 1}}(in CUDA style), which 4 = N_CTX / BLOCK_M = 512 / 128. This change will benefit our L3 cache hit greatly(about 10x better) for N_CTX=512 cases, but is harmful to some other cases(especially causal=True, small batch cases observed). The detail mechnism is under investigation.