intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

[Productize Flash Attention performance #6] optimize slm store/load for attention #1466

Closed: Dewei-Wang-sh closed this issue 1 month ago

quintinwang5 commented 3 months ago

Current status: on the same test environment (IDC PVC 1550, driver 1.0.16900.22-914~22.04), the SLM version reaches 88.25 TFLOPS vs. 104.96 TFLOPS for the no-SLM version (peak). The SLM version adds load_block2d and store.slm instructions, which introduce another large stall peak.

Instruction
  /* [00000528] */         mov (16|M0)              r5.0<4>:w     r53.0<1;1,0>:w                   {$3.dst}             //  ALU pipe: int; $67
is potentially stalled by instruction
    /* [00000370] */         load_block2d.ugm.d16.a64.ca.ca (1|M0)  r51:16 [r4:1]        {F@7,$3} // ex_desc:0x0; desc:0x3080203 // $59
Instruction
  /* [00000940] */ (W)     add (1|M0)               r1.0<1>:ud    r2.0<0;1,0>:ud    0x400:uw              {Compacted,$13.src} //  ALU pipe: int; $197
is potentially stalled by instruction
    /* [000008F0] */ (W)     store.slm.d64x64t.a32 (1|M0)  [r1:1]    r11:8              {A@7,$13} // ex_desc:0x0; desc:0x200F704 // $196
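
For what it's worth, such pairs can be found mechanically by matching scoreboard (SBID) tokens in the shader dump: the mov at [00000528] waits on $3.dst, which the load_block2d at [00000370] sets, and the add at [00000940] waits on $13.src, set by the store.slm at [000008F0]. A rough sketch of that scan; the regexes and the last-setter-wins heuristic are simplifying assumptions, not a real SWSB parser:

  import re

  # Hypothetical helper: pair each instruction that waits on an SBID token
  # (e.g. "$3.dst", "$13.src") with the most recent instruction that allocated
  # that token (a bare "$3" in its {...} field). Simplified assumptions: the
  # dump is in address order and tokens are not reused in between.
  ADDR_RE = re.compile(r"/\*\s*\[([0-9A-Fa-f]+)\]\s*\*/")   # /* [00000528] */
  WAIT_RE = re.compile(r"\$(\d+)\.(?:dst|src)")              # consumer side
  SET_RE  = re.compile(r"\$(\d+)(?![.\d])")                  # producer side

  def find_stall_pairs(dump_lines):
      last_setter = {}   # SBID token -> address of the send that set it
      pairs = []         # (waiting instruction, producing instruction)
      for line in dump_lines:
          addr_m = ADDR_RE.search(line)
          addr = addr_m.group(1) if addr_m else "?"
          wait = WAIT_RE.search(line)
          if wait and wait.group(1) in last_setter:
              pairs.append((addr, last_setter[wait.group(1)]))
          set_m = SET_RE.search(line)
          if set_m:
              last_setter[set_m.group(1)] = addr
      return pairs

Run over a full dump in address order, this would pair [00000528] with [00000370] and [00000940] with [000008F0].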
quintinwang5 commented 2 months ago

Status now: 95.4155 TFLOPS (SLM) vs. 104.96 TFLOPS (in the same environment).
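
For reference, a minimal sketch of how TFLOPS figures like these are usually derived from measured kernel time, counting only the two GEMMs of the attention forward pass (the convention the upstream Triton attention tutorial follows). The shapes and the kernel time below are placeholders, not the actual benchmark configuration:

  # Hypothetical numbers: shapes and kernel time are placeholders, not the
  # configuration behind the TFLOPS figures reported in this issue.
  BATCH, N_HEADS, N_CTX, D_HEAD = 4, 48, 4096, 64
  ms = 9.3  # measured kernel time in milliseconds (placeholder)

  # Forward pass: two GEMMs per (batch, head), Q @ K^T and P @ V,
  # each 2 * N_CTX * N_CTX * D_HEAD FLOPs.
  flops = 2 * 2.0 * BATCH * N_HEADS * N_CTX * N_CTX * D_HEAD
  tflops = flops * 1e-12 / (ms * 1e-3)
  print(f"{tflops:.2f} TFLOPS")  # ~88.7 TFLOPS for these placeholder values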

quintinwang5 commented 2 months ago

Same as https://github.com/intel/intel-xpu-backend-for-triton/issues/1463.