Closed Dewei-Wang-sh closed 1 month ago
Current status, on the same test environment, IDC PVC 1550 (1.0.16900.22-914~22.04): peak throughput is 88.25 TFLOPS with SLM vs. 104.96 TFLOPS without SLM. The SLM version adds load_block2d and store.slm instructions, which cause another large stall peak.
SLM
no SLM
Instruction /* [00000528] */ mov (16|M0) r5.0<4>:w r53.0<1;1,0>:w {$3.dst} // ALU pipe: int; $67 is stalled potentially by instruction /* [00000370] */ load_block2d.ugm.d16.a64.ca.ca (1|M0) r51:16 [r4:1] {F@7,$3} // ex_desc:0x0; desc:0x3080203 // $59
Instruction /* [00000940] */ (W) add (1|M0) r1.0<1>:ud r2.0<0;1,0>:ud 0x400:uw {Compacted,$13.src} // ALU pipe: int; $197 is stalled potentially by instruction /* [000008F0] */ (W) store.slm.d64x64t.a32 (1|M0) [r1:1] r11:8 {A@7,$13} // ex_desc:0x0; desc:0x200F704 // $196
Status now: 95.4155 TFLOPS (SLM) vs. 104.96 TFLOPS (no SLM), in the same environment.
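To put the reported numbers side by side, here is a small sketch that expresses each SLM result as a fraction of the no-SLM peak. The function name `relative_throughput` and the variable names are made up for illustration; only the three TFLOPS figures come from this issue.

```python
# Peak throughput figures quoted in this issue (TFLOPS).
NO_SLM_PEAK = 104.96
SLM_BEFORE = 88.25     # SLM version at the time of the first status update
SLM_NOW = 95.4155      # SLM version at the time of the later status update


def relative_throughput(measured: float, baseline: float) -> float:
    """Return measured throughput as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return measured / baseline


if __name__ == "__main__":
    for label, value in (("SLM (before)", SLM_BEFORE), ("SLM (now)", SLM_NOW)):
        ratio = relative_throughput(value, NO_SLM_PEAK)
        gap = 1.0 - ratio
        print(f"{label}: {ratio:.2%} of no-SLM peak, gap {gap:.2%}")
```

With these inputs the SLM version goes from roughly 84% to roughly 91% of the no-SLM peak, so the remaining gap is about 9%.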
Same as https://github.com/intel/intel-xpu-backend-for-triton/issues/1463