Closed Dewei-Wang-sh closed 4 months ago
The ultimate goal is to bring gemm 4Kx4Kx4K performance up to 80% of XeTLA, starting with simple cases. Since the dpas instruction on PVC computes an 8x16 = 8x16 * 16x16 tile, the first test cases are:
- gemm.8x16x1024.mlir
- gemm.16x32x1024.1wg.4warp.mlir
- gemm.16x32x1024.4wg.1warp.mlir
- gemm.16x32x1024.1wg.1warp.mlir
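As a quick illustration (not part of the issue itself), the test shapes above were chosen so they tile evenly into dpas tiles; a minimal Python sketch of that divisibility check, with the dpas dimensions hard-coded from the description above:

```python
# Illustrative sketch: check whether an MxNxK gemm tiles evenly into
# PVC dpas tiles, where one dpas computes C[8x16] += A[8x16] * B[16x16].
DPAS_M, DPAS_N, DPAS_K = 8, 16, 16

def tiles_evenly(m: int, n: int, k: int) -> bool:
    """True if an MxNxK gemm splits into whole dpas tiles."""
    return m % DPAS_M == 0 and n % DPAS_N == 0 and k % DPAS_K == 0

# The first test cases listed above all tile evenly:
for m, n, k in [(8, 16, 1024), (16, 32, 1024)]:
    assert tiles_evenly(m, n, k)
```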
Is this work item intended to show MLIR code snippets as examples, or is it going to be used to modify the Triton compiler (backend)?
Also, I understand that in order to use SIMD codegen we need additional information in the Triton source, conveyed by Triton block pointers. Therefore, in the absence of block pointers, the codegen for tt.dot
would follow the SIMT model. Please confirm.
Yes Ettore, you are right: the SIMD path aims to support cases with block pointers, while the traditional tensor-of-pointers path will stay as is. I'm going to add a few more passes to bring triton -> tritongpu into a form suitable for mapping to vc-intrinsics, and the tests above serve as a starting point.
12/25/2023: add tritongpu/xegpu/spirv test cases for a simple gemm: https://github.com/intel/intel-xpu-backend-for-triton/compare/main...gemm_simd
1/8/2024: add ttg-to-spirv (vc-intrinsics) lowering; update tests; fix ttg/spirv runtime errors. It can now run from ttg source and produce correct results.
1/15/2024: add necessary Triton patches; add Triton ops; add end-to-end test; add pass tritongpu-distribute-to-warps (commit)
1/23/2024: bug fixes; gemm_256x256x1024 now runs correctly
1/29/2024: add a pass tritongpu-match-target-size that matches the target size of specific ops (dot, load, store). E.g. 32x64xf32 = tt.dot 32x32xf16, 32x64xf16 can be split into multiple tt.dot ops matching the dpas size (8x16): 8x16xf32 = tt.dot 8x16xf16, 16x16xf16
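A hedged sketch (assumed, not the actual pass implementation) of the splitting described above: the large dot is decomposed into a grid of dpas-sized sub-dots, one per (m, n, k) tile offset.

```python
# Illustrative model of tritongpu-match-target-size splitting:
# C[32x64] = tt.dot A[32x32], B[32x64] decomposes into 8x16 = 8x16 * 16x16
# sub-dots. This just enumerates the sub-tile origins; the real pass
# rewrites IR, which this sketch does not attempt.
def split_dot(M, N, K, tm=8, tn=16, tk=16):
    """Enumerate (m, n, k) offsets of each dpas-sized sub-dot."""
    return [(m, n, k)
            for m in range(0, M, tm)
            for n in range(0, N, tn)
            for k in range(0, K, tk)]

subdots = split_dot(32, 64, 32)
# (32/8) * (64/16) * (32/16) = 4 * 4 * 2 = 32 sub-dots
assert len(subdots) == 32
```

The sub-dots sharing the same (m, n) origin accumulate into the same 8x16 output tile across the k offsets.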
02/07/2024: refactor the pass tritongpu-match-target-size and fix bugs; gemm_256x256x1024 again runs correctly.
@Dewei-Wang-sh would you mind sharing the code for this feature in a draft PR? It is OK to work on a draft PR even if the code is not in its final form. Having a draft PR helps to get a preview of the direction the work is going and is a good way to gather early feedback.
Sure, let me port the code to the llvm-target branch.
02/26/2024: starting from a handwritten, optimized tritongpu IR and lowering it to spirv, it can now reach ~290 TFlops for a 4k*4k*4k f16 gemm.
03/04/2024: add a pass convert-triton-to-tritongpu-warp that converts Triton to TritonGPU with a blocked layout (warp-distribute annotation); it runs correctly end-to-end on a 4k*4k*4k fp16 gemm.
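To make the warp-distribute annotation concrete, here is a hypothetical sketch (the tile and warp-grid sizes are assumptions, not taken from the pass) of how a workgroup-level output tile could be divided among warps, each warp owning one contiguous sub-tile:

```python
# Illustrative model of warp distribution: map each warp id to the
# (row, col) origin of its sub-tile within the workgroup's C tile.
def warp_tiles(wg_m, wg_n, warps_m, warps_n):
    """Assumes wg_m/wg_n divide evenly by the warp grid."""
    tm, tn = wg_m // warps_m, wg_n // warps_n
    return {wm * warps_n + wn: (wm * tm, wn * tn)
            for wm in range(warps_m)
            for wn in range(warps_n)}

# e.g. a 256x256 workgroup tile over a 4x8 warp grid -> 64x32 per warp
tiles = warp_tiles(256, 256, 4, 8)
assert len(tiles) == 32 and tiles[0] == (0, 0)
```

Each warp's sub-tile is then what the later tritongpu-match-target-size pass splits further down to dpas-sized pieces.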
```
Kernel test_kernel : 239 registers
the kernel execution time is (ms, on L0 runtime): avg: 0.9411, min: 0.9270, max: 0.9726 (over 100 runs)
Kernel test_kernel : 239 registers
the kernel execution time is (ms, on L0 runtime): avg: 0.9420, min: 0.9272, max: 0.9755 (over 100 runs)
Kernel test_kernel : 239 registers
the kernel execution time is (ms, on L0 runtime): avg: 0.9435, min: 0.9266, max: 0.9702 (over 100 runs)
Kernel test_kernel : 239 registers
```

Using the best min time (0.9270 ms):

```
>>> 2*4096**3/0.9270/1e9
148 TFlops
```
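The arithmetic above as a reusable helper: a gemm does 2*M*N*K flops, and with the time measured in milliseconds, dividing by 1e9 yields TFlop/s.

```python
# TFlop/s for an MxNxK gemm given its execution time in milliseconds.
def gemm_tflops(m: int, n: int, k: int, ms: float) -> float:
    return 2 * m * n * k / ms / 1e9

# The 4k gemm at the best min time above lands at ~148 TFlops:
assert round(gemm_tflops(4096, 4096, 4096, 0.9270)) == 148
```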
All the sub-issues are closed with their PRs merged.
Get Triton gemm performance to 80% of oneDNN/XeTLA, utilizing genISA/vc-intrinsics. The lowering pipeline is "triton -> tritongpu -> optimized/simplified tritongpu -> llvm/spirv".
This serves as an umbrella issue including:
- #406
- #407
- #408
- #612
- #613