intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

GEMM Block-pointer Path #171

Closed: Dewei-Wang-sh closed this issue 4 months ago

Dewei-Wang-sh commented 8 months ago

Get Triton GEMM performance to 80% of oneDNN/XeTLA by utilizing GenISA/vc-intrinsics. The lowering pipeline would be: triton -> tritongpu -> optimized/simplified tritongpu -> llvm/spirv.

This serves as an umbrella issue covering the related sub-issues.

Dewei-Wang-sh commented 8 months ago

The ultimate goal is to bring 4Kx4Kx4K GEMM performance up to 80% of XeTLA, though we will begin with simple cases. Since the DPAS size on PVC is 8x16 = 8x16 * 16x16, the first test cases are listed below:
gemm.8x16x1024.mlir
gemm.16x32x1024.1wg.4warp.mlir
gemm.16x32x1024.4wg.1warp.mlir
gemm.16x32x1024.1wg.1warp.mlir

etiotto commented 8 months ago

Is this work item intended to show MLIR code snippets as examples, or is it going to be used to modify the Triton compiler (backend)?

Also, I understand that in order to use SIMD codegen we need additional information in the Triton source, information conveyed by Triton block pointers. Therefore, in the absence of block pointers, the codegen for tt.dot would follow the SIMT model. Please confirm.

Dewei-Wang-sh commented 8 months ago

> Is this work item intended to show MLIR code snippets as examples, or is it going to be used to modify the Triton compiler (backend)?
>
> Also, I understand that in order to use SIMD codegen we need additional information in the Triton source, information conveyed by Triton block pointers. Therefore, in the absence of block pointers, the codegen for tt.dot would follow the SIMT model. Please confirm.

Yes Ettore, you are right: the SIMD path aims to support cases with block pointers, while the traditional tensor-of-pointers path will stay as is. I'm going to add a few more passes to bring the triton -> tritongpu lowering into a form suitable for mapping to vc-intrinsics, and the tests above serve as a starting point.
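
To make the distinction concrete, a minimal Triton GEMM kernel written with block pointers might look like the sketch below. It is illustrative only: the kernel name, tile sizes, and layouts are assumptions, not code from this issue.

```python
# Hypothetical GEMM kernel using Triton block pointers (tl.make_block_ptr / tl.advance).
import triton
import triton.language as tl


@triton.jit
def gemm_block_ptr_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                          stride_am, stride_ak, stride_bk, stride_bn,
                          stride_cm, stride_cn,
                          BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                          BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Block pointers carry base, shape, strides, and order of each tile: the
    # structured-access information the block-pointer (SIMD) path relies on.
    a_bptr = tl.make_block_ptr(base=a_ptr, shape=(M, K),
                               strides=(stride_am, stride_ak),
                               offsets=(pid_m * BLOCK_M, 0),
                               block_shape=(BLOCK_M, BLOCK_K), order=(1, 0))
    b_bptr = tl.make_block_ptr(base=b_ptr, shape=(K, N),
                               strides=(stride_bk, stride_bn),
                               offsets=(0, pid_n * BLOCK_N),
                               block_shape=(BLOCK_K, BLOCK_N), order=(1, 0))
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_bptr, boundary_check=(0, 1))
        b = tl.load(b_bptr, boundary_check=(0, 1))
        acc += tl.dot(a, b)                        # lowers to tt.dot on block operands
        a_bptr = tl.advance(a_bptr, (0, BLOCK_K))
        b_bptr = tl.advance(b_bptr, (BLOCK_K, 0))
    c_bptr = tl.make_block_ptr(base=c_ptr, shape=(M, N),
                               strides=(stride_cm, stride_cn),
                               offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                               block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tl.store(c_bptr, acc.to(tl.float16), boundary_check=(0, 1))
```

A tensor-of-pointers version of the same kernel would instead build per-element pointer tensors and masks; that form carries less structural information and stays on the existing SIMT codegen path, as noted above.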

Dewei-Wang-sh commented 8 months ago

12/25/2023: add tritongpu/xegpu/spirv test cases for a simple GEMM: https://github.com/intel/intel-xpu-backend-for-triton/compare/main...gemm_simd

Dewei-Wang-sh commented 7 months ago

1/8/2024: add the ttg2spirv (vc-intrinsics) lowering; update tests; fix a ttg/spirv runtime error. It can now run from ttg source and produce correct results.

Dewei-Wang-sh commented 7 months ago

1/15/2024: add the necessary Triton patches and Triton ops; add an end-to-end test; add the pass tritongpu-distribute-to-warps (commit).
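
As a rough illustration of what distributing a workgroup tile across warps means, here is a plain-Python sketch. The tile shape and warp grid are made-up values for illustration; the real pass works on TritonGPU IR and annotates layouts rather than computing offsets like this.

```python
def warp_tiles(block_m, block_n, warps_m, warps_n):
    """Split a block_m x block_n workgroup tile of C among warps_m x warps_n warps.

    Returns, per warp id, the (row, col) offset and the shape of its sub-tile.
    Illustrative only; not the actual tritongpu-distribute-to-warps pass.
    """
    tile_m, tile_n = block_m // warps_m, block_n // warps_n
    return {
        wid: ((wid // warps_n * tile_m, wid % warps_n * tile_n), (tile_m, tile_n))
        for wid in range(warps_m * warps_n)
    }


# e.g. a 128x128 C tile split across a 4x2 warp grid: each warp owns a 32x64 sub-tile
print(warp_tiles(128, 128, 4, 2)[0])   # ((0, 0), (32, 64))
print(warp_tiles(128, 128, 4, 2)[7])   # ((96, 64), (32, 64))
```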

Dewei-Wang-sh commented 7 months ago

1/23/2024: bug fixes; gemm_256x256x1024 now runs correctly.

Dewei-Wang-sh commented 7 months ago

1/29/2024: add a pass tritongpu-match-target-size that matches the target size of specific ops (dot, load, store). For example, 32x64xf32 = tt.dot 32x32xf16, 32x64xf16 can be split into multiple tt.dot ops that match the DPAS size (8x16): 8x16xf32 = tt.dot 8x16xf16, 16x16xf16.
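
A minimal plain-Python sketch of the splitting arithmetic (not the actual pass, which rewrites TritonGPU IR): a 32x64 = 32x32 * 32x64 dot decomposes into (32/8) * (64/16) * (32/16) = 4 * 4 * 2 = 32 DPAS-sized sub-dots, each 8x16 = 8x16 * 16x16, accumulated along K.

```python
def split_dot(M, N, K, dpas_m=8, dpas_n=16, dpas_k=16):
    """Enumerate DPAS-sized sub-dots for an MxN = MxK * KxN tt.dot.

    Illustrative only: lists the (C, A, B) tile offsets of each sub-dot.
    """
    sub_dots = []
    for m in range(0, M, dpas_m):          # rows of the result tile
        for n in range(0, N, dpas_n):      # columns of the result tile
            for k in range(0, K, dpas_k):  # accumulation steps along K
                sub_dots.append(((m, n), (m, k), (k, n)))
    return sub_dots


# 32x64xf32 = tt.dot 32x32xf16, 32x64xf16 -> 4 * 4 * 2 = 32 sub-dots of 8x16 = 8x16 * 16x16
print(len(split_dot(32, 64, 32)))  # 32
```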

Dewei-Wang-sh commented 6 months ago

02/07/2024: refactor the pass tritongpu-match-target-size and fix bugs; gemm_256x256x1024 runs correctly.

etiotto commented 6 months ago

@Dewei-Wang-sh would you mind sharing the code for this feature in a draft PR? It is OK to work on a draft PR even if the code is not in its final form. Having a draft PR helps to get a preview of the direction the work is going and is a good way to gather early feedback.

Dewei-Wang-sh commented 6 months ago

> @Dewei-Wang-sh would you mind sharing the code for this feature in a draft PR? It is OK to work on a draft PR even if the code is not in its final form. Having a draft PR helps to get a preview of the direction the work is going and is a good way to gather early feedback.

Sure, let me port the code to the llvm-target branch.

Dewei-Wang-sh commented 6 months ago

02/26/2024: starting from a handwritten, optimized tritongpu IR and lowering it to spirv, it can now reach ~290 TFlops for a 4Kx4Kx4K f16 GEMM.

Dewei-Wang-sh commented 5 months ago

03/04/2024: add a pass convert-triton-to-tritongpu-warp that converts Triton to TritonGPU with a blocked layout (warp-distribution annotation). It now runs correctly end-to-end for a 4Kx4Kx4K fp16 GEMM:

    Kernel test_kernel : 239 registers
    the kernel execution time is (ms, on L0 runtime): avg: 0.9411, min: 0.9270, max: 0.9726 (over 100 runs)
    Kernel test_kernel : 239 registers
    the kernel execution time is (ms, on L0 runtime): avg: 0.9420, min: 0.9272, max: 0.9755 (over 100 runs)
    Kernel test_kernel : 239 registers
    the kernel execution time is (ms, on L0 runtime): avg: 0.9435, min: 0.9266, max: 0.9702 (over 100 runs)
    Kernel test_kernel : 239 registers

    >>> 2*4096**3/0.9270/1e9
    148 TFlops
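
For clarity, that figure is the standard 2*M*N*K GEMM FLOP count divided by the best (min) kernel time above, with the ms-to-seconds conversion made explicit in this small sketch:

```python
# 2*M*N*K FLOPs for a 4096x4096x4096 GEMM, using the best (min) kernel time above.
flops = 2 * 4096**3          # total floating-point operations
best_time_s = 0.9270e-3      # 0.9270 ms on the L0 runtime
tflops = flops / best_time_s / 1e12
print(f"{tflops:.1f} TFLOPS")  # ~148 TFLOPS, matching the figure above
```
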
Dewei-Wang-sh commented 4 months ago

All the sub-issues have been closed and their PRs merged.