intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[DPAS]: Initial implementation for Triton's `tt.dot` operation using the DPAS instruction #145

Closed: etiotto closed this issue 6 months ago

etiotto commented 8 months ago

Currently the tt.dot operation is lowered to a loop containing scalar (FMA) instructions. This works from a functional perspective but performs poorly.

This work item entails extending the conversion code for tt.dot so that it generates DPAS instruction(s) via the GENX dialect operation @llvm.genx.GenISA.sub.group.dpas. In this first step performance is not the main objective. The operands of the tt.dot operation are expected to be put into shared local memory by converting from a blocked layout to a shared layout (as is the case for NVIDIA). Once in shared memory the operands are converted to a dot layout (with an underlying DPAS layout).

Efforts/experiments to elide the blocked-to-shared layout conversions should be handled separately from this work item.
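For context, a minimal Triton matmul kernel whose tl.dot call is what lowers to the tt.dot operation discussed here. This is an illustrative sketch, not code from this repo; block sizes are arbitrary and M, N, K are assumed to be multiples of the block sizes so masking can be omitted.

```python
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn,
                  stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for _ in range(0, K, BLOCK_K):
        a = tl.load(a_ptrs)          # assumes no masking is needed
        b = tl.load(b_ptrs)
        acc += tl.dot(a, b)          # becomes tt.dot; the op this issue maps to DPAS
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)
```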

etiotto commented 8 months ago

@chengjunlu I am hoping you will be able to take your DPAS support from the SPIR-V-generating Triton compiler and port the code to this new branch.

chengjunlu commented 8 months ago

> @chengjunlu I am hoping you will be able to take your DPAS support from the SPIR-V-generating Triton compiler and port the code to this new branch.

Yes, I will start backporting the DPAS feature to the LLVM target. Before backporting DPAS to the LLVM target branch, I'd like to update that branch to the latest Triton main branch, because two PRs that decouple the MMA layout attributes from CUDA have landed there. That will help us reuse the TritonGPU optimization passes.

The steps for supporting tt.dot would be:

Functionality:

  1. Update the LLVM target branch to the latest Triton.
  2. Backport the tt.dot feature for functionality.

Performance optimization:

  1. Backport the loop software pipelining with prefetch ops for Intel PVC/ATSM to use the cache instead of SLM.
  2. Backport the backward layout propagation of tt.dot to remove unnecessary layout conversions through SLM.

I will start the task in January 2024.

etiotto commented 8 months ago

The LLVM branch is pretty close to the top of tree. We merged a PR today to move it up a bit. OpenAI has bumped the version of LLVM/MLIR they use, so we will have to merge that commit into the GENX branch repo first. We can discuss the steps in January.

The tt.dot operation is functional when we fall back to generating a loop with FMA instructions. We already have support for the DPAS instruction in the GENX dialect, and we have some support for the DPASLayout (the counterpart of the MMA layout). So some codegen support is in the branch already.

The plan for performance optimizations is good.

pengtu commented 7 months ago

Comparing NV vs. GEN, the difference starts with the ttgir: the NV version uses the #mma layout whereas the GEN version uses the #blocked layout for the dot operator. Hence the difference is in the Triton IR to Triton GPU IR lowering. Please see the attached ttgir files for details: matmul_kernel_nv.ttgir.txt, matmul_kernel_pvc.ttgir.txt

chengjunlu commented 7 months ago

The vectorization in IGC for the DPAS intrinsic is a little complicated. I found a packing configuration for the DPAS A and B operands that works for subgroup sizes 8 and 16 on both ATSM and PVC.

The workable packing is here: https://github.com/chengjunlu/llvm/blob/947fcc6f2f781a6f49c2ce2f737f74b65222e573/mlir/lib/Target/LLVMIR/Dialect/GENX/GENXToLLVMIRTranslation.cpp#L148

I am still looking for a workable packing configuration for subgroup size 32 on ATSM and PVC.

tdeng5 commented 7 months ago

The functionality (DPAS lowering) is ready and the GEMM test case passes. Will provide a GenX dialect PR for review.
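For reference, a minimal sketch of the kind of GEMM correctness check being described, using the illustrative matmul_kernel from the sketch earlier in this thread and assuming an XPU-enabled PyTorch (intel_extension_for_pytorch); sizes and tolerances are arbitrary.

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  (assumed to register the "xpu" device)

def check_gemm(M=256, N=256, K=256):
    a = torch.randn((M, K), device="xpu", dtype=torch.float16)
    b = torch.randn((K, N), device="xpu", dtype=torch.float16)
    c = torch.empty((M, N), device="xpu", dtype=torch.float32)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32
    grid = (M // BLOCK_M, N // BLOCK_N)
    # matmul_kernel is the illustrative kernel sketched earlier in this thread.
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1),
                        b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    ref = torch.matmul(a.float(), b.float())
    assert torch.allclose(c, ref, rtol=1e-2, atol=1e-2)
```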

aregm commented 7 months ago

@tdeng5 did it pass the unit tests fully?

chengjunlu commented 7 months ago

@kurapov-peter I am working on enabling the DPAS on both PVC and ATSM. https://github.com/intel/intel-xpu-backend-for-triton/pull/356

I have a question: how do I get the device ID in L0? Or is there any other property I can use to check whether the device is PVC or ATSM?

kurapov-peter commented 7 months ago

@chengjunlu, there's a ze_device_properties_t (see https://spec.oneapi.io/level-zero/latest/core/api.html#ze-device-properties-t) that you can query with zeDeviceGetProperties. You probably don't actually need the ID, though, but rather the compute properties, which contain the maximum launch parameters and similar; those can be queried with zeDeviceGetComputeProperties.

kurapov-peter commented 7 months ago

And if you want to query available features, those are usually in the module properties.

chengjunlu commented 7 months ago

Hi Peter, thanks for the information. L0 is meant to be an abstraction over different accelerators, including GPUs, VPUs, and so on. It seems hard to get the Intel GPU's hardware capabilities from L0 directly because they may not be general enough to expose, e.g., whether there is a DMA engine controlled by the XeCore for moving data asynchronously.

I think the easiest way is to use the device ID and keep that information in Triton for the short term. I can use the PCI device ID to identify ATSM or PVC.

Can I get that PCI device ID, or something else I can use to identify the GPU architecture?

pengtu commented 7 months ago

@chengjunlu: sycl::device::get_info() returns a string like the following: Intel(R) UHD Graphics [0x9a60]. The PCI Device ID is inside [].

@Sarbojit2019: We may want to add the device information to the Intel driver.py, similar to CUDA's 'capability', to pass the device architecture information to compile.py.
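A minimal sketch, assuming the device-name format shown above, of how a Python driver could extract the PCI device ID and expose it in a CUDA-capability-like target descriptor. The helper names and the tuple shape are hypothetical, not necessarily what the repo's driver.py or get_current_target() end up doing.

```python
import re

def pci_device_id_from_name(device_name: str) -> int:
    """Pull the bracketed PCI device ID out of a string such as
    'Intel(R) UHD Graphics [0x9a60]' (hypothetical helper)."""
    m = re.search(r"\[0x([0-9a-fA-F]+)\]", device_name)
    if m is None:
        raise ValueError(f"no PCI device ID found in {device_name!r}")
    return int(m.group(1), 16)

def current_target_sketch(device_name: str):
    """Return a (backend, arch) pair analogous to CUDA's ('cuda', capability);
    here arch is the PCI device ID, which the compiler can map to PVC or ATSM."""
    return ("xpu", pci_device_id_from_name(device_name))

# Example: current_target_sketch("Intel(R) UHD Graphics [0x9a60]") -> ("xpu", 0x9a60)
```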

kurapov-peter commented 7 months ago

Yup, there are deviceId and vendorId fields in ze_device_properties_t. You can get the same ID from SYCL, as Peng mentions, too.

pengtu commented 7 months ago

I created a PR that adds this feature to get_current_target(): #369

chengjunlu commented 7 months ago

Most of the changes to lower tt.dot to DPAS have been backported to the LLVM target branch.

I am still debugging two new issues in the LLVM target (I am working on ATSM so far):

  1. The MNK = 16-16-16 test case with 2-stage pipelining on a single warp hits a critical correctness issue: 4x8 values of the total 8x8 DPAS results are all 0. The issue may be related to LLVM IR optimization.
  2. DPAS with the TF32 data type does not produce the correct values.

vlad-penkin commented 7 months ago

@chengjunlu as per our offline discussion yesterday, could you please create new tickets for the remaining issues (for example, unsupported data types) and proceed with the PR merge.

chengjunlu commented 7 months ago

> @chengjunlu as per our offline discussion yesterday, could you please create new tickets for the remaining issues (for example, unsupported data types) and proceed with the PR merge.

Yes, I will create new tickets for the issues I hit during testing. There are three remaining issues:

  1. The DPAS results are incorrect when the D type is fp16: #400
  2. The DPAS results are incorrect when the A and B types are TF32: #402
  3. LLVM O3 optimization on the Triton side causes incorrect results in the 16x16x16 single-warp test case: #403

chengjunlu commented 7 months ago

There are 3 PRs in total for this task: 1 for the GenX dialect and 2 for Triton XPU.

The GenX dialect PR is here: https://github.com/intel/llvm/pull/12554

The first Triton XPU PR adds support for a variable threads_per_warp number in the frontend when generating the kernel, and changes the default threads_per_warp to 16: https://github.com/intel/intel-xpu-backend-for-triton/pull/414

The PR for lowering tt.dot to DPAS is: https://github.com/intel/intel-xpu-backend-for-triton/pull/356 It depends on the previous two being merged.