Closed etiotto closed 6 months ago
@chengjunlu I am hoping you will be able to take your DPAS support in the SPIRV generating Triton compiler and port the code in this new branch.
@chengjunlu I am hoping you will be able to take your DPAS support in the SPIRV generating Triton compiler and port the code in this new branch.
Yes. I will start to backport the DPAS feature to LLVM target. Before backporting the DPAS to the LLVM target branch, I'd like to update the LLVM target branch to the latest Triton main branch. Because there are two PRs for decoupling the MMA layout attributes with CUDA has been landed on that. It can help us to reuse the TritonGPU optimization passes.
The steps for supporting the tt.dot
would be:
Functionalitiy:
tt.dot
feature for functionality. Performance optimization:
tt.dot
to remove un-necessary layout conversion thru SLM.I will start the task in Jan.2024.
The LLVM branch is pretty close to the top of tree. We merged a PR today to move it up a bit. OpenAI has bumped up the version of LLVM/MLIR they use, and so we will have to merge that commit into the GENX branch repo first. We can discuss the steps in Jan.
The tt.dot
operation is functional when we fallback into generating a loop with FMA instructions. We have support for the DPAS instruction in the GENX dialect already and we have some support for the DPASLayout (counterpart of the MMA layout). So some codegen support is in the branch already.
The plan for performance optimizations is good.
Comparing NV vs GEN, the difference starts with the ttgir: NV version uses #mma layout where the GEN version uses #block layout for the dot operator. Hence the difference is in the Triton IR to Triton GPU IR lowering. Please see the attached ttgir files for details. matmul_kernel_nv.ttgir.txt matmul_kernel_pvc.ttgir.txt
The vectorization in IGC for the DPAS intrinsic is a little complicate. I found a DPAS operands A and B packing configuration that workable for both the subgroup size 8/16 for both the ATSM and PVC.
The workable packing is here: https://github.com/chengjunlu/llvm/blob/947fcc6f2f781a6f49c2ce2f737f74b65222e573/mlir/lib/Target/LLVMIR/Dialect/GENX/GENXToLLVMIRTranslation.cpp#L148
I am looking forward the workable packing configuration for the subgroup size 32 for the ATSM and PVC.
The functionality (dpas lowing) is ready, GEMM test case can pass. Will provide a GENX Dialect PR for review.
@tdeng5 did it pass the unit tests fully?
@kurapov-peter I am working on enabling the DPAS on both PVC and ATSM. https://github.com/intel/intel-xpu-backend-for-triton/pull/356
I have a question that how to get the device ID in L0. Or is there any other properties I can use to check the device is PVC or ATSM?
@chengjunlu, there's a ze_device_properties_t
(see https://spec.oneapi.io/level-zero/latest/core/api.html#ze-device-properties-t) that you can query with zeDeviceGetProperties. You probably don't actually need the ID though, but the compute properties that contain the maximum launch parameters and similar? Those can be queried with zeDeviceGetComputeProperties.
And if you want to query available features those are usually in the module properties.
Hi Peter, Thanks for the information. The L0 is going to be abstracted for different accelerators including GPU and VPU and so on. It seems it is hard to get the Intel GPU's hardware capability from L0 directly because that maybe not general. Like whether there is DMA engine controlled by XeCore for moving the data asynchronisely.
I think the easiest way is using the device ID and we keep the information in Triton for short time. I can use the PCI Device ID to identify the ATSM or PVC.
Can I get that PCI Device ID or something I can use to identify the GPU arch?
@chengjunlu: sycl::device::get_info
@Sarbojit2019: We may want to add the device information to the Intel driver.py, similar to CUDA's 'capability' to pass the device architecture information to the compile.py
Yup, there are deviceId
and vendorId
fields in the ze_device_properties_t
. You can get the same id from sycl as Peng mentions too.
I created a PR that add this feature to get_current_target() #369
The most part of changes to lower the tt.dot
to dpas
has been back ported to the LLVM target branch.
I am still debugging two new issues in the LLVM target: (I am working on ATSM so far)
dpas
couldn't output the correct value. @chengjunlu as per our offline discussion yesterday could you please create new tickets for remaining issues, for example unsupported data dtypes and proceed further with the PR merge.
@chengjunlu as per our offline discussion yesterday could you please create new tickets for remaining issues, for example unsupported data dtypes and proceed further with the PR merge.
Yes. I will create the new tickets for the issues I met in the testing. There are three remaining issues:
There are total 3 PRs for this task: 1 for GenX dialect and 2 for Triton XPU.
The GenX dialect PR is here: https://github.com/intel/llvm/pull/12554
The first one to Triton XPU is to support variant threads_per_warp number in FE to generate the kernel. And change the default threads_per_warp number to 16. https://github.com/intel/intel-xpu-backend-for-triton/pull/414
The PR for lowering the tt.dot
to DPAS
is:
https://github.com/intel/intel-xpu-backend-for-triton/pull/356
It depends on the previous two got merged.
Currently the
tt.dot
operation is lowered to a loop containing scalar (FMA) instructions. This works from a functional perspective but performs poorly.This work item entails extending the conversion code for
tt.dot
so that it generates DPAS instruction(s) via the GENX dialect operation@llvm.genx.GenISA.sub.group.dpas
In this first step performance is not the main objective. The operand of thett.dot
operation are expected to be put into shared local memory by converting from a block layout to a shared layout (like it is the case for NVidia). Once in shared memory the operands are to be converted to a dot layout (with a underlying DPAS layout).Efforts/experiments to elide the blocked layout to shared layout conversions, should be handled separately from this work item.