Closed: whitneywhtsang closed this issue 6 months ago
Since https://github.com/intel/llvm/commit/40a18fa33230a92566dc348aec70da1e253f65ca, we changed the lowering from `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32` to `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32`. The motivation is better code generation when IGC does vectorization. `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32` cannot be generated using OpenCL functions, because `int8 intel_sub_group_i8_i8_matrix_mad_k32(int8 a, int8 b, int8 acc);` is not supported on PVC.
With https://github.com/intel/intel-xpu-backend-for-triton/pull/356, a number of DPAS test cases pass with `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32` on PVC. Is using `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32` on PVC undefined behavior? If it is, then we need to change the lowering back to `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32` and improve code generation a different way, e.g., by having more control over IGC's vectorization optimization. Or can IGC be changed to allow the v4i32 type for operand A on PVC?
The `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32` intrinsic requires a register layout that packs 2 columns of operand 'a' into an i16 in each SIMD lane: with a SIMD execution width of 16, the number of elements to pack into one SIMD lane is 32/16 = 2. The `@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32` intrinsic requires a register layout that packs 4 columns of operand 'a' into an i32, which only works with SIMD8, since the number of elements to pack into each SIMD lane is 32/8 = 4.
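The packing arithmetic above can be sketched as follows. This is only an illustrative model; the helper names are not IGC or OpenCL APIs:

```python
# Model how operand 'a' of an i8 DPAS (K = 32 int8 columns) is packed
# across SIMD lanes. Illustrative sketch only, not an IGC API.

def elems_per_lane(k_cols: int, simd_width: int) -> int:
    """Number of i8 columns of 'a' each SIMD lane must hold."""
    return k_cols // simd_width

def packed_type_bits(k_cols: int, simd_width: int, elem_bits: int = 8) -> int:
    """Bit width of the packed scalar each lane carries."""
    return elems_per_lane(k_cols, simd_width) * elem_bits

# PVC native SIMD16: 32 / 16 = 2 i8 columns per lane, packed into an i16
assert elems_per_lane(32, 16) == 2
assert packed_type_bits(32, 16) == 16   # matches the ...v8i16... operand A

# SIMD8 (e.g., ATS): 32 / 8 = 4 i8 columns per lane, packed into an i32
assert elems_per_lane(32, 8) == 4
assert packed_type_bits(32, 8) == 32    # matches the ...v4i32... operand A
```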
`@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32` may happen to work on ATS under SIMD16, since ATS has a dpas.w instruction that pairs two EUs to perform a SIMD16 dpas. However, the layout differs from PVC's native SIMD16: with dpas.w, the first EU operates on the first 32 columns of 'a' and the second EU on the last 32 columns, and each SIMD lane still packs 4 columns of 'a' instead of PVC's 2.
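A rough model of the lane-to-column mappings described above. The functions are hypothetical, and the exact ordering of columns within a lane is an assumption made for illustration:

```python
# Which columns of operand 'a' each SIMD lane holds, per the layouts
# described above. Illustrative only; not an IGC data structure.

def pvc_native_simd16(lane):
    """PVC native SIMD16: each lane packs 2 consecutive i8 columns
    into an i16 (in-lane ordering assumed)."""
    return [2 * lane, 2 * lane + 1]

def ats_dpasw_simd16(lane):
    """ATS dpas.w: two paired SIMD8 EUs; each lane still packs 4
    columns into an i32 (in-lane ordering assumed)."""
    return [4 * lane + i for i in range(4)]

# PVC covers the 32 K-columns with 2 columns per lane across 16 lanes.
cols_pvc = [c for l in range(16) for c in pvc_native_simd16(l)]
assert cols_pvc == list(range(32))

# dpas.w: lanes 0-7 (first EU) cover the first 32 columns,
# lanes 8-15 (second EU) cover the last 32.
cols_eu0 = [c for l in range(8) for c in ats_dpasw_simd16(l)]
cols_eu1 = [c for l in range(8, 16) for c in ats_dpasw_simd16(l)]
assert cols_eu0 == list(range(32))
assert cols_eu1 == list(range(32, 64))
```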
Triton for Intel GPU will need to support two different DPAS layout versions for ATS and PVC.
Created https://github.com/intel/intel-xpu-backend-for-triton/issues/842 to track using OCL builtin for tf32 dpas.
The IGC team suggested using the OpenCL functions defined in https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html instead of using the GenISA intrinsics directly.
There are two sets of OpenCL `cl_intel_subgroup_matrix_multiply_accumulate` functions, differentiated by the type of the first argument: `int` types for devices with a minimum subgroup size of 8 (e.g., ATS), and `short` types for devices with a minimum subgroup size of 16 (e.g., PVC). Note: calling these functions on the wrong device, or from a kernel with a different subgroup size, is undefined behavior.

Let's take a subset of those functions as a lowering example. On PVC:
The OpenCL functions first lower to `__builtin_IB` functions, then to the GenISA intrinsics. If a SIMD8 OpenCL function is compiled for PVC, it would not be lowered to a GenISA intrinsic, e.g.:

On ATS: