intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

Investigate usage of OpenCL functions instead of GenISA for DPAS #474

whitneywhtsang closed this issue 6 months ago

whitneywhtsang commented 8 months ago

The IGC team suggested using the OpenCL functions defined in https://registry.khronos.org/OpenCL/extensions/intel/cl_intel_subgroup_matrix_multiply_accumulate.html, instead of using GenISA intrinsics directly.

There are two sets of OpenCL cl_intel_subgroup_matrix_multiply_accumulate functions, differentiated by the type of the first argument: int-based types for devices with minimum subgroup size 8 (e.g., ATS), and short-based types for devices with minimum subgroup size 16 (e.g., PVC). Note: calling these functions on the wrong devices, or from kernels with a different subgroup size, is undefined behavior.

Let's take a subset of those functions as lowering examples.

On PVC:

// 8-bit matrices:
int  intel_sub_group_i8_i8_matrix_mad_k32(short   a, int8  b, int  acc);  // M = 1
i32 @__builtin_IB_sub_group16_idpas_s8_s8_8_1(i32 noundef %acc, i16 noundef signext %a, <8 x i32> noundef %b)
i32 @llvm.genx.GenISA.sub.group.dpas.i32.i32.i16.v8i32(i32, i16, <8 x i32>, i32 4, i32 4, i32 8, i32 1, i1 false)

int2 intel_sub_group_i8_i8_matrix_mad_k32(short2  a, int8  b, int2 acc);  // M = 2
<2 x i32> @__builtin_IB_sub_group16_idpas_s8_s8_8_2(<2 x i32> noundef %acc, <2 x i16> noundef %a, <8 x i32> noundef %b) 
<2 x i32> @llvm.genx.GenISA.sub.group.dpas.v2i32.v2i32.v2i16.v8i32(<2 x i32>, <2 x i16>, <8 x i32>, i32 4, i32 4, i32 8, i32 2, i1 false)

int4 intel_sub_group_i8_i8_matrix_mad_k32(short4  a, int8  b, int4 acc);  // M = 4
<4 x i32> @__builtin_IB_sub_group16_idpas_s8_s8_8_4(<4 x i32> noundef %acc, <4 x i16> noundef %a, <8 x i32> noundef %b)
<4 x i32> @llvm.genx.GenISA.sub.group.dpas.v4i32.v4i32.v4i16.v8i32(<4 x i32>, <4 x i16>, <8 x i32>, i32 4, i32 4, i32 8, i32 4, i1 false)

int8 intel_sub_group_i8_i8_matrix_mad_k32(short8  a, int8  b, int8 acc);  // M = 8
<8 x i32> @__builtin_IB_sub_group16_idpas_s8_s8_8_8(<8 x i32> noundef %acc, <8 x i16> noundef %a, <8 x i32> noundef %b)
<8 x i32> @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32(<8 x i32>, <8 x i16>, <8 x i32>, i32 4, i32 4, i32 8, i32 8, i1 false)

The OpenCL functions are first lowered to __builtin_IB functions, then to GenISA intrinsics. If a SIMD8 OpenCL function is compiled for PVC, it is not lowered to a GenISA intrinsic, e.g.,

int8 intel_sub_group_i8_i8_matrix_mad_k32(int8  a, int8  b, int8 acc); // M = 8
<8 x i32> @__builtin_IB_sub_group_idpas_s8_s8_8_8(<8 x i32> noundef %40, <8 x i32> noundef %72, <8 x i32> noundef %56)
// Not lowered to GenISA intrinsic

On ATS:

int  intel_sub_group_i8_i8_matrix_mad_k32(int   a, int8  b, int  acc);  // M = 1
i32 @__builtin_IB_sub_group_idpas_s8_s8_8_1(i32 noundef %acc, i32 noundef %a, <8 x i32> noundef %b)
i32 @llvm.genx.GenISA.sub.group.dpas.i32.i32.i32.v8i32(i32, i32, <8 x i32>, i32 4, i32 4, i32 8, i32 1, i1 false)

int2 intel_sub_group_i8_i8_matrix_mad_k32(int2  a, int8  b, int2 acc);  // M = 2
<2 x i32> @__builtin_IB_sub_group_idpas_s8_s8_8_2(<2 x i32> noundef %acc, <2 x i32> noundef %a, <8 x i32> noundef %b)
<2 x i32> @llvm.genx.GenISA.sub.group.dpas.v2i32.v2i32.v2i32.v8i32(<2 x i32>, <2 x i32>, <8 x i32>, i32 4, i32 4, i32 8, i32 2, i1 false)

int4 intel_sub_group_i8_i8_matrix_mad_k32(int4  a, int8  b, int4 acc);  // M = 4
<4 x i32> @__builtin_IB_sub_group_idpas_s8_s8_8_4(<4 x i32> noundef %acc, <4 x i32> noundef %a, <8 x i32> noundef %b)
<4 x i32> @llvm.genx.GenISA.sub.group.dpas.v4i32.v4i32.v4i32.v8i32(<4 x i32>, <4 x i32>, <8 x i32>, i32 4, i32 4, i32 8, i32 4, i1 false)

int8 intel_sub_group_i8_i8_matrix_mad_k32(int8  a, int8  b, int8 acc);  // M = 8
<8 x i32> @__builtin_IB_sub_group_idpas_s8_s8_8_8(<8 x i32> noundef %acc, <8 x i32> noundef %a, <8 x i32> noundef %b)
<8 x i32> @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i32.v8i32(<8 x i32>, <8 x i32>, <8 x i32>, i32 4, i32 4, i32 8, i32 8, i1 false)
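
The name patterns in the PVC and ATS listings above are regular enough to generate mechanically. The following Python sketch reproduces the three stages (OpenCL signature, __builtin_IB name, GenISA intrinsic name) for the 8-bit k32 case only. The helper names `llvm_type` and `lowering_chain` are hypothetical and are not part of IGC or Triton; the patterns are inferred solely from the listings and may not hold for other data types or M values.

```python
# Hypothetical helpers for illustration; the naming patterns are copied from
# the PVC/ATS listings in this issue, not from any IGC or Triton API.

def llvm_type(elem: str, count: int) -> str:
    """Mangling suffix as it appears in the listings: scalar if count == 1."""
    return elem if count == 1 else f"v{count}{elem}"


def lowering_chain(m: int, subgroup_size: int) -> dict:
    """Return the names of the three lowering stages for the i8 x i8, k32 case."""
    if m not in (1, 2, 4, 8):
        raise ValueError("the listings only cover M in {1, 2, 4, 8}")
    if subgroup_size == 16:
        # PVC-style: operand a is short-based, builtin uses the sub_group16 infix.
        ocl_a, a_elem, infix = "short", "i16", "sub_group16"
    elif subgroup_size == 8:
        # ATS-style: operand a is int-based, builtin uses the sub_group infix.
        ocl_a, a_elem, infix = "int", "i32", "sub_group"
    else:
        raise ValueError("only subgroup sizes 8 and 16 appear in the listings")

    vec = "" if m == 1 else str(m)  # OpenCL vector width suffix (int, int2, ...)
    acc_t = llvm_type("i32", m)
    a_t = llvm_type(a_elem, m)
    return {
        "opencl": f"int{vec} intel_sub_group_i8_i8_matrix_mad_k32"
                  f"({ocl_a}{vec} a, int8 b, int{vec} acc)",
        "builtin": f"__builtin_IB_{infix}_idpas_s8_s8_8_{m}",
        "genisa": f"llvm.genx.GenISA.sub.group.dpas.{acc_t}.{acc_t}.{a_t}.v8i32",
    }


if __name__ == "__main__":
    for size in (16, 8):
        for m in (1, 2, 4, 8):
            print(size, m, lowering_chain(m, size))
```

Comparing the output for subgroup sizes 16 and 8 shows that, for a given M, the only difference in the GenISA name is the operand-a element type (i16 versus i32).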

whitneywhtsang commented 8 months ago

Since https://github.com/intel/llvm/commit/40a18fa33230a92566dc348aec70da1e253f65ca, we lower to @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32 instead of @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32. The motivation is better code generation when IGC performs vectorization.

@llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32 cannot be generated by using OpenCL functions, as int8 intel_sub_group_i8_i8_matrix_mad_k32(int8 a, int8 b, int8 acc); is not supported on PVC.

With https://github.com/intel/intel-xpu-backend-for-triton/pull/356, a number of DPAS test cases pass with @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32 on PVC. Is using @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32 on PVC undefined behavior? If it is, then we need to switch back to lowering to @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32 and improve the code generation in a different way, e.g., by having more control over IGC's vectorization optimization. Alternatively, can IGC be changed to allow the v4i32 type for operand A on PVC?

pengtu commented 8 months ago

The @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v8i16.v8i32 intrinsic requires a register layout that packs 2 columns of operand 'a' into an i16 (short) in each SIMD lane. With a SIMD execution width of 16, the number of elements to pack into one SIMD lane is 32/16 = 2. The @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32 intrinsic requires a register layout that packs 4 columns of operand 'a' into an i32, which only works with SIMD8, since the number of elements to pack into one SIMD lane is 32/8 = 4.
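
The packing arithmetic in the comment above can be written as a small Python sketch. The function name `a_operand_packing` and its parameters are hypothetical and used only for illustration; it assumes 8-bit elements and K = 32 columns of operand 'a', as in the i8 examples of this issue.

```python
# Hypothetical helper for illustration only; not an IGC or Triton API.

def a_operand_packing(simd_width: int, k: int = 32, elem_bits: int = 8):
    """Return (columns of 'a' per SIMD lane, per-lane integer type).

    Assumes the K columns of one row of 'a' are split evenly across lanes.
    """
    if simd_width <= 0 or k % simd_width != 0:
        raise ValueError("K must be evenly divisible by the SIMD width")
    elems_per_lane = k // simd_width        # e.g. 32 / 16 = 2, 32 / 8 = 4
    lane_bits = elems_per_lane * elem_bits  # bits needed to hold them
    return elems_per_lane, f"i{lane_bits}"


if __name__ == "__main__":
    print("SIMD16:", a_operand_packing(16))  # 2 columns packed into an i16
    print("SIMD8: ", a_operand_packing(8))   # 4 columns packed into an i32
```

This only restates the 32/16 = 2 and 32/8 = 4 reasoning; it says nothing about which columns land in which lane.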

The @llvm.genx.GenISA.sub.group.dpas.v8i32.v8i32.v4i32.v8i32 intrinsic may happen to work on ATS under SIMD16, since ATS has a dpas.w instruction, which pairs two EUs to perform a SIMD16 dpas. However, the layout is different from PVC's native SIMD16. With dpas.w, the first EU operates on the first 32 columns of 'a' and the second EU on the last 32 columns of 'a'; each SIMD lane still packs 4 columns of 'a' instead of PVC's 2.

Triton for Intel GPU will need to support two different DPAS layout versions for ATS and PVC.

whitneywhtsang commented 6 months ago

Created https://github.com/intel/intel-xpu-backend-for-triton/issues/842 to track using OCL builtin for tf32 dpas.