intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs

[DPAS Layout] There is an issue broadcasting a vector to a matrix along the row dim when sub-group-size is 32. #781

Closed chengjunlu closed 7 months ago

chengjunlu commented 7 months ago

There is a case in tt_dot that uses tt.broadcast to make a matrix from a vector. The IR looks like this:

#mma = #triton_intel_gpu.dpas<{repeatCount = 8, systolicDepth = 8, executionSize = 16, opsPerChan = 2, threadsPerWarp = 32, warpsPerCTA = [2, 2], A = [8, 16], B = [16, 16], C = [8, 16]}>
%36 = tt.broadcast %35 : tensor<64x1xf32, #mma> -> tensor<64x64xf32, #mma> loc(#loc19)

There is no issue when threadsPerWarp=16: we can simply replicate the name in SIMT with the corresponding coordinate to broadcast it naturally. But with threadsPerWarp=32, a single name in SIMT represents 2 rows, so it is not trivial to reuse the same name in SIMT. (Maybe we need to use a sub-group shuffle to move the value to the upper lanes.)
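To make the "one name covers two rows" point concrete, here is a minimal standalone sketch of how I read the C-tile mapping for the #mma parameters above (repeatCount=8, executionSize=16). The lane-to-coordinate arithmetic is my assumption for illustration, not code from the backend:

```cpp
#include <cstdio>

constexpr int repeatCount = 8;    // rows of the DPAS C tile
constexpr int executionSize = 16; // cols of the DPAS C tile

// Each SIMT name holds one scalar per lane; with executionSize lanes per row,
// a single name covers threadsPerWarp / executionSize rows.
void dumpMapping(int threadsPerWarp) {
  int rowsPerName = threadsPerWarp / executionSize;
  int numNames = repeatCount / rowsPerName;
  std::printf("threadsPerWarp=%d: %d row(s) per name, %d names per C tile\n",
              threadsPerWarp, rowsPerName, numNames);
  for (int name = 0; name < numNames; ++name)
    for (int lane = 0; lane < threadsPerWarp; lane += executionSize) {
      // Only print one lane per covered row to keep the output small.
      int row = name * rowsPerName + lane / executionSize;
      int col = lane % executionSize;
      std::printf("  name %d, lane %2d -> row %d, col %d\n", name, lane, row, col);
    }
}

int main() {
  dumpMapping(16); // every name is exactly one row
  dumpMapping(32); // every name spans two rows: lanes 16..31 sit on the next row
  return 0;
}
```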

We need to fix this if we want to use threadsPerWarp=32 on PVC for DPAS.
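For reference, a purely illustrative SYCL sketch of the sub-group-shuffle idea mentioned above: the upper 16 lanes fetch their column's value from the corresponding lower lane. This is not the backend's lowering (which would emit shuffle intrinsics directly), and the lane arithmetic is an assumption:

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
  sycl::queue q;
  constexpr int sgSize = 32;
  float *out = sycl::malloc_shared<float>(sgSize, q);

  q.parallel_for(sycl::nd_range<1>{sgSize, sgSize},
                 [=](sycl::nd_item<1> it) [[sycl::reqd_sub_group_size(32)]] {
                   auto sg = it.get_sub_group();
                   unsigned lane = sg.get_local_linear_id();
                   // Pretend lanes 0..15 hold the broadcast value for row r,
                   // while lanes 16..31 belong to row r+1 and need the value
                   // from the same column of the lower half.
                   float v = (lane < 16) ? float(lane) : 0.0f;
                   out[lane] = sycl::select_from_group(sg, v, lane % 16);
                 })
      .wait();

  for (int i = 0; i < sgSize; ++i)
    std::printf("lane %2d -> %g\n", i, out[i]);
  sycl::free(out, q);
  return 0;
}
```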

chengjunlu commented 7 months ago

The root cause seems to be in the function getUniqueContigPerThread. When threadsPerWarp = 32, the DPAS layout is not contiguous on the column dimension; it is strided by 2 because one name now represents two rows.

There is the same potential issue for the TF32 operand A layout when threadsPerWarp=16; it needs to be fixed as well. A operand for TF32 with threadsPerWarp=16 (see image): the rows with the same color are represented by the same name in SIMT for the 8x8 A matrix, so each value is strided by 2 along the column dimension.

The callstack of the convert layout op lowering for the DPAS layout:

mlir::triton::gpu::intel::DpasEncodingAttr::getSizePerThread Dialect.cpp:111
mlir::triton::gpu::detail::DistributedEncodingTraitInterfaceTraits::Model<mlir::triton::gpu::intel::DpasEncodingAttr>::getSizePerThread TritonGPUAttrInterfaces.h.inc:276
mlir::triton::gpu::DistributedEncodingTrait::getSizePerThread TritonGPUAttrInterfaces.cpp.inc:35
mlir::triton::gpu::getSizePerThread Dialect.cpp:156
mlir::triton::gpu::getContigPerThread Dialect.cpp:186
mlir::triton::gpu::getUniqueContigPerThread Dialect.cpp:206
mlir::triton::gpu::getUniqueContigPerThread Dialect.cpp:198
mlir::triton::getScratchConfigForCvtLayout Allocation.cpp:122
ConvertLayoutOpConversion::lowerDistributedToDistributed ConvertLayoutOpToLLVM.cpp:538
ConvertLayoutOpConversion::matchAndRewrite ConvertLayoutOpToLLVM.cpp:169
mlir::ConvertOpToLLVMPattern::matchAndRewrite Pattern.h:161

We can see that getContigPerThread calls getSizePerThread for the contiguity information of the layout mapping. For threads_per_warp=16 it is true that the layout is contiguous on the column, just as the returned {elemsPerThread, 1} implies. But that is no longer true when threads_per_warp=32.
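As a sketch of the direction a fix could take (hypothetical helper names and struct, not the real DpasEncodingAttr/getContigPerThread code), the contiguity along the row dimension would have to collapse to 1 once a single name covers more than one row:

```cpp
#include <cstdio>
#include <vector>

// Illustrative stand-in for the DPAS layout parameters.
struct DpasLayoutInfo {
  unsigned repeatCount;    // rows of the C tile, e.g. 8
  unsigned executionSize;  // cols of the C tile, e.g. 16
  unsigned threadsPerWarp; // sub-group size, 16 or 32
};

// Mirrors the current behaviour described above: per-thread size is
// {elemsPerThread, 1}.
std::vector<unsigned> getSizePerThread(const DpasLayoutInfo &d) {
  unsigned elemsPerThread = d.repeatCount * d.executionSize / d.threadsPerWarp;
  return {elemsPerThread, 1};
}

// Sketch of the fix direction: contiguity along the row dimension collapses
// to 1 once one SIMT name covers more than one row.
std::vector<unsigned> getContigPerThread(const DpasLayoutInfo &d) {
  unsigned rowsPerName = d.threadsPerWarp / d.executionSize;
  if (rowsPerName <= 1)
    return getSizePerThread(d); // SIMD16: the per-thread column strip is contiguous
  return {1, 1};                // SIMD32: per-thread rows are strided by rowsPerName
}

int main() {
  for (unsigned tpw : {16u, 32u}) {
    DpasLayoutInfo d{8, 16, tpw};
    auto s = getSizePerThread(d), c = getContigPerThread(d);
    std::printf("threadsPerWarp=%u: sizePerThread={%u,%u} contigPerThread={%u,%u}\n",
                tpw, s[0], s[1], c[0], c[1]);
  }
  return 0;
}
```

With a contiguity like this, getScratchConfigForCvtLayout would no longer assume a contiguous per-thread column strip when sizing the scratch buffer for the convert-layout lowering.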