qedawkins opened 2 months ago
Thanks @qedawkins. Most of this makes sense. I'll make time tomorrow to walk through more if you have the time. The only part that is a bit sketchy for me is the "fuses parallel loops" step. If they are really fusable, you should be able to tile + fuse. For example, your end-state code looks very similar to what you get from tile and fuse. Not to say we never need "fusion of loops", but more that we haven't needed it so far...
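For concreteness, the tile + fuse path looks something like the following schedule (a hand-written sketch; the matchers, tile sizes, and mapping here are placeholders, not the spec from the gists):

```mlir
transform.sequence failures(propagate) {
^bb0(%func: !transform.any_op):
  // Match the matmul and the producer copy in the payload function.
  %matmul = transform.structured.match ops{["linalg.matmul"]} in %func
    : (!transform.any_op) -> !transform.any_op
  // Tile the matmul to a thread-mapped scf.forall.
  %tiled:2 = transform.structured.tile_using_forall %matmul tile_sizes [16, 16]
      (mapping = [#gpu.thread<y>, #gpu.thread<x>])
    : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
  // Re-match the forall rather than relying on result ordering above.
  %forall = transform.structured.match ops{["scf.forall"]} in %func
    : (!transform.any_op) -> !transform.any_op
  %copy = transform.structured.match ops{["linalg.copy"]} in %func
    : (!transform.any_op) -> !transform.any_op
  // Pull the producer copy into the thread-mapped loop.
  %fused:2 = transform.structured.fuse_into_containing_op %copy into %forall
    : (!transform.any_op, !transform.any_op) -> (!transform.any_op, !transform.any_op)
}
```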
Here is a dump-after-all with the current branch + spec above: https://gist.github.com/qedawkins/953b4e9da86ad48c94b978323f2b39ae. The key IR we're trying to get to is the following:
```mlir
func.func @main() {
  %c32 = arith.constant 32 : index
  %c2 = arith.constant 2 : index
  %c1 = arith.constant 1 : index
  %c64 = arith.constant 64 : index
  %c4 = arith.constant 4 : index
  %c128 = arith.constant 128 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<128x128xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<128x128xf32>>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readwrite:tensor<128x128xf32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [128, 128], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<128x128xf32>> -> tensor<128x128xf32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [128, 128], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<128x128xf32>> -> tensor<128x128xf32>
  %5 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [128, 128], strides = [1, 1] : !flow.dispatch.tensor<readwrite:tensor<128x128xf32>> -> tensor<128x128xf32>
  %6 = tensor.empty() : tensor<128x4xf32>
  %7 = tensor.empty() : tensor<4x128xf32>
  %8 = scf.forall (%arg0, %arg1) in (8, 8) shared_outs(%arg2 = %5) -> (tensor<128x128xf32>) {
    %9 = affine.apply affine_map<(d0) -> (d0 * 16)>(%arg0)
    %10 = affine.apply affine_map<(d0) -> (d0 * 16)>(%arg1)
    %extracted_slice = tensor.extract_slice %arg2[%9, %10] [16, 16] [1, 1] : tensor<128x128xf32> to tensor<16x16xf32>
    %11 = affine.apply affine_map<(d0, d1) -> (d0 * 8 + d1)>(%arg0, %arg1)
    %12:2 = affine.delinearize_index %11 into (%c64, %c1) : index, index
    %13 = affine.apply affine_map<(d0) -> (d0 * 2)>(%12#0)
    %14 = affine.apply affine_map<(d0) -> (d0 * 4)>(%12#1)
    %extracted_slice_0 = tensor.extract_slice %6[%13, %14] [2, 4] [1, 1] : tensor<128x4xf32> to tensor<2x4xf32>
    %15:2 = affine.delinearize_index %11 into (%c2, %c32) : index, index
    %16 = affine.apply affine_map<(d0) -> (d0 * 2)>(%15#0)
    %17 = affine.apply affine_map<(d0) -> (d0 * 4)>(%15#1)
    %extracted_slice_1 = tensor.extract_slice %7[%16, %17] [2, 4] [1, 1] : tensor<4x128xf32> to tensor<2x4xf32>
    %18 = scf.for %arg3 = %c0 to %c128 step %c4 iter_args(%arg4 = %extracted_slice) -> (tensor<16x16xf32>) {
      %19 = affine.apply affine_map<(d0)[s0] -> (d0 * 4 + s0)>(%12#1)[%arg3]
      %extracted_slice_2 = tensor.extract_slice %3[%13, %19] [2, 4] [1, 1] : tensor<128x128xf32> to tensor<2x4xf32>
      %20 = linalg.copy ins(%extracted_slice_2 : tensor<2x4xf32>) outs(%extracted_slice_0 : tensor<2x4xf32>) -> tensor<2x4xf32>
      %21 = iree_gpu.shuffle_tensor %20[%13, %14] [2, 4] [1, 1] to %6 [%9, 0] [16, 4] [1, 1] : tensor<2x4xf32> -> tensor<128x4xf32> -> tensor<16x4xf32>
      %22 = affine.apply affine_map<(d0)[s0] -> (d0 * 2 + s0)>(%15#0)[%arg3]
      %extracted_slice_3 = tensor.extract_slice %4[%22, %17] [2, 4] [1, 1] : tensor<128x128xf32> to tensor<2x4xf32>
      %23 = linalg.copy ins(%extracted_slice_3 : tensor<2x4xf32>) outs(%extracted_slice_1 : tensor<2x4xf32>) -> tensor<2x4xf32>
      %24 = iree_gpu.shuffle_tensor %23[%16, %17] [2, 4] [1, 1] to %7 [0, %10] [4, 16] [1, 1] : tensor<2x4xf32> -> tensor<4x128xf32> -> tensor<4x16xf32>
      %25 = linalg.matmul ins(%21, %24 : tensor<16x4xf32>, tensor<4x16xf32>) outs(%arg4 : tensor<16x16xf32>) -> tensor<16x16xf32>
      scf.yield %25 : tensor<16x16xf32>
    }
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %18 into %arg2[%9, %10] [16, 16] [1, 1] : tensor<16x16xf32> into tensor<128x128xf32>
    }
  } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
  flow.dispatch.tensor.store %8, %2, offsets = [0, 0], sizes = [128, 128], strides = [1, 1] : tensor<128x128xf32> -> !flow.dispatch.tensor<readwrite:tensor<128x128xf32>>
  return
}
```
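To read the `iree_gpu.shuffle_tensor` ops above (a rough gloss of the intended semantics, not a formal op spec): each thread inserts its small tile into an intermediate tensor that will become the shared memory allocation, and reads back a differently-distributed slice, with workgroup synchronization implied at the boundary. Taking the LHS shuffle from the loop body:

```mlir
// %20    : this thread's 2x4 LHS tile, written at offset [%13, %14]
// %6     : the 128x4 intermediate (the eventual shared memory allocation)
// result : the 16x4 slice at [%9, 0] that this thread's matmul tile consumes
%21 = iree_gpu.shuffle_tensor %20[%13, %14] [2, 4] [1, 1]
        to %6 [%9, 0] [16, 4] [1, 1]
      : tensor<2x4xf32> -> tensor<128x4xf32> -> tensor<16x4xf32>
```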
I updated the branch to enable generating MFMA ops as well with this spec: https://gist.github.com/qedawkins/334c6bce944c6b860066ca873e1388d2
I'm going to start landing some of the transform ops used in the above spec.
Overview
Currently there are four main ways to generate code for matmuls across the LLVMGPU and SPIR-V backends:

1. Promotes operands with `bufferization.alloc_tensor`, and distributes the copies introduced by bufferization with `GPUDistributeSharedMemoryCopy`.
2. Promotes operands with `bufferization.alloc_tensor`, and does early bufferization and hard codes the thread layouts for mma.sync.
3. Promotes operands with `bufferization.alloc_tensor`, but accesses the copies with vector transfers, then distributes block-level vector code directly to threads.
4. Pads operands with `tensor.pad`, and distributes the copies and matmul with their own `scf.forall`.

(Note: this issue excludes cooperative matrix on SPIR-V due to the specific requirements there.)
All of these strategies have notable pros and cons with respect to how well they handle fusions and how well they target different accelerated instructions. The focus of this issue is the ability to do producer fusions. Of the approaches above, the only one that handles producer fusions robustly is 4), because the tiling + distribution of the producers is planned completely separately from the tiling + distribution of the matmul. The shared memory allocation between the two is what bridges the difference in distribution across the workgroup. This is especially relevant when trying to do something like implicit GEMM for NCHW convolutions: the distribution of `im2col` needs to be decided based on the way the kernel accesses data, not on how it is used in the core computation.

Proof of Concept
The following branch is a proof of concept for a way to organize convolution and matmul codegen for GPU targets based partially on 4) but with changes to fix some of the cons listed above: https://github.com/qedawkins/iree/tree/igemm. An accompanying script can be found here: https://gist.github.com/qedawkins/ee0ca928634b5533b591ce804fa5e080
The experimental strategy here also tiles the producers of the matmul on their own (in this case manually introduced copies); however, instead of waiting until after bufferization and loop distribution to "fix up" the iterator type of the reduction loop, all of those parallel loops within the body of the `scf.for` are fused into one, allowing further hoisting out of the loop and fusion of consumers.

After fusion of parallel loops
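A minimal hand-written sketch of what this fusion does (illustrative only, not IR from the branch): two thread-mapped `scf.forall` ops with matching trip counts and mappings merge into a single multi-result loop.

```mlir
// Before: two independent thread-distributed copies with identical mappings.
func.func @separate(%a: tensor<64xf32>, %b: tensor<64xf32>) -> (tensor<64xf32>, tensor<64xf32>) {
  %e0 = tensor.empty() : tensor<64xf32>
  %e1 = tensor.empty() : tensor<64xf32>
  %0 = scf.forall (%i) in (64) shared_outs(%o = %e0) -> (tensor<64xf32>) {
    %s = tensor.extract_slice %a[%i] [1] [1] : tensor<64xf32> to tensor<1xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %s into %o[%i] [1] [1] : tensor<1xf32> into tensor<64xf32>
    }
  } {mapping = [#gpu.thread<x>]}
  %1 = scf.forall (%i) in (64) shared_outs(%o = %e1) -> (tensor<64xf32>) {
    %s = tensor.extract_slice %b[%i] [1] [1] : tensor<64xf32> to tensor<1xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %s into %o[%i] [1] [1] : tensor<1xf32> into tensor<64xf32>
    }
  } {mapping = [#gpu.thread<x>]}
  return %0, %1 : tensor<64xf32>, tensor<64xf32>
}

// After: one fused loop produces both results. In the real pipeline this
// single parallel region sits inside the reduction scf.for, so it can be
// hoisted out of the loop and consumers fused into it.
func.func @fused(%a: tensor<64xf32>, %b: tensor<64xf32>) -> (tensor<64xf32>, tensor<64xf32>) {
  %e0 = tensor.empty() : tensor<64xf32>
  %e1 = tensor.empty() : tensor<64xf32>
  %0:2 = scf.forall (%i) in (64) shared_outs(%o0 = %e0, %o1 = %e1) -> (tensor<64xf32>, tensor<64xf32>) {
    %s0 = tensor.extract_slice %a[%i] [1] [1] : tensor<64xf32> to tensor<1xf32>
    %s1 = tensor.extract_slice %b[%i] [1] [1] : tensor<64xf32> to tensor<1xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %s0 into %o0[%i] [1] [1] : tensor<1xf32> into tensor<64xf32>
      tensor.parallel_insert_slice %s1 into %o1[%i] [1] [1] : tensor<1xf32> into tensor<64xf32>
    }
  } {mapping = [#gpu.thread<x>]}
  return %0#0, %0#1 : tensor<64xf32>, tensor<64xf32>
}
```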
The current rough outline for the pipeline is as follows:

1. Tile the reduction to an `scf.for` and greedily fuse all consumers of the loop.
2. Distribute `scf.forall` ops and perform late-stage, target-specific lowerings.

With this approach, the only difference for convolution is the presence of an `im2col` operation introduced between steps 1 and 2. The `im2col` operation will just need to implement the tiling interface and have a way to decompose to linalg/vector ops once tiled to threads.
Tasks

A list of rough tasks, some of which are already done in the above or started elsewhere.
Shared

- Fusing parallel `scf.forall` ops.
- A `shuffle_tensor` operation to bridge the boundary between fused `scf.forall` ops.
- Hoisting `scf.forall` loops out of an enclosing `scf.for` loop (see the sketch after this list).
- Remapping an `scf.forall` from one thread mapping + count to another.
- Lowering of `shuffle_tensor` and possibly more formal pipeline stages.
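For the hoisting item, a minimal hand-written sketch of the intended rewrite (shapes and the reduction body are placeholders): the thread-mapped loop moves outside the serial loop, and the `scf.for` becomes a per-thread loop over a sliced iteration argument.

```mlir
// Before: a thread-mapped scf.forall re-created on every reduction step.
func.func @before_hoist(%a: tensor<32x16xf32>, %init: tensor<16xf32>) -> tensor<16xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c32 = arith.constant 32 : index
  %r = scf.for %k = %c0 to %c32 step %c1 iter_args(%acc = %init) -> (tensor<16xf32>) {
    %s = scf.forall (%t) in (16) shared_outs(%o = %acc) -> (tensor<16xf32>) {
      %x = tensor.extract %a[%k, %t] : tensor<32x16xf32>
      %y = tensor.extract %o[%t] : tensor<16xf32>
      %z = arith.addf %x, %y : f32
      %elt = tensor.from_elements %z : tensor<1xf32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %elt into %o[%t] [1] [1] : tensor<1xf32> into tensor<16xf32>
      }
    } {mapping = [#gpu.thread<x>]}
    scf.yield %s : tensor<16xf32>
  }
  return %r : tensor<16xf32>
}

// After: the scf.forall encloses a per-thread scf.for over a 1-element slice.
func.func @after_hoist(%a: tensor<32x16xf32>, %init: tensor<16xf32>) -> tensor<16xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c32 = arith.constant 32 : index
  %r = scf.forall (%t) in (16) shared_outs(%o = %init) -> (tensor<16xf32>) {
    %slice = tensor.extract_slice %o[%t] [1] [1] : tensor<16xf32> to tensor<1xf32>
    %sum = scf.for %k = %c0 to %c32 step %c1 iter_args(%acc = %slice) -> (tensor<1xf32>) {
      %x = tensor.extract %a[%k, %t] : tensor<32x16xf32>
      %y = tensor.extract %acc[%c0] : tensor<1xf32>
      %z = arith.addf %x, %y : f32
      %n = tensor.from_elements %z : tensor<1xf32>
      scf.yield %n : tensor<1xf32>
    }
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %sum into %o[%t] [1] [1] : tensor<1xf32> into tensor<16xf32>
    }
  } {mapping = [#gpu.thread<x>]}
  return %r : tensor<16xf32>
}
```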
Convolution

- A `linalg_ext.im2col` op for NCHW convolutions to avoid gather vectorization.
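For context on "avoid gather vectorization": the reference semantics of an NCHW im2col can be written as a gather, i.e. a `linalg.generic` with a `tensor.extract` in its body (sizes below are arbitrary, and this is a hand-written reference, not the proposed `linalg_ext.im2col` definition). Vectorizing this form produces scattered loads, which is exactly what a dedicated op that tiles and then decomposes to contiguous copies is meant to avoid.

```mlir
// Gather-style reference: out[n, k, m] = in[n, c, oh + kh, ow + kw] with
// k = c * 9 + kh * 3 + kw (3x3 kernel) and m = oh * 6 + ow (6x6 output).
#map = affine_map<(n, k, m) -> (n, k, m)>
func.func @im2col_ref(%in: tensor<1x4x8x8xf32>) -> tensor<1x36x36xf32> {
  %empty = tensor.empty() : tensor<1x36x36xf32>
  %out = linalg.generic
      {indexing_maps = [#map], iterator_types = ["parallel", "parallel", "parallel"]}
      outs(%empty : tensor<1x36x36xf32>) {
  ^bb0(%unused: f32):
    %c3 = arith.constant 3 : index
    %c6 = arith.constant 6 : index
    %c9 = arith.constant 9 : index
    %n = linalg.index 0 : index
    %k = linalg.index 1 : index
    %m = linalg.index 2 : index
    %c = arith.divui %k, %c9 : index
    %kr = arith.remui %k, %c9 : index
    %kh = arith.divui %kr, %c3 : index
    %kw = arith.remui %kr, %c3 : index
    %oh = arith.divui %m, %c6 : index
    %ow = arith.remui %m, %c6 : index
    %h = arith.addi %oh, %kh : index
    %w = arith.addi %ow, %kw : index
    // The data-dependent extract is what vectorizes to a gather.
    %val = tensor.extract %in[%n, %c, %h, %w] : tensor<1x4x8x8xf32>
    linalg.yield %val : f32
  } -> tensor<1x36x36xf32>
  return %out : tensor<1x36x36xf32>
}
```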