[Tiling] Packing for convolution

newling commented 2 months ago

Use packing between L2 and L1 for convolution.

Using upstream MLIR packing I get the following.

func.func @conv_2d_nhwc_hwcf_dispatch_0_conv_2d_nhwc_hwcf_2x12x12x64x3x3x32_i32() attributes {translation_info = #translation} {
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c3 = arith.constant 3 : index
  %c0_i32 = arith.constant 0 : i32
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<2x14x14x32xi32>>
  %1 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<3x3x32x64xi32>>
  %2 = hal.interface.binding.subspan layout(#pipeline_layout) set(0) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [2, 14, 14, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<2x14x14x32xi32>> -> tensor<2x14x14x32xi32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [3, 3, 32, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<3x3x32x64xi32>> -> tensor<3x3x32x64xi32>
  %5 = flow.dispatch.tensor.load %2, offsets = [0, 0, 0, 0], sizes = [2, 12, 12, 64], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>> -> tensor<2x12x12x64xi32>
  %6 = scf.forall (%arg0, %arg1, %arg2) = (0, 0, 0) to (12, 12, 64) step (4, 4, 4) shared_outs(%arg3 = %5) -> (tensor<2x12x12x64xi32>) {
    %extracted_slice = tensor.extract_slice %3[0, %arg0, %arg1, 0] [2, 6, 6, 32] [1, 1, 1, 1] : tensor<2x14x14x32xi32> to tensor<2x6x6x32xi32>
    %extracted_slice_0 = tensor.extract_slice %4[0, 0, 0, %arg2] [3, 3, 32, 4] [1, 1, 1, 1] : tensor<3x3x32x64xi32> to tensor<3x3x32x4xi32>
    %extracted_slice_1 = tensor.extract_slice %arg3[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : tensor<2x12x12x64xi32> to tensor<2x4x4x4xi32>
    %7 = bufferization.alloc_tensor() : tensor<2x6x6x32xi32>
    %alloc = memref.alloc() : memref<2x6x6x32xi32, 1 : i32>
    %8 = bufferization.to_tensor %alloc restrict writable : memref<2x6x6x32xi32, 1 : i32>
    %9 = linalg.copy ins(%extracted_slice : tensor<2x6x6x32xi32>) outs(%8 : tensor<2x6x6x32xi32>) -> tensor<2x6x6x32xi32>
    %10 = bufferization.alloc_tensor() : tensor<3x3x32x4xi32>
    %alloc_2 = memref.alloc() : memref<3x3x32x4xi32, 1 : i32>
    %11 = bufferization.to_tensor %alloc_2 restrict writable : memref<3x3x32x4xi32, 1 : i32>
    %12 = linalg.copy ins(%extracted_slice_0 : tensor<3x3x32x4xi32>) outs(%11 : tensor<3x3x32x4xi32>) -> tensor<3x3x32x4xi32>
    %13 = bufferization.alloc_tensor() : tensor<2x4x4x4xi32>
    %alloc_3 = memref.alloc() : memref<2x4x4x4xi32, 1 : i32>
    %14 = bufferization.to_tensor %alloc_3 restrict writable : memref<2x4x4x4xi32, 1 : i32>
    %15 = scf.forall (%arg4, %arg5, %arg6, %arg7) = (0, 0, 0, 0) to (2, 4, 4, 4) step (1, 1, 4, 4) shared_outs(%arg8 = %14) -> (tensor<2x4x4x4xi32>) {
      %extracted_slice_4 = tensor.extract_slice %9[%arg4, %arg5, %arg6, 0] [1, 3, 6, 32] [1, 1, 1, 1] : tensor<2x6x6x32xi32> to tensor<1x3x6x32xi32>
      %extracted_slice_5 = tensor.extract_slice %12[0, 0, 0, %arg7] [3, 3, 32, 4] [1, 1, 1, 1] : tensor<3x3x32x4xi32> to tensor<3x3x32x4xi32>
      %extracted_slice_6 = tensor.extract_slice %arg8[%arg4, %arg5, %arg6, %arg7] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<2x4x4x4xi32> to tensor<1x1x4x4xi32>
      %alloc_7 = memref.alloc() : memref<1x3x4x6x8xi32, 2 : i32>
      %17 = bufferization.to_tensor %alloc_7 restrict writable : memref<1x3x4x6x8xi32, 2 : i32>
      %pack = tensor.pack %extracted_slice_4 outer_dims_perm = [0, 1, 3, 2] inner_dims_pos = [3] inner_tiles = [8] into %17 : tensor<1x3x6x32xi32> -> tensor<1x3x4x6x8xi32>
      %alloc_8 = memref.alloc() : memref<3x3x4x1x8x4xi32, 2 : i32>
      %18 = bufferization.to_tensor %alloc_8 restrict writable : memref<3x3x4x1x8x4xi32, 2 : i32>
      %pack_9 = tensor.pack %extracted_slice_5 outer_dims_perm = [0, 1, 2, 3] inner_dims_pos = [2, 3] inner_tiles = [8, 4] into %18 : tensor<3x3x32x4xi32> -> tensor<3x3x4x1x8x4xi32>
      %alloc_10 = memref.alloc() : memref<1x1x4x1x4xi32, 2 : i32>
      %19 = bufferization.to_tensor %alloc_10 restrict writable : memref<1x1x4x1x4xi32, 2 : i32>
      %20 = linalg.fill ins(%c0_i32 : i32) outs(%19 : tensor<1x1x4x1x4xi32>) -> tensor<1x1x4x1x4xi32>
      %21 = scf.for %arg9 = %c0 to %c3 step %c1 iter_args(%arg10 = %20) -> (tensor<1x1x4x1x4xi32>) {
        %22 = scf.for %arg11 = %c0 to %c3 step %c1 iter_args(%arg12 = %arg10) -> (tensor<1x1x4x1x4xi32>) {
          %23 = scf.for %arg13 = %c0 to %c4 step %c1 iter_args(%arg14 = %arg12) -> (tensor<1x1x4x1x4xi32>) {
            %extracted_slice_11 = tensor.extract_slice %pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
            %extracted_slice_12 = tensor.extract_slice %pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
            %24 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction", "parallel", "reduction"]} ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>) attrs =  {lowering_config = #config, packing_config = #packingConfig} {
            ^bb0(%in: i32, %in_13: i32, %out: i32):
              %25 = arith.muli %in, %in_13 : i32
              %26 = arith.addi %out, %25 : i32
              linalg.yield %26 : i32
            } -> tensor<1x1x4x1x4xi32>
            scf.yield %24 : tensor<1x1x4x1x4xi32>
          }
          scf.yield %23 : tensor<1x1x4x1x4xi32>
        }
        scf.yield %22 : tensor<1x1x4x1x4xi32>
      }
      %unpack = tensor.unpack %21 inner_dims_pos = [3] inner_tiles = [4] into %extracted_slice_6 : tensor<1x1x4x1x4xi32> -> tensor<1x1x4x4xi32>
      memref.dealloc %alloc_7 : memref<1x3x4x6x8xi32, 2 : i32>
      memref.dealloc %alloc_8 : memref<3x3x4x1x8x4xi32, 2 : i32>
      memref.dealloc %alloc_10 : memref<1x1x4x1x4xi32, 2 : i32>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %unpack into %arg8[%arg4, %arg5, %arg6, %arg7] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<1x1x4x4xi32> into tensor<2x4x4x4xi32>
      }
    } {mapping = [#gpu.thread<y>, #gpu.thread<x>, #gpu.thread<z>, #gpu.thread<linear_dim_0>]}
    %16 = linalg.copy ins(%15 : tensor<2x4x4x4xi32>) outs(%extracted_slice_1 : tensor<2x4x4x4xi32>) -> tensor<2x4x4x4xi32>
    memref.dealloc %alloc : memref<2x6x6x32xi32, 1 : i32>
    memref.dealloc %alloc_2 : memref<3x3x32x4xi32, 1 : i32>
    memref.dealloc %alloc_3 : memref<2x4x4x4xi32, 1 : i32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %16 into %arg3[0, %arg0, %arg1, %arg2] [2, 4, 4, 4] [1, 1, 1, 1] : tensor<2x4x4x4xi32> into tensor<2x12x12x64xi32>
    }
  } {mapping = [#gpu.block<y>, #gpu.block<x>, #gpu.block<z>]}
  flow.dispatch.tensor.store %6, %2, offsets = [0, 0, 0, 0], sizes = [2, 12, 12, 64], strides = [1, 1, 1, 1] : tensor<2x12x12x64xi32> -> !flow.dispatch.tensor<writeonly:tensor<2x12x12x64xi32>>
  return
}

So I currently think we don't need any modification to upstream MLIR.
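For anyone checking the packed types in the IR above by hand, here is a minimal Python sketch of how the result shape of tensor.pack follows from its attributes. The helper name `packed_shape` is made up for illustration and is not an MLIR API; it assumes the usual semantics (tile the dimensions named in inner_dims_pos, permute the outer dimensions with outer_dims_perm, then append the inner tiles).

```python
def packed_shape(shape, inner_dims_pos, inner_tiles, outer_dims_perm=None):
    """Hypothetical helper: result shape of a pack with the given attributes."""
    outer = list(shape)
    for pos, tile in zip(inner_dims_pos, inner_tiles):
        # Ceil-division; every tile divides evenly in this issue, so no padding.
        outer[pos] = -(-outer[pos] // tile)
    if outer_dims_perm is not None:
        # Result outer dim i is taken from source outer dim outer_dims_perm[i].
        outer = [outer[i] for i in outer_dims_perm]
    return outer + list(inner_tiles)


lhs = packed_shape([1, 3, 6, 32], [3], [8], [0, 1, 3, 2])
rhs = packed_shape([3, 3, 32, 4], [2, 3], [8, 4], [0, 1, 2, 3])
acc = packed_shape([1, 1, 4, 4], [3], [4])

print(lhs)  # [1, 3, 4, 6, 8]
print(rhs)  # [3, 3, 4, 1, 8, 4]
print(acc)  # [1, 1, 4, 1, 4]
```

These agree with the types of %pack, %pack_9 and the accumulator that is unpacked at the end of the inner scf.forall.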

I'll post a PR with my packing config later. It currently results in other issues cropping up (in air: out of tile memory. in objectFifo: some other crash).

yzhang93 commented 2 months ago

I think the main concern is whether the generated data layout can be vectorized.

And the linalg.generic's operands seem to be problematic to me. ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)

newling commented 2 months ago

I think the main concern is whether the generated data layout can be vectorized. And the linalg.generic's operands seem to be problematic to me. ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>) outs(%arg14 : tensor<1x1x4x1x4xi32>)

It looks ok to me, can you explain your concern? I think it's just a matmul with m=n=4 k=8 on contiguous slices of L1 allocations.

yzhang93 commented 2 months ago

I didn't see the indexing maps for the linalg.generic operands, so I'm not sure what has been done. Is there an implicit collapse of dimensions for tensor<1x1x1x1x8x4xi32>?

newling commented 2 months ago

Oops, here are the maps:

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1 + d4, d6, d2 + d5, d8)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d4, d5, d6, d3, d8, d7)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d2, d3, d7)>

As an aside, note that %extracted_slice_11 and %extracted_slice_12 are contiguous slices, which is exactly what our motivation for packing was in the first place. So that's good.
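The contiguity claim can be checked with a short Python sketch, assuming a dense row-major layout for the L1 allocations (an assumption; the IR does not state the layout explicitly). The helper names are hypothetical. The last check uses an unpacked 4x8 slice of tensor<1x3x6x32xi32> purely as a contrast; that slice does not appear in the IR.

```python
from itertools import product


def linear_offsets(shape, offsets, sizes):
    """Sorted row-major linear offsets touched by a unit-stride slice."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return sorted(
        sum((o + i) * s for o, i, s in zip(offsets, idx, strides))
        for idx in product(*(range(n) for n in sizes))
    )


def is_contiguous(shape, offsets, sizes):
    """True if the slice covers one unbroken range of linear offsets."""
    offs = linear_offsets(shape, offsets, sizes)
    return offs == list(range(offs[0], offs[0] + len(offs)))


# %arg9 and %arg11 run over 0..2, %arg13 over 0..3 (the three scf.for loops).
packed_ok = all(
    is_contiguous([1, 3, 4, 6, 8], [0, a9, a13, a11, 0], [1, 1, 1, 4, 8])
    and is_contiguous([3, 3, 4, 1, 8, 4], [a9, a11, a13, 0, 0, 0], [1, 1, 1, 1, 8, 4])
    for a9, a11, a13 in product(range(3), range(3), range(4))
)

# Contrast (hypothetical): the same 4x8 window taken from the unpacked input.
unpacked_ok = is_contiguous([1, 3, 6, 32], [0, 0, 0, 0], [1, 1, 4, 8])

print(packed_ok)    # True
print(unpacked_ok)  # False
```

Under that layout assumption, every slice read in the inner loop is a single run of 32 elements, whereas the unpacked window is four runs of 8 separated by gaps.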

Now let's look more closely at the linalg.generic (I'm going to try and convince you that it's just a matmul...).

Copying the inner loop from before in a more readable way:

%extracted_slice_11 = tensor.extract_slice 
                                    %pack[0, %arg9, %arg13, %arg11, 0] [1, 1, 1, 4, 8] [1, 1, 1, 1, 1] : 
                                    tensor<1x3x4x6x8xi32> to tensor<1x1x1x4x8xi32>
%extracted_slice_12 = tensor.extract_slice 
                                    %pack_9[%arg9, %arg11, %arg13, 0, 0, 0] [1, 1, 1, 1, 8, 4] [1, 1, 1, 1, 1, 1] : 
                                    tensor<3x3x4x1x8x4xi32> to tensor<1x1x1x1x8x4xi32>
%24 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
         iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
         outs(%arg14 : tensor<1x1x4x1x4xi32>) 
         attrs =  {lowering_config = #config, packing_config = #packingConfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
  %25 = arith.muli %in, %in_13 : i32
  %26 = arith.addi %out, %25 : i32
  linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>

What are the loop counts for each of the dimensions d0 through d8? Matching the dimensions to the tensors, we see:

d0: 1 (first dimension of %arg14 is size 1)
d1: 1 (second dimension of %arg14 is size 1)
d2: 4 (third dimension of %arg14 is size 4)
d3: 1 (fourth dimension of %arg14 is size 1)
d4: 1 (first dimension of %extracted_slice_12)
d5: 1 (second dimension of %extracted_slice_12)
d6: 1 (third dimension of %extracted_slice_12)
d7: 4 (final dimension of %arg14)
d8: 8 (final dimension of %extracted_slice_11)

What's interesting here is that d4 and d5 have loop count one, so they only ever take the value 0 and the sums d1 + d4 and d2 + d5 reduce to d1 and d2. So #map is actually a trivial map, because it is effectively just

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7, d8) -> (d0, d1, d6, d2, d8)>
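To back up the "it's just a matmul" claim, here is a minimal Python sketch that executes the linalg.generic body over the full 9-d iteration space using the three indexing maps and the loop counts listed above, then compares against a plain 4x8 by 8x4 matmul. The input values are random and purely illustrative; the names `map_lhs`, `map_rhs` and `map_out` are hypothetical stand-ins for #map, #map1 and #map2.

```python
from itertools import product
import random

loop_counts = (1, 1, 4, 1, 1, 1, 1, 4, 8)  # d0 .. d8


def map_lhs(d):  # #map
    return (d[0], d[1] + d[4], d[6], d[2] + d[5], d[8])


def map_rhs(d):  # #map1
    return (d[4], d[5], d[6], d[3], d[8], d[7])


def map_out(d):  # #map2
    return (d[0], d[1], d[2], d[3], d[7])


def indices(shape):
    return product(*(range(n) for n in shape))


random.seed(0)
lhs = {i: random.randint(-9, 9) for i in indices((1, 1, 1, 4, 8))}
rhs = {i: random.randint(-9, 9) for i in indices((1, 1, 1, 1, 8, 4))}
out = {i: 0 for i in indices((1, 1, 4, 1, 4))}  # zero-filled, like linalg.fill

# The generic body: out += lhs * rhs at the mapped positions.
for d in indices(loop_counts):
    out[map_out(d)] += lhs[map_lhs(d)] * rhs[map_rhs(d)]

# Reference: ordinary matmul with m = n = 4, k = 8.
a = [[lhs[(0, 0, 0, m, k)] for k in range(8)] for m in range(4)]
b = [[rhs[(0, 0, 0, 0, k, n)] for n in range(4)] for k in range(8)]
c = [[sum(a[m][k] * b[k][n] for k in range(8)) for n in range(4)] for m in range(4)]

matches = all(out[(0, 0, m, 0, n)] == c[m][n] for m in range(4) for n in range(4))
print(matches)  # True
```

Only d2, d7 and d8 ever take a non-zero value, and they play the roles of m, n and k respectively.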

newling commented 2 months ago

So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

newling commented 2 months ago

So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...

MaheshRavishankar commented 2 months ago

I think I understand what is happening here. Can you post the method you are using to drop the unit dimensions? There is an upstream method that allows you to drop unit dimensions and also control which dimensions are dropped. If you are using this op

%24 = linalg.generic {indexing_maps = [#map, #map1, #map2], 
         iterator_types = ["p", "p", "p", "p", "r", "r", "r", "p", "r"]} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x1x1x4x8xi32>, tensor<1x1x1x1x8x4xi32>)
         outs(%arg14 : tensor<1x1x4x1x4xi32>) 
         attrs =  {lowering_config = #config, packing_config = #packingConfig} {
^bb0(%in: i32, %in_13: i32, %out: i32):
  %25 = arith.muli %in, %in_13 : i32
  %26 = arith.addi %out, %25 : i32
  linalg.yield %26 : i32
} -> tensor<1x1x4x1x4xi32>

dropping the inner unit dimension of tensor<1x1x4x1x4xi32> is probably causing the issue. You should be able to control the dimensions that you drop. But before that, why is the result not tensor<1x1x1x4x4xi32>?

MaheshRavishankar commented 2 months ago

Or if you post the IR after vectorization, that will give some clues.

yzhang93 commented 2 months ago

So I think the one remaining problem for vectorization is to get the compiler to canonicalize this linalg.generic and then recognise that it's just a matmul.

Update on this: I have a WIP pass which eliminates the singleton dimensions so that after the pass the linalg.generic is clearly a matmul, but vectorization now introduces a broadcast before the vector.contract. Investigating...

I think it previously had a compilation issue when lowered to vector.broadcast, but it would be good to check whether aievec can handle vector.broadcast now.

newling commented 2 months ago

Just noticed your comment @MaheshRavishankar and linalg-fold-unit-extent-dims works perfectly, thank you!

The pass I've written is basically the same as linalg-fold-unit-extent-dims but uses tensor.extract_slice instead of tensor.expand_shape, and for some reason comprehensive bufferization fails with the extract_slice approach.

Removing all unit dimensions is exactly what I want. It isn't enough to just remove the reduction dimensions, because then the broadcasts don't get eliminated (vector.contract verifies that all dimensions appear in either the LHS or the RHS).
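As a rough illustration of the difference, here is a Python sketch that removes unit-extent loop dimensions from the three indexing maps posted earlier. It only models the effect on the maps; it is not the upstream linalg-fold-unit-extent-dims implementation (which also reshapes the operands), and `drop_dims` and `render` are hypothetical helpers.

```python
loop_counts = {"d0": 1, "d1": 1, "d2": 4, "d3": 1, "d4": 1,
               "d5": 1, "d6": 1, "d7": 4, "d8": 8}
reduction_dims = {"d4", "d5", "d6", "d8"}

# Each map result is a list of dims that are summed, e.g. d1 + d4.
maps = {
    "lhs": [["d0"], ["d1", "d4"], ["d6"], ["d2", "d5"], ["d8"]],
    "rhs": [["d4"], ["d5"], ["d6"], ["d3"], ["d8"], ["d7"]],
    "out": [["d0"], ["d1"], ["d2"], ["d3"], ["d7"]],
}


def drop_dims(maps, dropped):
    """Drop dims that are always 0; results left with no dims disappear."""
    result = {}
    for name, exprs in maps.items():
        kept = [[d for d in expr if d not in dropped] for expr in exprs]
        result[name] = [expr for expr in kept if expr]
    return result


def render(exprs):
    return "(" + ", ".join(" + ".join(e) for e in exprs) + ")"


unit = {d for d, n in loop_counts.items() if n == 1}
all_dropped = drop_dims(maps, unit)
reductions_only = drop_dims(maps, unit & reduction_dims)

for name in ("lhs", "rhs", "out"):
    print(name, render(all_dropped[name]), render(reductions_only[name]))
# lhs (d2, d8) (d0, d1, d2, d8)
# rhs (d8, d7) (d3, d8, d7)
# out (d2, d7) (d0, d1, d2, d3, d7)
```

With every unit dimension removed, the maps are the standard matmul maps (m, k), (k, n), (m, n). With only the unit reduction dimensions removed, the unit parallel dimensions d0, d1 and d3 are still present and each of them is missing from one of the two inputs.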

nod-ai / iree-amd-aie

[Tiling] Packing for convolution #756