nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator

Proposed steps for getting convolutions using matmul ukernel #529

Open newling opened 1 month ago

newling commented 1 month ago

Currently our conv tiling creates something like:

%20 = scf.for %arg7 = %c0 to %c3 step %c1 iter_args(%arg8 = %19) -> (tensor<1x1x4x4xf32>) {
  %22 = scf.for %arg9 = %c0 to %c3 step %c1 iter_args(%arg10 = %arg8) -> (tensor<1x1x4x4xf32>) {
    %23 = scf.for %arg11 = %c0 to %c32 step %c8 iter_args(%arg12 = %arg10) -> (tensor<1x1x4x4xf32>) {
      %extracted_slice_7 = tensor.extract_slice %extracted_slice_4[0, %arg7, %arg9, %arg11] [1, 1, 4, 8] [1, 1, 1, 1] : 
          tensor<1x3x6x32xbf16> to tensor<1x1x4x8xbf16>
      %extracted_slice_8 = tensor.extract_slice %12[%arg7, %arg9, %arg11, 0] [1, 1, 8, 4] [1, 1, 1, 1] : tensor<3x3x32x4xbf16> to tensor<1x1x8x4xbf16>
      %24 = bufferization.alloc_tensor() : tensor<1x1x4x8xbf16>
      %alloc_9 = memref.alloc() : memref<1x1x4x8xbf16, 2 : i32>
      %25 = bufferization.to_tensor %alloc_9 restrict writable : memref<1x1x4x8xbf16, 2 : i32>
      %26 = linalg.copy ins(%extracted_slice_7 : tensor<1x1x4x8xbf16>) outs(%25 : tensor<1x1x4x8xbf16>) -> tensor<1x1x4x8xbf16>
      %27 = bufferization.alloc_tensor() : tensor<1x1x8x4xbf16>
      %alloc_10 = memref.alloc() : memref<1x1x8x4xbf16, 2 : i32>
      %28 = bufferization.to_tensor %alloc_10 restrict writable : memref<1x1x8x4xbf16, 2 : i32>
      %29 = linalg.copy ins(%extracted_slice_8 : tensor<1x1x8x4xbf16>) outs(%28 : tensor<1x1x8x4xbf16>) -> tensor<1x1x8x4xbf16>
      %extracted_slice_11 = tensor.extract_slice %26[0, 0, 0, 0] [1, 1, 4, 8] [1, 1, 1, 1] : tensor<1x1x4x8xbf16> to tensor<1x4x8xbf16>
      %extracted_slice_12 = tensor.extract_slice %29[0, 0, 0, 0] [1, 1, 8, 4] [1, 1, 1, 1] : tensor<1x1x8x4xbf16> to tensor<1x8x4xbf16>
      %extracted_slice_13 = tensor.extract_slice %arg12[0, 0, 0, 0] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<1x1x4x4xf32> to tensor<1x4x4xf32>
      %30 = linalg.conv_1d_nwc_wcf {dilations = dense<1> : vector<1xi64>, strides = dense<1> : vector<1xi64>} 
         ins(%extracted_slice_11, %extracted_slice_12 : tensor<1x4x8xbf16>, tensor<1x8x4xbf16>) 
         outs(%extracted_slice_13 : tensor<1x4x4xf32>) -> tensor<1x4x4xf32>
      %inserted_slice = tensor.insert_slice %30 into %arg12[0, 0, 0, 0] [1, 1, 4, 4] [1, 1, 1, 1] : tensor<1x4x4xf32> into tensor<1x1x4x4xf32>
      %31 = linalg.copy ins(%inserted_slice : tensor<1x1x4x4xf32>) outs(%arg12 : tensor<1x1x4x4xf32>) -> tensor<1x1x4x4xf32>
      memref.dealloc %alloc_9 : memref<1x1x4x8xbf16, 2 : i32>
      memref.dealloc %alloc_10 : memref<1x1x8x4xbf16, 2 : i32>
      scf.yield %31 : tensor<1x1x4x4xf32>
    }
    scf.yield %23 : tensor<1x1x4x4xf32>
  }
  scf.yield %22 : tensor<1x1x4x4xf32>
}

Basically we want the linalg.conv_1d_nwc_wcf in the inner-most loop to get converted into a call to a matmul microkernel. How can we achieve this?

Step 1) The linalg.conv_1d_nwc_wcf can be converted into a linalg.matmul or a linalg.generic, which the pass AMDAIELowerToUKernels will subsequently match on. Note that if we use the linalg.generic approach, the form generated with FailureOr<linalg::GenericOp> g = generalizeNamedOp(rewriter, op); needs further massaging to get the affine maps into a canonical matmul form (there is a singleton reduction dimension). Alternatively, we can match directly on a conv_1d whose spatial dimension has size 1, and convert it to a linalg.matmul.
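For concreteness, here is a rough (unverified) sketch of the form step 1 aims for, using the shapes from the tiling above with the unit batch/spatial dimensions already collapsed away (the function and value names are made up):

func.func @conv1d_unit_window_as_matmul(%lhs: tensor<4x8xbf16>, %rhs: tensor<8x4xbf16>,
                                        %acc: tensor<4x4xf32>) -> tensor<4x4xf32> {
  // With a filter window of size 1 and stride 1, the conv_1d_nwc_wcf above is
  // just a 4x8 * 8x4 matmul accumulating into f32, which AMDAIELowerToUKernels
  // could then match on.
  %0 = linalg.matmul ins(%lhs, %rhs : tensor<4x8xbf16>, tensor<8x4xbf16>)
                     outs(%acc : tensor<4x4xf32>) -> tensor<4x4xf32>
  return %0 : tensor<4x4xf32>
}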

Step 2) If we just do step 1, we can compile, but iree-run-module will crash: the ukernel that gets inserted expects operands of size 64x64 and 64x64 respectively, and there are no safety rails to check that this is the case (there is no check on the size of the matmul), so bad things happen at runtime. The matmul we'd get from the above tiling has operands of shape 4x8 and 8x4. We must therefore do at least one of the following (maybe both):

i) Make the ukernel approach more flexible (so that it accepts matmuls of different shapes)
ii) Make the tiling of the convolution result in larger matmuls (ideally 64x64x64, if we don't want to do (i))

For (i) we need to understand/own/improve the microkernel generation code here: https://github.com/Xilinx/mlir-aie/blob/main/aie_kernels/mm.cc

For (ii) we need to modify the C++ pipeline, implemented here: https://github.com/nod-ai/iree-amd-aie/blob/9f809c6c27730894275919347f2c44ff40e4a505/compiler/plugins/target/AMD-AIE/iree-amd-aie/Transforms/Passes.cpp#L426

The pipeline is extremely conservative in L1 -- it uses much less memory than the AIE core's data memory provides (64 kB, I think). So larger tiling in L2 should be explored.
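For a rough sense of scale (assuming bf16 inputs and an f32 accumulator, as in the IR above): the current 1x1x4x8 and 1x1x8x4 L1 input tiles are only 64 bytes each, whereas a 64x64x64 matmul would need about 2 x 64x64 x 2 B = 16 kB for the two inputs plus 64x64 x 4 B = 16 kB for the accumulator, i.e. roughly 32 kB in total, which should still fit in the ~64 kB of core data memory.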

How can we achieve a 64x64x64 matmul, though? One of these 64's corresponds to the input channel (the reduction dimension), one corresponds to the output channel, and one must probably correspond to a spatial dimension. This seems tricky.

erwei-xilinx commented 1 month ago

Thanks for the great summary. I was also looking into this topic yesterday. I did a braindead ukernel conversion of what IREE-AMD-AIE is tiling today (https://github.com/Xilinx/mlir-air/pull/664) and collected a trace from it. The pipeline is not great... trace_slow.json

erwei-xilinx commented 1 month ago

Then I did a hand conversion: L1 buffers all of the inputs and weights needed for the whole for loop nest, the ukernel is rewritten to compute the entire loop nest, and a chess_prepare_for_pipeline pragma is added. The pipeline improved: trace.json

newling commented 1 month ago

Thanks Erwei, I didn't realize you were already looking at this! It's hard for me to tell from the Perfetto profile what the performance is like, but I can guess that just doing a single 4x8 by 8x4 matmul is not going to be good (i.e. trace_slow.json). For the second version, you basically changed the IR so that inside the loops

scf.for %arg5 = %c0 to %c3 step %c1 {
  scf.for %arg6 = %c0 to %c3 step %c1 {
    scf.for %arg7 = %c0 to %c32 step %c8 {
      ...
    }
  }
}

there is absolutely no copying between L2 and L1? Nice that it fits. I suppose we could make that the default tiling in IREE already?

erwei-xilinx commented 1 month ago

> there is absolutely no copying between L2 and L1? Nice that it fits. I suppose we could make that the default tiling in IREE already?

Yes, basically instead of buffering a single vector op's inputs in L1, I converted to buffering all of the inputs needed for the for loop nest. It's basically buffering the img2col-expanded input image of each L2 tile in L1 memory. That's still only 1152 i8 numbers, which is really small and underutilizes the L1 buffer size.
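In IR terms, the conversion described above might look roughly like this (an unverified sketch with made-up names; the L2 buffers are assumed to already be in img2col-expanded 3x3x4x32 / 3x3x32x4 layouts so the one-time fill can be shown as a plain copy, and the inner linalg.matmul stands in for the rewritten ukernel call):

func.func @l1_resident_nest(%lhs_l2: memref<3x3x4x32xbf16, 1 : i32>,
                            %rhs_l2: memref<3x3x32x4xbf16, 1 : i32>,
                            %acc_l1: memref<4x4xf32, 2 : i32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c3 = arith.constant 3 : index
  %c8 = arith.constant 8 : index
  %c32 = arith.constant 32 : index
  // One-time L2 -> L1 copies of everything the loop nest will read.
  %lhs_l1 = memref.alloc() : memref<3x3x4x32xbf16, 2 : i32>
  %rhs_l1 = memref.alloc() : memref<3x3x32x4xbf16, 2 : i32>
  linalg.copy ins(%lhs_l2 : memref<3x3x4x32xbf16, 1 : i32>) outs(%lhs_l1 : memref<3x3x4x32xbf16, 2 : i32>)
  linalg.copy ins(%rhs_l2 : memref<3x3x32x4xbf16, 1 : i32>) outs(%rhs_l1 : memref<3x3x32x4xbf16, 2 : i32>)
  scf.for %kh = %c0 to %c3 step %c1 {
    scf.for %kw = %c0 to %c3 step %c1 {
      scf.for %ic = %c0 to %c32 step %c8 {
        // The inner body only slices the L1-resident buffers: no allocs and
        // no L2 -> L1 copies per iteration.
        %lhs = memref.subview %lhs_l1[%kh, %kw, 0, %ic] [1, 1, 4, 8] [1, 1, 1, 1]
            : memref<3x3x4x32xbf16, 2 : i32> to memref<4x8xbf16, strided<[32, 1], offset: ?>, 2 : i32>
        %rhs = memref.subview %rhs_l1[%kh, %kw, %ic, 0] [1, 1, 8, 4] [1, 1, 1, 1]
            : memref<3x3x32x4xbf16, 2 : i32> to memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>
        // Stand-in for the matmul ukernel computing one 4x8 * 8x4 contribution.
        linalg.matmul ins(%lhs, %rhs : memref<4x8xbf16, strided<[32, 1], offset: ?>, 2 : i32>, memref<8x4xbf16, strided<[4, 1], offset: ?>, 2 : i32>)
                      outs(%acc_l1 : memref<4x4xf32, 2 : i32>)
      }
    }
  }
  memref.dealloc %lhs_l1 : memref<3x3x4x32xbf16, 2 : i32>
  memref.dealloc %rhs_l1 : memref<3x3x32x4xbf16, 2 : i32>
  return
}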

It's kind of not trivial to do in Transform Dialect though, because it appears that L1 bufferization is driven by the consumer which is the L2->L1 memcpy, which is driven by the tile-using-for. So without this for loop nest, the data movement doesn't really pack the data layout to give the vectorized inputs as innermost. Pack operations for L2-to-L1 just like pad-pack, maybe?