Towards vectorized convolution (first PR)

newling commented 3 weeks ago

This PR does 2 things:

1) moves aievec lowering before scf-to-cf, because computing the alignment information needed to support convolution vectorizion in cf is more difficult than in scf.

2) Makes flattening of transfer_read ops more "aggressive". Large c&p from upstream MLIR here, sorry. Read on for more info.

The current convolution workflow ends up with core code to load from the input image that looks like:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %0 = vector.transfer_read 
         %reinterpret_cast_54[%c0, %arg1, %arg3, %arg2, %c0], %cst 
         {in_bounds = [true, true]} : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
      ... 
    }
  }
}

What is the alignment of the transfer_read above? In other words, what is the highest multiple of 2 that divides the offset of the transfer_read for all values of %arg1, %arg2, and %arg3? It is 16 bytes (consider %arg2=1).

This small alignment is a problem for the AIE instruction set, and without a change to the IR, the transfer_read will lower to inefficient, scalar code. Actually it's currently even worse than that -- it isn't correctly scalarized and we see numerical errors for basic convolutions unless we disable vectorization (slack peano channel discussion). We need 32 byte alignment.

The solution that the aievec dialect/project has hit upon is implemented in the following pattern: https://github.com/Xilinx/mlir-aie/blob/9fe5fb5386dbf087aca9bfba3815cd5bfa56d80d/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L119

The pattern converts an unaligned transfer_read into an aligned transfer_read of twice the length, followed by a vector.extract_strided_slice operation. For our convolution example, we therefore want to transfer_read a vector of 64 bf16 elements, and then extract the 32 bf16 elements that we actually want.

Clearly, a transfer_read of 64 elements from something of type memref<1x3x4x6x8xbf16> is not possible (because 6*8 = 48, which does not divide 64). We need some flattening. After running the upstream FlattenContiguousRowMajorTransfer pass, the memref is flattened as follows:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %collapse_shape = memref.collapse_shape %reinterpret_cast_54 [[0], [1], [2], [3, 4]] : memref<1x3x4x6x8xbf16> into memref<1x3x4x48xbf16>
      %0 = affine.apply affine_map<()[s0] -> (s0 * 8)>()[%arg2]
      %1 = vector.transfer_read %collapse_shape[%c0, %arg1, %arg3, %0], %cst {in_bounds = [true]} : memref<1x3x4x48xbf16>, vector<32xbf16>
      ...
    }
  }
}

but this is still insufficient, as the innermost dimension 48 still does not divide 64. This PR therefore makes the flattening more aggressive, so that we get IR like:

scf.for %arg1 = %c0 to %c3 step %c1 {
  scf.for %arg2 = %c0 to %c3 step %c1 {
    scf.for %arg3 = %c0 to %c4 step %c1 {
      %collapse_shape = memref.collapse_shape %reinterpret_cast_51 [[0, 1, 2, 3, 4]] : memref<1x3x4x6x8xbf16> into memref<576xbf16>
      %0 = affine.apply affine_map<()[s0, s1, s2] -> (s0 * 192 + s1 * 48 + s2 * 8)>()[%arg1, %arg3, %arg2]
      %1 = vector.transfer_read %collapse_shape[%0], %cst {in_bounds = [true]} : memref<576xbf16>, vector<32xbf16>
    }
  }
}

With IR like this we will be able to perform the aievec trick linked to above (future PR).

jtuyls commented 3 weeks ago

We need 64 byte alignment.

The memory loads are 256 bit == 32 byte, so why do we need 64 byte alignment? I read the peano slack, but this still isn't clear to me.

newling commented 3 weeks ago

We need 64 byte alignment.

The memory loads are 256 bit == 32 byte, so why do we need 64 byte alignment? I read the peano slack, but this still isn't clear to me.

You're right, we need 32 byte alignment. I've updated the incorrect 2 characters in the summary. It doesn't change the reasoning. We still want to transfer_read 64 bytes, and then extract the 32 bytes we want from those.

Let me try and explain with a toy model... consider 8 bytes in tile memory:

01234567

from which we want to put bytes 12 into a register. Suppose that the hardware constrains us to start transfers from memory to registers at even bytes. The aievec trick is to to transfer_read 0123 into a (larger) register and then in a subsequent step extract the 12 into a smaller register. To instructions for this second step at the HW level are

1) 2 extracts (0123 -> 01 and 23) https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html 2) one shift (concats top bits from 01 and bottom bits from 23) https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html

jtuyls commented 3 weeks ago

We need 64 byte alignment.

The memory loads are 256 bit == 32 byte, so why do we need 64 byte alignment? I read the peano slack, but this still isn't clear to me.

You're right, we need 32 byte alignment. I've updated the incorrect 2 characters in the summary. It doesn't change the reasoning. We still want to transfer_read 64 bytes, and then extract the 32 bytes we want from those.

Let me try and explain with a toy model... consider 8 bytes in tile memory:

01234567

from which we want to put bytes 12 into a register. Suppose that the hardware constrains us to start transfers from memory to registers at even bytes. The aievec trick is to to transfer_read 0123 into a (larger) register and then in a subsequent step extract the 12 into a smaller register. To instructions for this second step at the HW level are

2 extracts (0123 -> 01 and 23) https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html

one shift (concats top bits from 01 and bottom bits from 23) https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html

Thanks, that makes sense now.

nod-ai / iree-amd-aie

Towards vectorized convolution (first PR) #864