> We need 64 byte alignment.

The memory loads are 256 bits == 32 bytes, so why do we need 64-byte alignment? I read the peano slack, but this still isn't clear to me.
You're right, we need 32-byte alignment. I've updated the 2 incorrect characters in the summary. It doesn't change the reasoning: we still want to `transfer_read` 64 bytes, and then extract the 32 bytes we want from those.
Let me try and explain with a toy model... consider 8 bytes in tile memory: `01234567`, from which we want to put bytes `12` into a register. Suppose that the hardware constrains us to start transfers from memory to registers at even bytes. The aievec trick is to `transfer_read` `0123` into a (larger) register and then in a subsequent step extract the `12` into a smaller register. The instructions for this second step at the HW level are

1) 2 extracts (`0123` -> `01` and `23`): https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorconv__elem.html
2) one shift (concats the top bits from `01` and the bottom bits from `23`): https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__shift.html
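In vector dialect terms, the toy model might look like this (an illustrative sketch only, not real AIE code; `%mem` is a hypothetical 8-byte buffer):

```mlir
// Read the four bytes 0123 starting at the even address 0, then carve out
// bytes 12 with an extract at offset 1.
%c0    = arith.constant 0 : index
%pad   = arith.constant 0 : i8
%whole = vector.transfer_read %mem[%c0], %pad
           {in_bounds = [true]} : memref<8xi8>, vector<4xi8>
%want  = vector.extract_strided_slice %whole
           {offsets = [1], sizes = [2], strides = [1]}
           : vector<4xi8> to vector<2xi8>
```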
Thanks, that makes sense now.
This PR does 2 things:
1) moves aievec lowering before scf-to-cf, because computing the alignment information needed to support convolution vectorization in cf is more difficult than in scf.
2) Makes flattening of `transfer_read` ops more "aggressive". Large c&p from upstream MLIR here, sorry. Read on for more info.

The current convolution workflow ends up with core code to load from the input image that looks like:
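(The original IR snippet did not survive extraction; as a stand-in, here is a hypothetical read consistent with the shapes discussed below. Variable names and index placement are illustrative assumptions, not the PR's exact IR.)

```mlir
// Hypothetical sketch: load a contiguous 4x8 tile (32 bf16 values) from
// the convolution input. With row-major strides and 2-byte elements, the
// byte offset of this read is 384*%arg1 + 96*%arg3 + 16*%arg2.
%c0  = arith.constant 0 : index
%pad = arith.constant 0.0 : bf16
%v   = vector.transfer_read %input[%c0, %arg1, %arg3, %arg2, %c0], %pad
         {in_bounds = [true, true]}
         : memref<1x3x4x6x8xbf16>, vector<4x8xbf16>
```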
What is the alignment of the `transfer_read` above? In other words, what is the highest power of 2 that divides the byte offset of the `transfer_read` for all values of %arg1, %arg2, and %arg3? It is 16 bytes (consider %arg2=1).
This small alignment is a problem for the AIE instruction set, and without a change to the IR, the `transfer_read` will lower to inefficient, scalar code. Actually it's currently even worse than that -- it isn't correctly scalarized, and we see numerical errors for basic convolutions unless we disable vectorization (see the peano slack channel discussion). We need 32-byte alignment.
The solution that the aievec dialect/project has hit upon is implemented in the following pattern: https://github.com/Xilinx/mlir-aie/blob/9fe5fb5386dbf087aca9bfba3815cd5bfa56d80d/lib/Dialect/AIEVec/Transforms/VectorToVectorConversions.cpp#L119
The pattern converts an unaligned `transfer_read` into an aligned `transfer_read` of twice the length, followed by a `vector.extract_strided_slice` operation. For our convolution example, we therefore want to `transfer_read` a vector of 64 bf16 elements, and then extract the 32 bf16 elements that we actually want.
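In vector dialect terms, the rewrite might look roughly like this (a minimal sketch assuming a 1-D memref and a statically known 16-byte misalignment; the linked pattern's actual output will differ):

```mlir
// Before: a 32-element read whose byte offset (2 * %off) is only
// 16-byte aligned.
%v = vector.transfer_read %buf[%off], %pad
       {in_bounds = [true]} : memref<576xbf16>, vector<32xbf16>

// After: round the offset down by 8 elements (16 bytes) to the enclosing
// 32-byte boundary, read 64 elements, and extract the 32 wanted ones.
%c8      = arith.constant 8 : index
%aligned = arith.subi %off, %c8 : index
%big     = vector.transfer_read %buf[%aligned], %pad
             {in_bounds = [true]} : memref<576xbf16>, vector<64xbf16>
%v2      = vector.extract_strided_slice %big
             {offsets = [8], sizes = [32], strides = [1]}
             : vector<64xbf16> to vector<32xbf16>
```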
Clearly, a `transfer_read` of 64 elements from something of type `memref<1x3x4x6x8xbf16>` is not possible (because 6*8 = 48, which does not divide 64). We need some flattening. After running the upstream `FlattenContiguousRowMajorTransfer` pass, the two innermost dimensions of the memref are collapsed, but this is still insufficient, as the innermost dimension 48 still does not divide 64. This PR therefore makes the flattening more aggressive; both stages are sketched below.
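(The two IR snippets here were also lost in extraction; a hypothetical reconstruction, reusing the shapes above. The exact collapse groups and index maps are assumptions.)

```mlir
// Stage 1 -- after upstream FlattenContiguousRowMajorTransfer: the two
// innermost dims (6x8) are collapsed into one of size 48. A 64-element
// read is still impossible since 48 does not divide 64.
%flat = memref.collapse_shape %input [[0], [1], [2], [3, 4]]
          : memref<1x3x4x6x8xbf16> into memref<1x3x4x48xbf16>
%i    = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg2)
%v    = vector.transfer_read %flat[%c0, %arg1, %arg3, %i], %pad
          {in_bounds = [true]} : memref<1x3x4x48xbf16>, vector<32xbf16>

// Stage 2 -- after this PR's more aggressive flattening: all dims are
// collapsed and the index is fully linearized, so a doubled (64-element)
// read becomes possible.
%flat1d = memref.collapse_shape %input [[0, 1, 2, 3, 4]]
            : memref<1x3x4x6x8xbf16> into memref<576xbf16>
%j      = affine.apply
            affine_map<(d0, d1, d2) -> (d0 * 192 + d1 * 48 + d2 * 8)>
            (%arg1, %arg3, %arg2)
%v      = vector.transfer_read %flat1d[%j], %pad
            {in_bounds = [true]} : memref<576xbf16>, vector<32xbf16>
```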
With IR like this we will be able to perform the aievec trick linked to above (future PR).