newling opened this issue 5 months ago
@jtuyls please feel free to update or add more info
Adding some thoughts and additional information to the above approaches.
Approach 1: global padding:
The approach to handle this is to pad the input tensors to the next multiple of the target atomic size. This can be done upfront:
%A_padded = my.pad_with_zero %A : tensor<13x17xbf16> to tensor<16x24xbf16>
%B_padded = my.pad_with_zero %B : tensor<17x19xbf16> to tensor<24x20xbf16>
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
Above we are incurring an O(mk + kn) overhead for the padding and slicing operations, as well as potentially expensive O(1) costs for having multiple kernels which need to be run. However, sometimes the padding operations (and the slice operation) can be merged with the operations which produce the operands? TODO: more details.
IREE has some support for this; links have been shared by @nirvedhmeshram and Stan (see below).
Some remarks:
While in the worst case there are additional copies inserted, I think in a lot of cases the padding operations can be fused into the producer. This producer operation could be offloaded to the CPU, but it could also be offloaded to AIE, in which case the local AIE core processors can take care of padding and slicing the data in fast L1 memory. Consider for example two back-to-back matmuls with a ReLU in between:
%A_padded = my.pad_with_zero %A : tensor<13x17xbf16> to tensor<16x24xbf16>
%B_padded = my.pad_with_zero %B : tensor<17x19xbf16> to tensor<24x20xbf16>
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
%Relu = my.relu %C : tensor<13x19xf32> to tensor<13x19xf32>
%Relu_padded = my.pad_with_zero %Relu : tensor<13x19xf32> to tensor<16x24xbf16>  // f32 -> bf16 conversion elided
%B2_padded = my.pad_with_zero %B2 : tensor<19x19xbf16> to tensor<24x20xbf16>
%Out_padded = my.matmul %Relu_padded, %B2_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%Out = my.slice %Out_padded : tensor<16x20xf32> to tensor<13x19xf32>
While naively only the matmuls would be offloaded to AIE, with a bunch of pad/slice type operations in between going to CPU, the padding, ReLU and slicing operations could be optimized by fusing them into the surrounding matmul dispatches, so that they run on the AIE cores in local L1 memory.
This could result in two back-to-back AIE kernels:
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
%Relu = my.relu %C : tensor<13x19xf32> to tensor<13x19xf32>
%Relu_padded = my.pad_with_zero %Relu : tensor<13x19xf32> to tensor<16x24xbf16>  // f32 -> bf16 conversion elided
and
%B2_padded = my.pad_with_zero %B2 : tensor<19x19xbf16> to tensor<24x20xbf16>
%Out_padded = my.matmul %Relu_padded, %B2_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%Out = my.slice %Out_padded : tensor<16x20xf32> to tensor<13x19xf32>
Advantages
Disadvantages
Approach 2: padding in DMA:
The AIE has the option to pad in the MM2S channels of Memory Tiles. See:
https://docs.amd.com/r/en-US/am020-versal-aie-ml/AIE-ML-Memory-Tile-Memory
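For example, transferring one 17-element row of A and zero-extending it to 24 elements could be expressed roughly as the following MM2S BD configuration (a sketch in the same strides/sizes/padding_after notation used elsewhere in this issue; the exact field names and granularity constraints should be checked against the AM020 documentation):
strides: [1], sizes: [17], padding_after: [7]   (17 valid values plus 7 zeros appended, giving 24)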
I don't think this is useful for this use case, because we want to pad in the MM2S of the Shim Tile, not the MM2S of the memtile.
The Shim Tile is indeed not capable of padding; however, you could move the data to the Mem Tile first and pad when transferring data from the Mem Tile to the AIE cores. However, there are still a couple of caveats to this approach:
strides: [1], sizes: [7], padding_after: [17]
(unless there is a fancy way of achieving this which I am not seeing).
Advantages
Disadvantages
Approach 3: Handle edges separately:
In approach 1 we padded the inputs, but an alternative is to slice the edges of the inputs and handle them separately (thus not padding whole tensors, just padding the sliced-off edges). This won't work in the k-dimension, since slices along k produce partial sums that must be accumulated into the same output rather than independent output slices that can be concatenated.
This could be combined with either approach 1 or 2 and would therefore inherit the advantages/disadvantages of the respective approach. However, this approach would add additional concat-type operations to combine the result edge slices with the main result. As AIE isn't very well suited for this type of data-copying operation, we would have to make sure that this concatenation is fused into the preceding operations.
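As a rough sketch of what this could look like for the m-dimension edge of the running example (m = 13 = 12 + 1), again using the illustrative my.* pseudo-ops; the n-edge would be handled analogously, the 1-row edge slice would itself be padded to the 4-row atomic size before compute, and the k-dimension would still be padded as in approach 1:
%A_main = my.slice %A : tensor<13x17xbf16> to tensor<12x17xbf16>   // rows 0..11, m a multiple of 4
%A_edge = my.slice %A : tensor<13x17xbf16> to tensor<1x17xbf16>    // row 12
%C_main = my.matmul %A_main, %B : tensor<12x17xbf16>, tensor<17x19xbf16> -> tensor<12x19xf32>
%C_edge = my.matmul %A_edge, %B : tensor<1x17xbf16>, tensor<17x19xbf16> -> tensor<1x19xf32>
%C = my.concat %C_main, %C_edge : tensor<12x19xf32>, tensor<1x19xf32> -> tensor<13x19xf32>
The final my.concat here is exactly the kind of data-copying operation that would need to be fused into the preceding operations to avoid extra copies.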
Adding an additional approach:
This approach handles odd shapes in local memory by either:
Advantages
Disadvantages
Just saw this in-depth follow-up, thanks @jtuyls
My concern with approach 2 (and please correct me, anyone, if I'm wrong) is that it cannot work for large tensors. Suppose you're feeding tiles of size 128x128 from a tensor A of size 255x255 in DDR to the memory tiles. In this case, you only perform padding on the right-most tiles (A[0:128, 128:255] and A[128:255, 128:255]). It isn't possible for the memory tile to selectively pad: it can do either sizes: [127], padding_after: [1] (which we want for the rightmost tiles) or sizes: [128], padding_after: [0] (which we want for the interior tiles), but it cannot do both.
Update: after speaking with @erwei-xilinx I think it is possible, but you need to use a different BD for the edges of the tensor. For static shapes this is feasible.
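As a rough illustration in the same sizes/padding_after notation (assuming static shapes so both BDs can be generated at compile time), the columns of A above could be fetched with two BD configurations:
sizes: [128], padding_after: [0]   (interior column tiles: 128 valid values, no padding)
sizes: [127], padding_after: [1]   (rightmost column tiles: 127 valid values plus 1 zero)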
This issue discusses support for vectorized matmul (and other vectorized operations) for shapes which are not tiled by the vector width.
Consider for example
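a 13x17 times 17x19 bf16 matmul, sketched here with illustrative my.* pseudo-ops (placeholders, not an actual dialect):
%C = my.matmul %A, %B : tensor<13x17xbf16>, tensor<17x19xbf16> -> tensor<13x19xf32>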
The target atomic size of a matmul on AIE2 is m=n=4, k=8. So this matmul with m=13, n=19, k=17 is not tiled by the target atomic size (13 and 19 would need to round up to 16 and 20, and 17 up to 24), and it cannot be handled by the vectorized matmul without some additional handling.
Approach 1: global padding:
The approach to handle this is to pad the input tensors to the next multiple of the target atomic size. This can be done upfront:
Above we are incurring an O(mk + kn) overhead for the padding and slicing operations, as well as potentially expensive O(1) costs for having multiple kernels which need to be run. However, sometimes the padding operations (and the slice operation) can be merged with the operations which produce the operands? TODO: more details.
IREE has some support for this, links have been shared by @nirvedhmeshram and Stan:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Preprocessing/Common/PadToIntrinsics.cpp
https://github.com/MaheshRavishankar/iree/blob/dfa593104cc539abfcbf94572a6166eb79c5f413/compiler/src/iree/compiler/Preprocessing/Common/PadToIntrinsics.cpp#L1
in response to similar discussions (see for example https://discord.com/channels/689900678990135345/1169307503633383664/1232760520575029309).
Approach 2: padding in DMA:
The AIE has the option to pad in the MM2S channels of Memory Tiles. See:
https://docs.amd.com/r/en-US/am020-versal-aie-ml/AIE-ML-Memory-Tile-Memory
I don't think this is useful for this use case, because we want to pad in the MM2S of the Shim Tile, not the MM2S of the memtile.
Approach 3: Handle edges separately:
In approach 1 we padded the inputs, but an alternative is to slice the edges of the inputs and handle them separately (thus not padding whole tensors, just padding the sliced-off edges). This won't work in the k-dimension.