newling opened this issue 5 months ago
@jtuyls please feel free to update or add more info
Adding some thoughts and additional information to the above approaches.
Approach 1: global padding:
The approach to handle this is to pad the input tensors to the next multiple of the target atomic size. This can be done upfront:
%A_padded = my.pad_with_zero %A : tensor<13x17xbf16> to tensor<16x24xbf16>
%B_padded = my.pad_with_zero %B : tensor<17x19xbf16> to tensor<24x20xbf16>
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
Above we are incurring an O(mk + kn) overhead for the padding and slicing operations, as well as potentially expensive O(1) costs for having multiple kernels which need to be run. However, sometimes the padding operations (and the slice operation) can be merged with the operations which produce the operands? TODO: more details.
IREE has some support for this; links have been shared by @nirvedhmeshram and Stan (see below).
Some remarks:
While in the worst case there are additional copies inserted, I think in a lot of cases the padding operations can be fused into the producer. This producer operation could be offloaded to the CPU, but it could also be offloaded to AIE, in which case the local AIE core processors can take care of padding and slicing the data in fast L1 memory. Consider for example two back-to-back matmuls with a ReLU in between:
%A_padded = my.pad_with_zero %A : tensor<13x17xbf16> to tensor<16x24xbf16>
%B_padded = my.pad_with_zero %B : tensor<17x19xbf16> to tensor<24x20xbf16>
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
%Relu = my.relu %C : tensor<13x19xf32> to tensor<13x19xf32>
%Relu_padded = my.pad_with_zero %Relu : tensor<13x19xf32> to tensor<16x24xbf16>  // f32 -> bf16 conversion elided
%B2_padded = my.pad_with_zero %B2 : tensor<19x19xbf16> to tensor<24x20xbf16>
%Out_padded = my.matmul %Relu_padded, %B2_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%Out = my.slice %Out_padded : tensor<16x20xf32> to tensor<13x19xf32>
While naively only the matmuls would be offloaded to AIE, with a bunch of pad/slice type operations in between going to CPU, the padding, ReLU and slicing operations could be optimized by fusing them into the surrounding matmul dispatches, so that they run on the AIE cores in local L1 memory.
This could result in two back-to-back AIE kernels:
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
%Relu = my.relu %C : tensor<13x19xf32> to tensor<13x19xf32>
%Relu_padded = my.pad_with_zero %Relu : tensor<13x19xf32> to tensor<16x24xbf16>  // f32 -> bf16 conversion elided
and
%B2_padded = my.pad_with_zero %B2 : tensor<19x19xbf16> to tensor<24x20xbf16>
%Out_padded = my.matmul %Relu_padded, %B2_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%Out = my.slice %Out_padded : tensor<16x20xf32> to tensor<13x19xf32>
Advantages
Disadvantages
Approach 2: padding in DMA:
The AIE has the option to pad in the MM2S channels of Memory Tiles. See:
https://docs.amd.com/r/en-US/am020-versal-aie-ml/AIE-ML-Memory-Tile-Memory
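For example, transferring one 17-element row of A and zero-extending it to 24 elements could be expressed roughly as the following MM2S BD configuration (a sketch in the same strides/sizes/padding_after notation used elsewhere in this issue; the exact field names and granularity constraints should be checked against the AM020 documentation):
strides: [1], sizes: [17], padding_after: [7]   (17 valid values plus 7 zeros appended, giving 24)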
I don't think this is useful for this use case, because we want to pad in the MM2S of the Shim Tile, not the MM2S of the memtile.
The Shim Tile is indeed not capable of padding; however, you could move the data to the Mem Tile first and pad when transferring data from the Mem Tile to the AIE cores. However, there are still a couple of caveats to this approach:
strides: [1], sizes: [7], padding_after: [17]
(unless there is a fancy way of achieving this which I am not seeing).
Advantages
Disadvantages
Approach 3: Handle edges separately:
In approach 1 we padded the inputs, but an alternative is to slice the edges of the inputs and handle them separately (thus not padding whole tensors, just padding the sliced-off edges). This won't work in the k-dimension, since slices along k produce partial sums that must be accumulated into the same output rather than independent output slices that can be concatenated.
This could be combined with either approach 1 or 2 and would therefore inherit the advantages/disadvantages of the respective approach. However, this approach would add additional concat-type operations to combine the result edge slices with the main result. As AIE isn't very well suited for this type of data-copying operation, we would have to make sure that this concatenation is fused into the preceding operations.
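As a rough sketch of what this could look like for the m-dimension edge of the running example (m = 13 = 12 + 1), again using the illustrative my.* pseudo-ops; the n-edge would be handled analogously, the 1-row edge slice would itself be padded to the 4-row atomic size before compute, and the k-dimension would still be padded as in approach 1:
%A_main = my.slice %A : tensor<13x17xbf16> to tensor<12x17xbf16>   // rows 0..11, m a multiple of 4
%A_edge = my.slice %A : tensor<13x17xbf16> to tensor<1x17xbf16>    // row 12
%C_main = my.matmul %A_main, %B : tensor<12x17xbf16>, tensor<17x19xbf16> -> tensor<12x19xf32>
%C_edge = my.matmul %A_edge, %B : tensor<1x17xbf16>, tensor<17x19xbf16> -> tensor<1x19xf32>
%C = my.concat %C_main, %C_edge : tensor<12x19xf32>, tensor<1x19xf32> -> tensor<13x19xf32>
The final my.concat here is exactly the kind of data-copying operation that would need to be fused into the preceding operations to avoid extra copies.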
Adding an additional approach:
This approach handles odd shapes in local memory by either:
Advantages
Disadvantages
Just saw this in-depth follow-up, thanks @jtuyls
My concern with approach 2 (and please correct me, anyone, if I'm wrong) is that it cannot work for large tensors. Suppose you're feeding tiles of size 128x128 from a tensor A of size 255x255 in DDR to the memory tiles. In this case, you only perform padding on the right-most tiles (A[0:128, 128:255] and A[128:255, 128:255]). It isn't possible for the memory tile to selectively pad: it can do either sizes: [127], padding_after: [1] (which we want for the rightmost tiles) or sizes: [128], padding_after: [0] (which we want for the interior tiles), but it cannot do both.
Update: after speaking with @erwei-xilinx I think it is possible, but you need to use a different BD for the edges of the tensor. For static shapes this is feasible.
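As a rough illustration in the same sizes/padding_after notation (assuming static shapes so both BDs can be generated at compile time), the columns of A above could be fetched with two BD configurations:
sizes: [128], padding_after: [0]   (interior column tiles: 128 valid values, no padding)
sizes: [127], padding_after: [1]   (rightmost column tiles: 127 valid values plus 1 zero)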
This issue discusses support for vectorized matmul (and other vectorized operations) for shapes which are not tiled by the vector width.
Consider for example
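a 13x17 times 17x19 bf16 matmul, sketched here with illustrative my.* pseudo-ops (placeholders, not an actual dialect):
%C = my.matmul %A, %B : tensor<13x17xbf16>, tensor<17x19xbf16> -> tensor<13x19xf32>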
The target atomic size of a matmul on AIE2 is m=n=4, k=8. So this matmul with m=13, n=19, k=17 is not tiled by the target atomic size (13 and 19 would need to round up to 16 and 20, and 17 up to 24), and it cannot be handled by the vectorized matmul without some additional handling.
Approach 1: global padding:
The approach to handle this is to pad the input tensors to the next multiple of the target atomic size. This can be done upfront:
Above we are incurring an O(mk + kn) overhead for the padding and slicing operations, as well as potentially expensive O(1) costs for having multiple kernels which need to be run. However, sometimes the padding operations (and the slice operation) can be merged with the operations which produce the operands? TODO: more details.
IREE has some support for this, links have been shared by @nirvedhmeshram and Stan:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Preprocessing/Common/PadToIntrinsics.cpp
https://github.com/MaheshRavishankar/iree/blob/dfa593104cc539abfcbf94572a6166eb79c5f413/compiler/src/iree/compiler/Preprocessing/Common/PadToIntrinsics.cpp#L1
in response to similar discussions (see for example https://discord.com/channels/689900678990135345/1169307503633383664/1232760520575029309).
Approach 2: padding in DMA:
The AIE has the option to pad in the MM2S channels of Memory Tiles. See:
https://docs.amd.com/r/en-US/am020-versal-aie-ml/AIE-ML-Memory-Tile-Memory
I don't think this is useful for this use case, because we want to pad in the MM2S of the Shim Tile, not the MM2S of the memtile.
Approach 3: Handle edges separately:
In approach 1 we padded the inputs, but an alternative is to slice the edges of the inputs and handle them separately (thus not padding whole tensors, just padding the sliced-off edges). This won't work in the k-dimension.