nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator

Supporting matmuls with "odd" dimensions #401

Open newling opened 3 weeks ago

newling commented 3 weeks ago

This issue discusses support for vectorized matmul (and other vectorized operations) for shapes which are not tiled by the vector width.

Consider for example

%C = linalg.matmul ins(%A, %B : tensor<13x17xbf16>, tensor<17x19xbf16>)
                   outs(%C_init : tensor<13x19xf32>) -> tensor<13x19xf32>

The target atomic size of a matmul on AIE2 is m=n=4, k=8. So this matmul, with m=13, n=19, k=17, is not tiled by the target atomic size, and it cannot be executed with the vectorized matmul without some additional handling.
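
For reference, a matmul whose shape already matches the atomic size vectorizes directly. A minimal sketch in upstream linalg notation (%a, %b and the accumulator %acc are placeholder values, not names from this codebase):

%tile = linalg.matmul ins(%a, %b : tensor<4x8xbf16>, tensor<8x4xbf16>)
                      outs(%acc : tensor<4x4xf32>) -> tensor<4x4xf32>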

Approach 1: global padding:

One way to handle this is to pad the input tensors up to the next multiple of the target atomic size. This can be done upfront:

%A_padded = my.pad_with_zero %A : tensor<13x17xbf16> to tensor<16x24xbf16>
%B_padded = my.pad_with_zero %B : tensor<17x19xbf16> to tensor<24x20xbf16>
%C_padded = my.matmul %A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16> -> tensor<16x20xf32>
%C = my.slice %C_padded : tensor<16x20xf32> to tensor<13x19xf32>
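
For concreteness, here is roughly how the same structure can be written with upstream MLIR ops (tensor.pad, linalg.matmul, tensor.extract_slice); this is a sketch of the idea, not the actual lowering used by the plugin:

%cst = arith.constant 0.0 : bf16
// A: 13 -> 16 (next multiple of m=4), 17 -> 24 (next multiple of k=8)
%A_padded = tensor.pad %A low[0, 0] high[3, 7] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : bf16
} : tensor<13x17xbf16> to tensor<16x24xbf16>
// B: 17 -> 24 (next multiple of k=8), 19 -> 20 (next multiple of n=4)
%B_padded = tensor.pad %B low[0, 0] high[7, 1] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : bf16
} : tensor<17x19xbf16> to tensor<24x20xbf16>
// Zero-initialized accumulator for the padded matmul.
%zero = arith.constant 0.0 : f32
%empty = tensor.empty() : tensor<16x20xf32>
%acc = linalg.fill ins(%zero : f32) outs(%empty : tensor<16x20xf32>) -> tensor<16x20xf32>
%C_padded = linalg.matmul ins(%A_padded, %B_padded : tensor<16x24xbf16>, tensor<24x20xbf16>)
                          outs(%acc : tensor<16x20xf32>) -> tensor<16x20xf32>
// Slice the original 13x19 result back out of the padded output.
%C = tensor.extract_slice %C_padded[0, 0] [13, 19] [1, 1] : tensor<16x20xf32> to tensor<13x19xf32>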

Above we incur an O(mk + kn) overhead for padding the inputs and an O(mn) overhead for slicing the output, as well as potentially expensive O(1) costs from having multiple kernels which need to be run. However, sometimes the padding operations can be merged with the operations which produce the operands (and similarly the slice operation with its consumer)? TODO: more details.

IREE has some support for this; links have been shared by @nirvedhmeshram and Stan:

https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Preprocessing/Common/PadToIntrinsics.cpp
https://github.com/MaheshRavishankar/iree/blob/dfa593104cc539abfcbf94572a6166eb79c5f413/compiler/src/iree/compiler/Preprocessing/Common/PadToIntrinsics.cpp#L1

in response to similar discussions (see for example https://discord.com/channels/689900678990135345/1169307503633383664/1232760520575029309).

Approach 2: padding in DMA:

The AIE has the option to pad in the MM2S channels of Memory Tiles. See:

https://docs.amd.com/r/en-US/am020-versal-aie-ml/AIE-ML-Memory-Tile-Memory

I don't think this is useful for our use case, because we want to pad in the MM2S channel of the Shim Tile, not in the MM2S channel of the Memory Tile.

Approach 3: Handle edges separately:

In approach 1 we padded the whole inputs; an alternative is to slice off the edges of the inputs and handle them separately (thus padding only the sliced-off edges, not whole tensors). This won't work in the k-dimension, since the k-edge contributes to every element of the output rather than to a disjoint output slice. A sketch of the m-edge split is given below.
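
The m-edge split might look as follows in upstream MLIR ops (the n-edge is analogous; k is deliberately left untouched here, which is exactly why this approach does not help with k=17). This is an illustrative sketch only:

// Main part: 12 rows, tiled by m=4; edge: the remaining 1 row.
%A_main = tensor.extract_slice %A[0, 0] [12, 17] [1, 1] : tensor<13x17xbf16> to tensor<12x17xbf16>
%A_edge = tensor.extract_slice %A[12, 0] [1, 17] [1, 1] : tensor<13x17xbf16> to tensor<1x17xbf16>
// Only the 1x17 edge needs zero-padding in m, up to the atomic m=4:
%cst = arith.constant 0.0 : bf16
%A_edge_padded = tensor.pad %A_edge low[0, 0] high[3, 0] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : bf16
} : tensor<1x17xbf16> to tensor<4x17xbf16>
// The two matmuls then write disjoint row ranges of the output
// (e.g. via tensor.insert_slice), so no full-tensor pad or slice is needed.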

newling commented 3 weeks ago

@jtuyls please feel free to update or add more info