nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator

Supporting matmul transpose variants #402

Open newling opened 3 weeks ago

newling commented 3 weeks ago

Support for all variants of matmul transpose

We should support all transpose variants of matmul / batch_matmul / GEMM: matmul(A, B), matmul(A, B.T), matmul(A.T, B), and matmul(A.T, B.T).
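For concreteness, a minimal numpy sketch (illustrative shapes only, not the actual tiling) showing that each variant reduces to the same canonical matmul once its inputs are explicitly transposed:

```python
import numpy as np

# Illustrative shapes only: the canonical product is (M, K) x (K, N).
M, K, N = 4, 8, 4
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
At, Bt = A.T, B.T  # the operands as the transposed variants receive them

ref = A @ B                            # matmul(A, B)
assert np.allclose(A @ Bt.T, ref)      # matmul(A, B.T)
assert np.allclose(At.T @ B, ref)      # matmul(A.T, B)
assert np.allclose(At.T @ Bt.T, ref)   # matmul(A.T, B.T)
```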

Only one of these variants has "native" support in the AIE core, i.e. the intrinsic expects a specific layout for A and B.

We therefore need to transpose data explicitly: either before the matmul, on the fly in the DMAs, and/or in the core.

For matmul(A, B.T), on the core we want to perform a matmul with a chunk of A:

[[ A00, A01, A02, A03, A04, A05, A06, A07], [ A10, A11, A12, A13, A14, A15, A16, A17], [ A20, A21, A22, A23, A24, A25, A26, A27], [ A30, A31, A32, A33, A34, A35, A36, A37]]

and a chunk of B:

[[ B00, B01, B02, B03, B04, B05, B06, B07], [ B10, B11, B12, B13, B14, B15, B16, B17], [ B20, B21, B22, B23, B24, B25, B26, B27], [ B30, B31, B32, B33, B34, B35, B36, B37]]

If we don't do any transposing of B in DMAs, the 32 values for the B matrix arrive on the core as

(i) [B00, B01, B02, B03, B04, B05, B06, B07, B10, B11, B12, B13, B14, B15, B16, B17, B20, B21, B22, B23, B24, B25, B26, B27, B30, B31, B32, B33, B34, B35, B36, B37]

For the matmul instruction for bf16 on AIE2, I think the expected layout is not transposed, i.e. it must be as follows in memory:

(ii) [B00, B10, B20, B30, B01, B11, B21, B31, B02, B12, B22, B32, B03, B13, B23, B33, B04, B14, B24, B34, B05, B15, B25, B35, B06, B16, B26, B36, B07, B17, B27, B37]

(Note that this might change in future architectures, i.e. the matmul intrinsic might expect B to be in the transposed layout already. To be confirmed, @erwei-xilinx.)
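For readability, a small numpy sketch (string labels instead of bf16 values, purely illustrative) that reproduces orderings (i) and (ii) from the 4x8 B tile above: (i) is the row-major flattening of the tile and (ii) is the column-major one:

```python
import numpy as np

# The 4x8 B tile above, with entries as string labels for readability.
B = np.array([[f"B{i}{j}" for j in range(8)] for i in range(4)])

layout_i  = B.flatten(order="C")  # (i): row-major, the untouched arrival order
layout_ii = B.flatten(order="F")  # (ii): column-major, the intrinsic's layout

print(layout_i[:8].tolist())   # ['B00', 'B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07']
print(layout_ii[:8].tolist())  # ['B00', 'B10', 'B20', 'B30', 'B01', 'B11', 'B21', 'B31']
```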

We cannot get to layout (ii) for bfloat16, as DMAs can't split 32-bit elements (and shouldn't even split 128-bit elements if you want to use the full DMA bandwidth): each adjacent pair of bf16 values shares a 32-bit word and must travel together, and (ii) separates such pairs.

So what is the best order in which we can deliver B to the core, to minimize the overhead of the rearrangement that the core must do? For example, we could deliver B as

(iii) [B00, B01, B10, B11, B20, B21, B30, B31, B02, B03, B12, B13, B22, B23, B32, B33, B04, B05, B14, B15, B24, B25, B34, B35, B06, B07, B16, B17, B26, B27, B36, B37]
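A sketch of the granularity argument (numpy, illustrative only), modelling each adjacent bf16 pair of (i) as an indivisible 32-bit word: (ii) would have to split words apart, while (iii) is just a transpose of the 4x4 grid of whole words:

```python
import numpy as np

B = np.array([[f"B{i}{j}" for j in range(8)] for i in range(4)])

# Group arrival order (i) into indivisible 32-bit words (adjacent bf16
# pairs): words[r][c] = (B[r][2c], B[r][2c+1]).
words = B.reshape(4, 4, 2)

# (ii) needs B00 next to B10; those live in different 32-bit words, so
# no permutation of whole words can produce it.
print(B.flatten(order="F")[:4].tolist())  # ['B00', 'B10', 'B20', 'B30']

# (iii) only permutes whole words: it is the transpose of the 4x4 grid
# of words, so every bf16 pair stays intact.
layout_iii = words.transpose(1, 0, 2).flatten()
print(layout_iii[:8].tolist())  # ['B00', 'B01', 'B10', 'B11', 'B20', 'B21', 'B30', 'B31']
```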

But would that be better than delivering B as (i)? Currently we deliver it as (i), and aievec handles this by rearranging the B matrix in the core using a single transpose (shuffle) intrinsic; see https://www.xilinx.com/htmldocs/xilinx2023_2/aiengine_ml_intrinsics/intrinsics/group__intr__gpvectorop__interleave.html#details. We haven't done any performance analysis of this approach. There might be a benefit to performing the transpose on larger tiles (see for example https://gitenterprise.xilinx.com/AIELibs/mllib/pull/581/files), which could somehow amortize the cost of doing the transpose (cc @jsetoain).
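As a reference model of what that core-side shuffle has to compute (not the actual aievec lowering), going from (i) to (ii) is exactly a transpose of the 4x8 tile:

```python
import numpy as np

B = np.array([[f"B{i}{j}" for j in range(8)] for i in range(4)])
layout_i = B.flatten()  # the order in which B arrives, option (i)

# The core-side shuffle must reinterpret the incoming vector as the
# 4x8 tile and emit its transpose, which is exactly layout (ii).
layout_ii = layout_i.reshape(4, 8).T.flatten()
print(layout_ii[:8].tolist())  # ['B00', 'B10', 'B20', 'B30', 'B01', 'B11', 'B21', 'B31']
```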

A completely different approach would be to perform all necessary / optimizing transposes before the matmul. These transposes might be mergeable with the producers of the operands. TODO: find out if there is any support for this transformation in IREE.
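As a sanity check of why such merging is legal for elementwise producers (a hypothetical relu stands in for the producer here; purely illustrative), the transpose commutes with the producer and can therefore be hoisted to wherever it is cheapest:

```python
import numpy as np

A = np.random.rand(4, 8).astype(np.float32)  # illustrative shapes
X = np.random.rand(4, 8).astype(np.float32)  # producer input for operand B

def producer(t):
    return np.maximum(t, 0.0)  # hypothetical elementwise producer (relu)

# matmul(A, producer(X).T) == matmul(A, producer(X.T)): the transpose
# commutes with the elementwise producer, so it can be moved before it
# and merged with whatever layout transformation feeds the producer.
assert np.allclose(A @ producer(X).T, A @ producer(X.T))
```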

We should probably support both approaches: before matmul AND during matmul.

cc @jtuyls thoughts?