[CPU] Enable DT for SVE for linalg.matmul

The level of support and performance for SVE and scalable vectors in IREE are quite encouraging. It's good time to start looking into Data Tiling.

[ ] Add tests for linalg.mmt4d in MLIR core (regression + e2e)
[ ] Lower linalg.mmt4d to SVE (through scalable vectorisation) - MLIR core
[ ] Unlock DT for SVE in IREE (in enumerateMatmulTileArm64) + add logic for "scalable flags".

Disclaimer: Somewhat related to https://github.com/openxla/iree/issues/14799, but I would like to look into DT first. Hence a dedicated ticket.

### Tasks

Supporting DT in the context of scalable vectors (SVE) and matrices (SME)

Key steps in Data Tiling that will require updating in order to support SVE or SME.

We shouldn't require any major/intrusive updates, hence keeping this overview fairly brief. Please let me know if you'd like me to expand and I can add more details.

0. Context + background

Here's a very brief primer on scalable vectors (starts at minute 10):

https://www.youtube.com/watch?v=jHk56V6cHN8&t=1s&ab_channel=LLVM

Note that we can only use "scalable" dims at the Vector type level. At the Tensor/MemRef type levels, we model "scalable" sizes using dynamic dimensions.

In the case of matrix multiplication (C += A * B), we would make the following dims scalable/dynamic:

SVE - N dim scalable (1 dim)
SME - N and M dims scalable (2 dims).

As usual, A is NxK, B is KxN and C is MxN.

1. Vectorising linalg.mmt4d

At the moment, "scalable" vectorisation in Linalg consists of two generic steps (*):

Tiling with scalable tile sizes (e.g. %c4 * vector.vscale) - this leads to tensors with dynamic shapes.
Masked vectorization - ATM that's the only available option for ops with dynamic shapes.

In order to support scalable vectorisation of linalg.mmt4d, we need to add:

support for masked vectorisation for linalg.mmt4d.

Since we only need the inner 2 dims to be scalable, I will assume that also for masking we should only consider the inner dims.

(*) Optionally, In cases where we can prove that every iteration uses "full" vectors, masks are folded away through canonicalisation patterns.

This is already work in progress

2. Vectorising tensor.pack/tensor.unpack (and tensor.pad)

Both tensor.pack and tensor.unpack support dynamic inner tiles - this should be sufficient for what we need. I've not attempted vectorising these Ops using scalable vectors, but from a quick investigation I don't see any obvious blockers.

One potential challenge is vector.transpose (generated during vectorisation) - this Op is a bit challenging in the context of scalable vectors as we don't really support vector shuffles. But we need to work around these limitations regardless of linalg.mmt4d and DT.

Not yet started

3. CPUMaterializeEncoding

IIUC, this is the earliest pass that will require updating (everything preceding this seems fairly generic). In particular, we will need a way to mark certain dimensions as scalable. I believe that's going to require a bit more than merely updating:

enumerateMatmulTileArm64

Instead, we will most likely require "tweaks" similar to what we've been adding in KernelDispatch.cpp (see e.g. vecScalableDims in setMatmulRootConfig). (*)

To give you a flavour of what's to come - this is what you'd get today for inputs with dynamic shapes (an abbreviated example. Note that only the outer dims are dynamic:

  %pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
  // Pack matrix B - N dim is static
  %pack_3 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1] into %12 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
  %pack_6 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %15 : tensor<?x?xf32> -> tensor<?x?x8x8xf32>
  %16 = linalg.mmt4d ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x8x1xf32>) outs(%pack_6 : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>
  %unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %17 : tensor<?x?x8x8xf32> -> tensor<?x?xf32>

Once we add scalable sizes to the mix, we will have something like this (outer dims are dynamic as well as the N dimension):

  %pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
  // Pack matrix B - N dim is dynamic
  %pack_3 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [[8], 1] into %12 : tensor<?x?xf32> -> tensor<?x?x?x1xf32>
  %pack_6 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, [8]] into %15 : tensor<?x?xf32> -> tensor<?x?x8x?xf32>
  %16 = linalg.mmt4d ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%pack_6 : tensor<?x?x?x8xf32>) -> tensor<?x?x?x8xf32>
  %unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, [8]] into %17 : tensor<?x?x8x?xf32> -> tensor<?x?xf32>

That's an example for SVE for which we'd make the N dim scalable. For SME, we would make both N and M dims scalable.

(*) Btw, we are trying to improve how "scalable flags" are represented, there's some discussion here:

https://github.com/openxla/iree/pull/16435

Not yet started

4. OutlineDispatchRegions

ATM, for DT applied to matrices with dynamic shapes, the outer tile sizes are added to the output/input params and passed between dispatch regions. Here's a dispatch region that's created for linalg.mmt4d:

func.func @pipeline_dispatch_3(
     %arg0: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
     %arg1: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
     %arg2: !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>, 
     %arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index) {

  %0 = flow.dispatch.workload.ordinal %arg3, 0 : index
  %1 = flow.dispatch.workload.ordinal %arg4, 1 : index
  %2 = flow.dispatch.workload.ordinal %arg5, 2 : index
  %3 = flow.dispatch.workload.ordinal %arg6, 3 : index
  %4 = flow.dispatch.workload.ordinal %arg7, 4 : index
  %5 = flow.dispatch.workload.ordinal %arg8, 5 : index
  %6 = flow.dispatch.tie_shape %arg0 : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%0, %1}
  %7 = flow.dispatch.tie_shape %arg1 : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%2, %3}
  %8 = flow.dispatch.tie_shape %arg2 : !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>{%4, %5}
  %9 = flow.dispatch.tensor.load %6, offsets = [0, 0, 0, 0], sizes = [%0, %1, 8, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%0, %1} -> tensor<?x?x8x1xf32>
  %10 = flow.dispatch.tensor.load %7, offsets = [0, 0, 0, 0], sizes = [%2, %3, 8, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%2, %3} -> tensor<?x?x8x1xf32>
  %11 = flow.dispatch.tensor.load %8, offsets = [0, 0, 0, 0], sizes = [%4, %5, 8, 8], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>{%4, %5} -> tensor<?x?x8x8xf32>
  %12 = linalg.mmt4d ins(%9, %10 : tensor<?x?x8x1xf32>, tensor<?x?x8x1xf32>) outs(%11 : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>
  flow.dispatch.tensor.store %12, %8, offsets = [0, 0, 0, 0], sizes = [%4, %5, 8, 8], strides = [1, 1, 1, 1] : tensor<?x?x8x8xf32> -> !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>{%4, %5}

  return
}

For "scalable" sizes (SVE), we will need to extend this so that also the inner tile size for matrix B (corresponding to dim N) is correctly propagated, e.g.:

func.func @pipeline_dispatch_3(
    %arg0: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
    %arg1: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
    %arg2: !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>, 
    %arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index, 
    // Additional parameter for the N dim
    %N: index) { 

  // (...)
  %7 = flow.dispatch.tie_shape %arg1 : !flow.dispatch.tensor<readonly:tensor<?x?x?x1xf32>>{%2, %3, %N}
  // (...)
  %10 = flow.dispatch.tensor.load %7, offsets = [0, 0, 0, 0], sizes = [%2, %3, %N, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x?x1xf32>>{%2, %3, %N} -> tensor<?x?x?x1xf32>
  // (...)
  %12 = linalg.mmt4d ins(%9, %10 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%11 : tensor<?x?x8x?xf32>) -> tensor<?x?x8x?xf32>
  flow.dispatch.tensor.store %12, %8, offsets = [0, 0, 0, 0], sizes = [%4, %5, 8, %N], strides = [1, 1, 1, 1] : tensor<?x?x8x?xf32> -> !flow.dispatch.tensor<readwrite:tensor<?x?x8x?xf32>>{%4, %5, %N}
  return
}

It feels like a fairly straightforward extension, but it's also the step that I understand the least. Hopefully I am not missing something fundamental.

Not yet started

Other notable changes

There's a couple of other elements that are going to be a bit tricky.

linalg.mmt4d assumes that in A*B, it's the B matrix (RHS) that's transposed. That makes sense for matmuls implemented using dot-product. However, for matmuls implemented using outer-products (that's what SME does), it's matrix A that's transposed (LHS). So either linalg.mmt4d needs to be updated to allow that or we need a new Op. Note that this only affects SME.
Tile size bounds are often calculated as (8 is the inner tile size): #map = affine_map<()[s0] -> (s0 ceildiv 8)>. For scalable vectors, we will need to replace 8 with %c8 * vector.vscale.This will probably complicate the generated IR.

Final words

Thanks for taking a look - have I missed anything? Your feedback is much appreciated 🙏🏻

I'll spend some time to study this later.

(cc @bjacob @MaheshRavishankar @Max191 @pashu123 )

Hey @banach-space . Thanks for the description. I skimmed through this once and have a few commints.

3. CPUMaterializeEncoding

IIUC, this is the earliest pass that will require updating (everything preceding this seems fairly generic). In particular, we will need a way to mark certain dimensions as scalable. I believe that's going to require a bit more than merely updating:

enumerateMatmulTileArm64

Instead, we will most likely require "tweaks" similar to what we've been adding in KernelDispatch.cpp (see e.g. vecScalableDims in setMatmulRootConfig). (*)

To give you a flavour of what's to come - this is what you'd get today for inputs with dynamic shapes (an abbreviated example. Note that only the outer dims are dynamic:
  %pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
  // Pack matrix B - N dim is static
  %pack_3 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [8, 1] into %12 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
  %pack_6 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %15 : tensor<?x?xf32> -> tensor<?x?x8x8xf32>
  %16 = linalg.mmt4d ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x8x1xf32>) outs(%pack_6 : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>
  %unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 8] into %17 : tensor<?x?x8x8xf32> -> tensor<?x?xf32>
Once we add scalable sizes to the mix, we will have something like this (outer dims are dynamic as well as the N dimension):
  %pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
  // Pack matrix B - N dim is dynamic
  %pack_3 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [[8], 1] into %12 : tensor<?x?xf32> -> tensor<?x?x?x1xf32>
  %pack_6 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, [8]] into %15 : tensor<?x?xf32> -> tensor<?x?x8x?xf32>
  %16 = linalg.mmt4d ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%pack_6 : tensor<?x?x?x8xf32>) -> tensor<?x?x?x8xf32>
  %unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, [8]] into %17 : tensor<?x?x8x?xf32> -> tensor<?x?xf32>
That's an example for SVE for which we'd make the N dim scalable. For SME, we would make both N and M dims scalable.

(*) Btw, we are trying to improve how "scalable flags" are represented, there's some discussion here:

[CPU] Simplify how tile sizes are updated #16435

Not yet started

I think the only thing I ask for here is that this be purely opt-in. Basically any body not working with scalable vectors or targeting such devices should have to worry about it. This is related to the discussion you pointed to earlier. The easiest thing I can think of is when we are materializing encodings, we look at whether SME/SVE are enabled on the target architecture (and maybe have another level of control to avoid using SME/SVE anyway) and use materialize encodings to insert the necessary constructs.

4. OutlineDispatchRegions

func.func @pipeline_dispatch_3(
     %arg0: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
     %arg1: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
     %arg2: !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>, 
     %arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index) {

  %0 = flow.dispatch.workload.ordinal %arg3, 0 : index
  %1 = flow.dispatch.workload.ordinal %arg4, 1 : index
  %2 = flow.dispatch.workload.ordinal %arg5, 2 : index
  %3 = flow.dispatch.workload.ordinal %arg6, 3 : index
  %4 = flow.dispatch.workload.ordinal %arg7, 4 : index
  %5 = flow.dispatch.workload.ordinal %arg8, 5 : index
  %6 = flow.dispatch.tie_shape %arg0 : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%0, %1}
  %7 = flow.dispatch.tie_shape %arg1 : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%2, %3}
  %8 = flow.dispatch.tie_shape %arg2 : !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>{%4, %5}
  %9 = flow.dispatch.tensor.load %6, offsets = [0, 0, 0, 0], sizes = [%0, %1, 8, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%0, %1} -> tensor<?x?x8x1xf32>
  %10 = flow.dispatch.tensor.load %7, offsets = [0, 0, 0, 0], sizes = [%2, %3, 8, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>{%2, %3} -> tensor<?x?x8x1xf32>
  %11 = flow.dispatch.tensor.load %8, offsets = [0, 0, 0, 0], sizes = [%4, %5, 8, 8], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>{%4, %5} -> tensor<?x?x8x8xf32>
  %12 = linalg.mmt4d ins(%9, %10 : tensor<?x?x8x1xf32>, tensor<?x?x8x1xf32>) outs(%11 : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>
  flow.dispatch.tensor.store %12, %8, offsets = [0, 0, 0, 0], sizes = [%4, %5, 8, 8], strides = [1, 1, 1, 1] : tensor<?x?x8x8xf32> -> !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>{%4, %5}

  return
}

For "scalable" sizes (SVE), we will need to extend this so that also the inner tile size for matrix B (corresponding to dim N) is correctly propagated, e.g.:

func.func @pipeline_dispatch_3(
    %arg0: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
    %arg1: !flow.dispatch.tensor<readonly:tensor<?x?x8x1xf32>>, 
    %arg2: !flow.dispatch.tensor<readwrite:tensor<?x?x8x8xf32>>, 
    %arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index, 
    // Additional parameter for the N dim
    %N: index) { 

  // (...)
  %7 = flow.dispatch.tie_shape %arg1 : !flow.dispatch.tensor<readonly:tensor<?x?x?x1xf32>>{%2, %3, %N}
  // (...)
  %10 = flow.dispatch.tensor.load %7, offsets = [0, 0, 0, 0], sizes = [%2, %3, %N, 1], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?x?x1xf32>>{%2, %3, %N} -> tensor<?x?x?x1xf32>
  // (...)
  %12 = linalg.mmt4d ins(%9, %10 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%11 : tensor<?x?x8x?xf32>) -> tensor<?x?x8x?xf32>
  flow.dispatch.tensor.store %12, %8, offsets = [0, 0, 0, 0], sizes = [%4, %5, 8, %N], strides = [1, 1, 1, 1] : tensor<?x?x8x?xf32> -> !flow.dispatch.tensor<readwrite:tensor<?x?x8x?xf32>>{%4, %5, %N}
  return
}

It feels like a fairly straightforward extension, but it's also the step that I understand the least. Hopefully I am not missing something fundamental.

We probably need to get a bit more idea of what is needed. All the dispatch region formation is basically done using "straight-forward" outlining. Any extra parameter you need should pretty much fall out of this process. So I think this part should basically be untouched by anything related to SME or SVE. We should talk about this in more details.

Not yet started

Other notable changes

There's a couple of other elements that are going to be a bit tricky.

linalg.mmt4d assumes that in A*B, it's the B matrix (RHS) that's transposed. That makes sense for matmuls implemented using dot-product. However, for matmuls implemented using outer-products (that's what SME does), it's matrix A that's transposed (LHS). So either linalg.mmt4d needs to be updated to allow that or we need a new Op. Note that this only affects SME.

This is probably a new op. A linalg.mtm4d?

Tile size bounds are often calculated as (8 is the inner tile size): #map = affine_map<()[s0] -> (s0 ceildiv 8)>. For scalable vectors, we will need to replace 8 with %c8 * vector.vscale.This will probably complicate the generated IR.

Similar to tiling, this should just fall out of changes during materialize encoding.

Final words

Thanks for taking a look - have I missed anything? Your feedback is much appreciated 🙏🏻

oh yes, sorry that I didn't spot this earlier. There is a misunderstanding about what mmt4d does.

linalg.mmt4d assumes that in A*B, it's the B matrix (RHS) that's transposed. That makes sense for matmuls implemented using dot-product. However, for matmuls implemented using outer-products (that's what SME does), it's matrix A that's transposed (LHS). So either linalg.mmt4d needs to be updated to allow that or we need a new Op. Note that this only affects SME.

mmt4d does not have any problem with outer products. The reason is that when an array has an effectively 1D shape (meaning, its formal shape may be of any rank, but all but one of its dimensions have unit size, e.g. 1x1x1x5x1), there is only one possible contiguous layout for it. For example, the LHS tile in a linalg.mmt4d is always a 2D shape, M0xK0. To say it's an outer-product is to say that K0 == 1. So then it's M0x1. Then it doesn't matter that in mmt4d the LHS is un-transposed row-major: on the M0x1 shape, which is essentially 1D in the above sense, the row-major and column-major layouts are the same thing.

In fact, mmt4d is used with outer-product kernels all the time -- SME is not special at all here, and if anything, outer-product kernels are the most common case. For example, on both Arm NEON and x86 AVX*, all the f32 kernels are outer-product, characterized by K0 == 1 in the triples enumerated here for f32 (the order is {M0, N0, K0} so e.g. {8, 8, 1} is an outer-product kernel): https://github.com/openxla/iree/blob/07a854cac43adf1e120d0e459497f8216568a747/compiler/src/iree/compiler/Codegen/Common/CPU/CPUMaterializeEncodingPass.cpp#L107-L112

To find a case that would defeat the current linalg.mmt4d model, you need to go to places where the LHS or RHS tile is not essentially 1D and the storage order is not row-major LHS, column-major RHS. Since these cases only arise as a result of design mistakes in the SIMD ISA[^1], they are uncommon, and in particular Arm's ISAs including SME are entirely fine here. An example of an ISA afflicted by the flaw that defeats linalg.mmt4d is Intel AMX.

[^1]: The combination of storage orders in Arm's ISAs including SME is the only one that scales by putting adjacent tiles along the M and N dimensions without increasing the layout dimensionality, so it is clearly the right thing to do in SIMD ISA design, and Arm SME (as well as NEON and SVE) should be a textbook example for other vendors to follow here.

Thanks for the feedback 🙏🏻

For those of you who missed it, we discussed the linalg.mmt4d part of this proposal in the mai-tai call yesterday (March 26th). Based on that conversation and Benoit's generous feedback above, I am realising that we probably (*) won't need to touch linalg.mmt4d and that the current abstraction is sufficient.

In fact, mmt4d is used with outer-product kernels all the time -- SME is not special at all here, and if anything, outer-product kernels are the most common case.

Yes, in this sense linalg.mmt4d is very similar to linalg.matmul and that's the key part that I missed when drafting my post. In fact, Vector dialect progressive lowering look like this:

linalg.mmt4d -> vector.multi_reduction -> vector.contract -> vector.outer_product.

This is similar to linalg.matmul (for which SME "just works" ™️ ).

@bjacob, thanks for pointing this out and for your kind comments on Arm's ISAs :)

Now ...

3. CPUMaterializeEncoding

I think the only thing I ask for here is that this be purely opt-in. Basically any body not working with scalable vectors or targeting such devices should have to worry about it. This is related to the discussion you pointed to earlier. The easiest thing I can think of is when we are materializing encodings, we look at whether SME/SVE are enabled on the target architecture (and maybe have another level of control to avoid using SME/SVE anyway) and use materialize encodings to insert the necessary constructs.

I read your comment as "try to avoid the approach taken in KernelDispatch.cpp" :) That should be possible in this case (this logic doesn't seem as generic as KernelDispatch.cpp). When targeting scalable vectors (e.g. SVE or SME), we would consider sizes in enumerateMatmulTileArm64 as base sizes that are to be multiplied by vector.vscale, e.g.:

%vscale = vector.vscale
%tile_size = arith.muli %vscale, %c8 : index
// Note the inner tile sizes
%pack_3 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [%tile_size, 1] into %12 : tensor<?x?xf32> -> tensor<?x?x?x1xf32>

Is that what you had in mind?

Thanks again for taking a look - I appreciate that this is quite dense.

-Andrzej

(*) I'm saying probably as today I can't really lower linalg.mmt4d to SVE (one scalable dim), let alone SME (two scalable dims) - I'd like to do that to confirm this 100%.

One potential challenge is vector.transpose (generated during vectorisation) - this Op is a bit challenging in the context of scalable vectors as we don't really support vector shuffles. But we need to work around these limitations regardless of linalg.mmt4d and DT.

We always need to handle the transpose for both DT and non-DT cases, so it is not a big concern to me. The transpose is introduced (for LHS) because you want it to be in that layout. Assuming that is also the layout you pre-pack in non-DT case (i.e., linalg.matmul), you still need to handle the transpose. It is just a matter of where to handle the transpose. DT opens the other door that we can revisit the layouts optimization at graph/model level. The overheads of relayout ops hopefully can be amortized in fusion, const-eval, etc. We have observed it for some models that we've been tracking.

(*) I'm saying probably as today I can't really lower linalg.mmt4d to SVE (one scalable dim), let alone SME (two scalable dims) - I'd like to do that to confirm this 100%.

This is definitely okay as linalg.mmt4d is not a special op. It is just a Linalg contraction op, and it is even easier than convolution ops. I think my question would be which dimension would be scalable dim? In your IRs, it looks like the inner tiles will be related to %vscale. The linalg.mmt4d and tensor.pack will be formed in different dispatches, which means that they will be launched in different kernels. Do we need to make sure the %vscale is the same value in both kernels?

One potential solution is going with ukernels. There is a path to query tile sizes during runtime through ukernels, but it now only works on VMVX backend. We'll need to port the functionality to our llvm-cpu backend if that's the case.

%pack = tensor.pack %2 padding_value(%cst : f32)
  outer_dims_perm = [0, 1]
  inner_dims_pos = [0, 1]
  inner_tiles = [8, 1]
  into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
// Pack matrix B - N dim is dynamic
%pack_3 = tensor.pack %5 padding_value(%cst : f32)
  outer_dims_perm = [1, 0]
  inner_dims_pos = [1, 0]
  inner_tiles = [[8], 1]
  into %12 : tensor<?x?xf32> -> tensor<?x?x?x1xf32>
%pack_6 = tensor.pack %8 padding_value(%cst : f32)
  outer_dims_perm = [0, 1]
  inner_dims_pos = [0, 1]
  inner_tiles = [8, [8]]
  into %15 : tensor<?x?xf32> -> tensor<?x?x8x?xf32>
%16 = linalg.mmt4d
  ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>)
  outs(%pack_6 : tensor<?x?x?x8xf32>) -> tensor<?x?x?x8xf32>
%unpack = tensor.unpack %16
  outer_dims_perm = [0, 1]
  inner_dims_pos = [0, 1]
  inner_tiles = [8, [8]]
  into %17 : tensor<?x?x8x?xf32> -> tensor<?x?xf32>

The other question we need to think of is that what the data layout should be in memory. The other way I can think for data-tiling on SVE is making the outer reduction dimension be scalable. We still have inner tile sizes be base vector sizes (like below snippet). We mark outer dims as scalable, so we can still load a big contiguous memory chunk for each computation.

There are six loops in the mmt4d op, which is M0, N0, K0, M1, N1, K1. The LHS will be packed to M0K0M1K1, the RHS will be packed to N0K0N1K1, and the output will be packed to N0M0N1M1. If the inner tile sizes are all static and we mark K0 as scalable, we have

The %vscale is not required to be the same for pack and mmt4d kernels.
We can set [1, 1, [1], 8, 8, 1] vector size to the mmt4d op, and it still can get vectorized. Then the shapes after tiling are a. LHS=1x[1]x8x1 b. RHS=1x[1]x8x1 c. OUT=1x1x8x8

If this is what we want, I can see a path to enable it for SVE. If not, we can revisit the ukernel path or define another linalg op. We don't always need to use mmt4d op in data-tiling.

%pack = tensor.pack %2 padding_value(%cst : f32)
  outer_dims_perm = [0, 1]
  inner_dims_pos = [0, 1]
  inner_tiles = [8, 1]
  into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>
%pack_3 = tensor.pack %5 padding_value(%cst : f32)
  outer_dims_perm = [1, 0]
  inner_dims_pos = [1, 0]
  inner_tiles = [8, 1]
  into %12 : tensor<?x?xf32> -> tensor<?x?x?x1xf32>
%pack_6 = tensor.pack %8 padding_value(%cst : f32)
  outer_dims_perm = [0, 1]
  inner_dims_pos = [0, 1]
  inner_tiles = [8, 8]
  into %15 : tensor<?x?xf32> -> tensor<?x?x8x?xf32>
%16 = linalg.mmt4d
  ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x8x1xf32>)
  outs(%pack_6 : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>
%unpack = tensor.unpack %16
  outer_dims_perm = [0, 1]
  inner_dims_pos = [0, 1]
  inner_tiles = [8, [8]]
  into %17 : tensor<?x?x8x8xf32> -> tensor<?x?xf32>

Thanks for taking a look @hanhanW !

The linalg.mmt4d and tensor.pack will be formed in different dispatches, which means that they will be launched in different kernels. Do we need to make sure the %vscale is the same value in both kernels?

Yes.

Imagine that we have two hypothetical CPUs:

CPU-1 with vscale = 1 (128-bit wide vectors),
CPU-2 with vscale = 4 (512-bit wide vectors).

My current design will be incorrect if packing happens on CPU-1 and MMT4D is run on CPU-2. Now, this is very hypothetical and couldn't happen in practice today (based on what hardware is available). Once SME becomes available, we will have the option to use either:

"regular" SVE (host CPU), or
"Streaming" SVE (SME core).

These will likely have different vscale. Hence, we will need to make sure that:

tensor.pack + linalg.mmt4d + tensor.unpack

are all run with either "Streaming" SVE enabled or disabled. This is based on a ref assembly implementation that I have access to - it should be available publicly soon (I will ask for ETAs after the Easter weekend).

One potential solution is going with ukernels.

We need to make sure that whatever we design/implement will work with and *without ukernels. That's basically the requirement that we have :)

The other question we need to think of is that what the data layout should be in memory.

Looking at RHS (SVE example - 1 scalable dim), this would still be N0K0N1K1, but N1 would be 8 * vscale rather than plain 8.

The other way I can think for data-tiling on SVE is making the outer reduction dimension be scalable.

Interesting idea, I haven't really thought about that. How would we make sure that the generated fmlas operate on scalable (e.g. vector<[8]xf32>) rather than fixed-width vectors (e.g. vector<8xf32>)? This should be possible, but feels like extra work best avoided? (i.e. it's an additional "challenge" compared to making the inner tile sizes "scalable" instead).

We don't always need to use mmt4d op in data-tiling.

IIUC, with my current proposal we should be able to re-use linalg.mmt4d, right?

Good to see we're at the point where the way we lower to pack/unpack today before dispatch region formation is unable to satisfy the requirements. We've known about it forever and now we have a concrete example we can use to work through things! We need to move towards encodings on the tensors instead of explicitly baking out the exact pack/unpack ops so early in the pipeline. Going to be non-trivial and I don't think it blocks this work but it is something we need to really start on - and more strongly avoid any new code assuming that pack/unpack is baked out prior to dispatch region formation/device placement - I'm looking at you, MaterializeHomogeneousEncodingsPass and host-side CPUMaterializeUpperBoundTileSizePass 😠

Imagine that we have two hypothetical CPUs: CPU-1 with vscale = 1 (128-bit wide vectors), CPU-2 with vscale = 4 (512-bit wide vectors).

I have a naive question. Is vscale a fixed value given a fixed CPU? E.g., Does the CPU-1 always have vscale=1 and CPU-2 always have vscale=4? If so, I wonder if that can be queried during runtime. In theory, the host side can store the value to a global variable, so the device would know how to relayout the data. (ukernel was a heavy term. What I meant is using a global variable and I messed up the context with existing path.)

Is vscale a fixed value given a fixed CPU? E.g., Does the CPU-1 always have vscale=1 and CPU-2 always have vscale=4?

That's a totally valid question and the answer is YES :) (given this hypothetical implementation where CPU-1 has vscale=1 and CPU-2 has vscale=4). In fact, it's one of they key design principles of SVE:

the value of vscale is not know at compile time, but known (and fixed) at runtime.

If want to better grasp this concept, you can play with this example:

func.func @get_vscale() -> index {
  %vs = vector.vscale
  return %vs : index
}

To compile:

mlir-opt -test-lower-to-llvm file.mlir | mlir-translate --mlir-to-llvmir | llc -mtriple=aarch64 -mattr=+sve

I get the following sequence of ASM:

    rdvl    x8, #1             ; Number of bytes in an SVE vector reg.
    lsr x0, x8, #4         ; Number of 128bit "chunks" in an SVE vector reg. (i.e. vscale)

The actual value will depend whether this is run on CPU-1 or CPU-2 (the assembly would be identical, though we may need to enable "Streaming" SVE if CPU-2 is an SME device/accelerator).

If so, I wonder if that can be queried during runtime. In theory, the host side can store the value to a global variable, so the device would know how to relayout the data.

Yes, that would be possible. We would do something like this:

  // 1. Compute inner tile size based on CPU-2 configuration
  // Similar as @get_vscale above, but we'd make sure that even when run on CPU-1, 
  // the return value would correspond to CPU-2 (we know how to do it and it's not hard).
  %vs_cpu2 = @get_ssve_vscale() : index
  %inner_tile_size = @compute_tile_size(%vs_cpu2 : index)

  // 2. Pack A
  // Can be run on CPU-1 or CPU-2
  %pack = tensor.pack %2 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, 1] into %10 : tensor<?x?xf32> -> tensor<?x?x8x1xf32>

  // 2. Pack matrix B
  // N dim is dynamic - calculated using @compute_tile_size. Can be run on CPU-1 or CPU-2
  %pack_3 = tensor.pack %5 padding_value(%cst : f32) outer_dims_perm = [1, 0] inner_dims_pos = [1, 0] inner_tiles = [%inner_tile_size, 1] into %12 : tensor<?x?xf32> -> tensor<?x?x?x1xf32>

  // 3. Pack matrix C
  // N dim is dynamic - calculated using @compute_tile_size. Can be run on CPU-1 or CPU-2
  %pack_6 = tensor.pack %8 padding_value(%cst : f32) outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, %inner_tile_size] into %15 : tensor<?x?xf32> -> tensor<?x?x8x?xf32>

  // 4. MMT4D
  // Inner tile size calculated for CPU-2 - _must_ be run on CPU-2
  %16 = linalg.mmt4d ins(%pack, %pack_3 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%pack_6 : tensor<?x?x?x8xf32>) -> tensor<?x?x?x8xf32>

  // 5. Unpack the result
  // Can be run on CPU-1 or CPU-2
  %unpack = tensor.unpack %16 outer_dims_perm = [0, 1] inner_dims_pos = [0, 1] inner_tiles = [8, %inner_tile_size] into %17 : tensor<?x?x8x?xf32> -> tensor<?x?xf32>

However:

I prefer to refrain from this in our first iteration,
Instead, I suggest that we assume that everything will be run on either CPU-1 or CPU-2.

My main rationale - lets prove the concept in the more basic set-up (which is already quite complex). Only then I'd expand to more complex scenarios. Also, we are yet to investigate the heuristics for switching between CPU-1 and CPU-2 (and the impact on perf).

Thanks for all the details, this is very helpful. I was afraid that the current data-tiling does not serve the SVE's needs. But it now looks very okay to me.

the value of vscale is not know at compile time, but known (and fixed) at runtime. Instead, I suggest that we assume that everything will be run on either CPU-1 or CPU-2.

Yes, let's start with a single device. We are also having more discussion about data-tiling with heterogeneous devices, and figuring next steps. See https://github.com/openxla/iree/issues/16933#issuecomment-2030909099 for more details. It's good to know that they are fixed, so we don't need to build more aggressive features.

The target device information is carried on executables, which means that we generate the code for the specific CPU. I.e., vscale is always the same between different executables -- if they have the same target_device. There would be an amount of work to support multi devices, so let's start with a single device using current infra. Running which dispatch on which device is jobs of stream/hal dialect; we don't need to worry about them at this moment.

I think the next question is that which dimension do you want to mark scalable? I played a bit with the below snippet and can get it compile until ConvertToLLVM pass. The below snippet could help us scope the work.

#config = #iree_codegen.lowering_config<tile_sizes = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [1, 1, 0, [8], 8, 0], [0, 0, 1, 0, 0, 1], [0, 0, 0, 0, 0, 0]]>
#translation = #iree_codegen.translation_info<Mmt4dTilingExpert>
#compilation = #iree_codegen.compilation_info<lowering_config = #config, translation_info = #translation>
module {
  func.func @foo(%arg0: tensor<?x?x8x1xf32>, %arg1: tensor<?x?x?x1xf32>, %arg2: tensor<?x?x8x?xf32>) -> tensor<?x?x8x?xf32> {
    %0 = linalg.mmt4d {compilation_info = #compilation} ins(%arg0, %arg1 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%arg2 : tensor<?x?x8x?xf32>) -> tensor<?x?x8x?xf32>
    return %0 : tensor<?x?x8x?xf32>
  }
}

We can get the below vector.contract op when we mark M1 scalable.

%75 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d2, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d1, d2, d4, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d3, d4)>],
    iterator_types = ["parallel", "parallel", "reduction", "parallel", "parallel", "reduction"],
    kind = #vector.kind<add>
  }
  %72, %73, %74 : vector<1x1x[8]x1xf32>, vector<1x1x8x1xf32> into vector<1x1x[8]x8xf32>

To repro the IR dump:

Apply the patch
Run iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=+sve ~/mmt4d.mlir -o /tmp/z.vmfb

I think the next question is that which dimension do you want to mark scalable?

For linalg.matmul we make the N dim scalable, so for linalg.mmt4d it should probably be N1. That's for SVE. For SME it would be N1 and M1.

I played a bit with the below snippet and can get it compile until ConvertToLLVM pass.

Nice, thanks for sharing! In my 2nd post in this ticket you will find a few PRs where I'm trying to make masked vectorisation work for linalg.mmt4d - that's meant to unlock "scalable" vectorisation. I've not re-evaluated since merging those - I wanted to post this RFC and see what you think before I continue 😅

Now, your patch makes me realise that perhaps my pass pipeline was incomplete 🤔 I was actually using standalone MLIR as my reference point, but I should probably switch to IREE sooner rather than later.

I was actually using standalone MLIR as my reference point, but I should probably switch to IREE sooner rather than later.

I think it can be reproduced using mlir-opt, see the below example. The IREE version is just for pre-setting tile sizes and pipeline, so I can quickly run some experiments. :)

// Run "mlir-opt --transform-interpreter repro.mlir"
module {
  func.func @foo(%arg0: tensor<?x?x8x1xf32>, %arg1: tensor<?x?x?x1xf32>, %arg2: tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32> {
    %0 = linalg.mmt4d ins(%arg0, %arg1 : tensor<?x?x8x1xf32>, tensor<?x?x?x1xf32>) outs(%arg2 : tensor<?x?x8x8xf32>) -> tensor<?x?x8x8xf32>
    return %0 : tensor<?x?x8x8xf32>
  }
  module attributes {transform.with_named_sequence} {
    transform.named_sequence @__transform_main(%arg0: !transform.any_op {transform.readonly}) {
      %0 = transform.structured.match ops{["linalg.mmt4d"]} in %arg0 : (!transform.any_op) -> !transform.any_op
      %tiled_linalg_op, %loops:6 = transform.structured.tile_using_for %0[1, 1, 1, [8], 8, 1] : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
      %1 = transform.structured.match ops{["linalg.mmt4d"]} in %arg0 : (!transform.any_op) -> !transform.any_op
      transform.structured.vectorize %tiled_linalg_op vector_sizes [1, 1, 1, [8], 8, 1] : !transform.any_op
      transform.yield
    }
  }
}

It generates the vector_reduction<add> which can be folded into vector.contract:

%24 = vector.mask %23 { vector.multi_reduction <add>, %22, %21 [2, 5] : vector<1x1x1x[8]x8x1xf32> to vector<1x1x[8]x8xf32> } : vector<1x1x1x[8]x8x1xi1> -> vector<1x1x[8]x8xf32>

iree-org / iree