intel / graph-compiler


[RFC] Plan to introduce packed_matmul op and packing_map attr for flexible packing matmul representation in upstream mlir #137


LongshengDu commented 2 months ago

Motivation

The current linalg dialect only supports fixed input/output shapes for structured named matmul ops, so if we want to define different packings for matmul in the existing framework, we have to add each one of them as a separate op. Notably, there are a lot of packing methods for matmuls, and many of them are hardware related (BFMMLA, VNNI); even some simple packings require more than a few ops to cover. For example, linalg.mmt4d covers (MNmn += Mkmk * NKnk), but for (MNmn += Mkmk * NKkn), etc., there is no named op to represent them. Fundamentally, we want to avoid adding a separate op for every single combination.

Thus this post presents a new flexible linalg.packed_matmul op that uses a linalg.packing_map attr to map the dimensions of different matmul packings, helping developers use a named op to represent the matmuls they need.

Background

For a simple matmul with input/output shapes (M,N) += (M,K) * (K,N) and dims A(0,1), B(0,1), C(0,1):
M is mapped as A(0) -> C(0)
N is mapped as B(1) -> C(1)
K is mapped as A(1) -> B(0)

Three nested loops are required to carry out the calculation:

for (m : range(M))
  for (n : range(N))
    for (k : range(K))
      C[m, n] += A[m, k] * B[k, n]

For a packed matmul, the three dimension kinds M, N, K are still present inside the input/output shapes; once the mapping between the three kinds of dims is determined, the calculation is fully specified.

For example, a VNNI packed matmul can be represented as C(M,N,M0,N0) += A(M,K,M0,K0*K1) * B(N,K,K0,N0,K1). So for dims A(0,1,2,3), B(0,1,2,3,4), C(0,1,2,3):
M is mapped as A(0) -> C(0); A(2) -> C(2)
N is mapped as B(0) -> C(1); B(3) -> C(3)
K is mapped as A(1) -> B(1); A(3) -> B(2, 4)

In this case, seven nested loops are required to carry out the calculation, which can be represented as:

for (m : range(M))
  for (n : range(N))
    for (m0 : range(M0))
      for (n0 : range(N0))
        for (k : range(K))
          for (k0 : range(K0))
            for (k1 : range(K1))
              C[m, n, m0, n0] += A[m, k, m0, k0 * K1 + k1] * B[n, k, k0, n0, k1]

The linalg.packing_map Attr

For example, in A[a,b] -> B[x,y,z], if dim [a] corresponds to dim [x] and dim [b] corresponds to the packed dims [y,z], we can express this as linalg.packing_map<[a] -> [x]> and linalg.packing_map<[b] -> [y,z]>, where the dims mapping order is A -> B.

linalg.packing_map is defined as an attr that takes two integer array parameters representing a mapping between two sets of sorted indices. The verifier must check that one of the two arrays contains exactly one index, since multi-dims to multi-dims mappings are not allowed. This defines a 1->N index-set mapping: src is the single index, dst is the multi-dims index list. Helpers are provided to query the mapping order (first->second or first<-second) and the src/dst indices.

def PackingMapAttr : Linalg_Attr<"PackingMap", "packing_map"> {
  let parameters = (ins ArrayRefParameter<"uint64_t">:$first,
                        ArrayRefParameter<"uint64_t">:$second);

  let assemblyFormat = "`<` `[` $first `]` `->` `[` $second `]` `>`";

  let extraClassDeclaration = [{
    /// Index first is 0; Index second is 1
    unsigned getPackingSrcIndex();
    unsigned getPackingDstIndex();
    /// SrcDims.size() == 1; DstDims.size() >= 1
    ArrayRef<uint64_t> getPackingSrcDims();
    ArrayRef<uint64_t> getPackingDstDims();
  }];
}
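
To illustrate the helpers' semantics based on the definition above (src is whichever side holds the single index):

// e.g. for #linalg.packing_map<[3] -> [2, 4]>:
//   getPackingSrcIndex() == 0, getPackingDstIndex() == 1
//   getPackingSrcDims() == [3], getPackingDstDims() == [2, 4]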

The linalg.packed_matmul Op

linalg.packed_matmul is defined as a structured named op implementing the LinalgOp interface. It takes 2 inputs (A, B) and 1 init (C), and the body computes u5(u1(c) + u2(u3(a) * u4(b))). It also requires 3 attrs: m_packing, n_packing, and k_packing. Each of them is a list of linalg.packing_map attrs covering one kind of matrix dimension, and each attr in the list represents a mapping between two sets of indices from matrices A, B, and C. We define the mapping order as: m_packing A->C, n_packing B->C, k_packing A->B.

For the VNNI packed matmul example above, we can express it as:

#m_packing_vnni = [#linalg.packing_map<[0] -> [0]>, #linalg.packing_map<[2] -> [2]>]
#n_packing_vnni = [#linalg.packing_map<[0] -> [1]>, #linalg.packing_map<[3] -> [3]>]
#k_packing_vnni = [#linalg.packing_map<[1] -> [1]>, #linalg.packing_map<[3] -> [2, 4]>]
func.func @packed_matmul(%A: tensor<2x8x32x32xbf16>, %B: tensor<4x8x16x32x2xbf16>, 
                      %C: tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16> {
  %0 = linalg.packed_matmul 
        {m_packing = #m_packing_vnni, n_packing = #n_packing_vnni, k_packing = #k_packing_vnni}
        ins(%A, %B : tensor<2x8x32x32xbf16>, tensor<4x8x16x32x2xbf16>)
        outs(%C : tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16>
  return %0 : tensor<2x4x32x32xbf16>
}
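
The same op can also express the mmt4d-style packing (MNmn += Mkmk * NKnk) from the motivation. A rough sketch, with illustrative shapes chosen here rather than taken from the PoC:

#m_packing_mmt4d = [#linalg.packing_map<[0] -> [0]>, #linalg.packing_map<[2] -> [2]>]
#n_packing_mmt4d = [#linalg.packing_map<[0] -> [1]>, #linalg.packing_map<[2] -> [3]>]
#k_packing_mmt4d = [#linalg.packing_map<[1] -> [1]>, #linalg.packing_map<[3] -> [3]>]
func.func @mmt4d_style_matmul(%A: tensor<2x8x32x32xf32>, %B: tensor<4x8x32x32xf32>,
                              %C: tensor<2x4x32x32xf32>) -> tensor<2x4x32x32xf32> {
  %0 = linalg.packed_matmul
        {m_packing = #m_packing_mmt4d, n_packing = #n_packing_mmt4d, k_packing = #k_packing_mmt4d}
        ins(%A, %B : tensor<2x8x32x32xf32>, tensor<4x8x32x32xf32>)
        outs(%C : tensor<2x4x32x32xf32>) -> tensor<2x4x32x32xf32>
  return %0 : tensor<2x4x32x32xf32>
}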

How to verify

Since the mapping is explicit, these are the criteria to verify this op:

  1. linalg.packed_matmul inputs/outputs must be ranked
  2. the dims listed across all linalg.packing_map attrs must form a permutation of all dims of A, B, and C
  3. all mapped dims must match in size, i.e. in a 1->N index-set mapping the product of the N dst dim sizes must equal the single src dim size (see the sketch after this list)
  4. dynamic dims are ignored when matching sizes (this part needs discussion)
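
For criterion 3, a hypothetical ill-formed variant of the VNNI example above would be rejected, because A's packed K dim has size 32 while B's K0 and K1 dims only multiply to 8 * 2 = 16 for the k_packing entry #linalg.packing_map<[3] -> [2, 4]>:

// Invalid: size(A dim 3) = 32, but size(B dim 2) * size(B dim 4) = 8 * 2 = 16
%0 = linalg.packed_matmul
      {m_packing = #m_packing_vnni, n_packing = #n_packing_vnni, k_packing = #k_packing_vnni}
      ins(%A, %B : tensor<2x8x32x32xbf16>, tensor<4x8x8x32x2xbf16>)
      outs(%C : tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16>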

How to getIteratorTypesArray

Iterations represented by m_packing and n_packing are considered parallel, and iterations represented by k_packing are considered reduction.

Take the VNNI example above:
M is mapped as A(0) -> C(0); A(2) -> C(2)
N is mapped as B(0) -> C(1); B(3) -> C(3)
K is mapped as A(1) -> B(1); A(3) -> B(2, 4)

The loop iterator types are:
d0: C(0): parallel
d1: C(2): parallel
d2: C(1): parallel
d3: C(3): parallel
d4: B(1): reduction
d5: B(2): reduction
d6: B(4): reduction

How to getIndexingMaps

Each packing_map determines how loop dims are added to the indexing maps. For the packing_map dst, the AffineExprs for its indices are the AffineDimExprs representing the loop iterators; for the packing_map src, the AffineExpr for its single index is a compound expression computed from the dst AffineDimExprs and the dst dim sizes.

// Pseudocode for one 1->N packing_map; `curr` is the running loop-dim counter
// shared across all packing_maps of the op.
compound = 0;
for (dstDim : dstDimsArr) {
    currDimExpr = AffineDimExpr(curr++);
    dstExprs[dstDim] = currDimExpr;
    // Horner-style linearization of the dst dims into the single src dim.
    compound = compound * getDimSize(dstDim) + currDimExpr;
}
srcExprs[srcDim] = compound;

// e.g. for a single linalg.packing_map<[0] -> [0, 1, 2]> with shapes src[64] -> dst[8, 4, 2],
// the indexing maps are src[(d0 * 4 + d1) * 2 + d2], dst[d0, d1, d2]
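
Applying this procedure to the #linalg.packing_map<[3] -> [2, 4]> entry of #k_packing_vnni from the example above (dst dims of sizes 16 and 2, bound to loop dims d5 and d6) gives:

// dst exprs: B(2) -> d5, B(4) -> d6
// src expr:  A(3) -> d5 * 2 + d6
// which matches the A operand's affine map in the generalized form below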

The generalized Op

This op offers more levels of representation for matmuls that one wants to express as named ops, while still being lowerable to linalg.generic using -linalg-generalize-named-ops.

Take the VNNI packed matmul example above:

#map = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d4, d1, d5 * 2 + d6)>
#map1 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d2, d4, d5, d3, d6)>
#map2 = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d2, d1, d3)>
module {
  func.func @packed_matmul(%arg0: tensor<2x8x32x32xbf16>, %arg1: tensor<4x8x16x32x2xbf16>, 
                           %arg2: tensor<2x4x32x32xbf16>) -> tensor<2x4x32x32xbf16> {
    %0 = linalg.generic {
      indexing_maps = [#map, #map1, #map2], 
      iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction", "reduction"]} 
    ins(%arg0, %arg1 : tensor<2x8x32x32xbf16>, tensor<4x8x16x32x2xbf16>) 
    outs(%arg2 : tensor<2x4x32x32xbf16>) { 
    ^bb0(%in: bf16, %in_0: bf16, %out: bf16):
      %1 = arith.mulf %in, %in_0 : bf16
      %2 = arith.addf %out, %1 : bf16
      linalg.yield %2 : bf16
    } -> tensor<2x4x32x32xbf16>
    return %0 : tensor<2x4x32x32xbf16>
  }
}

Conclusion

Contributions to this design are welcome; code has been implemented locally to test it. But before a formal PR, I think a lot of decisions should be discussed first: should the mapping order (e.g. A->C, B->C) be explicit on the attr? How to better accommodate batch matmul and batch-reduce matmul in the future? How to better deal with dynamic dims? Etc.

adam-smnk commented 2 months ago

Looks pretty interesting 👍 Explosion of Linalg operations like convolution and matmul flavors is definitely something that is worth addressing.

The main question for me right now is: what is the advantage of this custom m/n/k_packing mapping vs simply adding indexing_maps like in linalg.generic? IMO, the proposed mapping is a bit confusing mostly because maps are per dimension but their indexing is with respect to operands. The same information is equally captured with affine_maps on which I can also use existing affine utilities to perform more generic analysis.

I have not checked the downstream PoC yet but I think the biggest strength of such an op would be to provide helpers that can answer common questions, like whether outer/inner dims are transposed, maybe some information about packing, etc. Ideally, it should be trivial for a user to determine what type of matmul it is, for example mmt4d, by asking the packed_matmul's API a few simple questions.

rengolin commented 2 months ago

The whole named ops design banks on a simple premise: do not create more ops for different behaviour, learn how to parse the DAG of def-use chains that represent a pattern.

Linalg, however, has added more and more named ops, going against that design principle. For example, linalg.matmul_transpose_a and linalg.matmul_transpose_b. These can easily be represented as the following chains:
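
For instance, a rough sketch of linalg.matmul_transpose_b as such a chain (shapes picked arbitrarily for illustration):

%Bt_init = tensor.empty() : tensor<8x16xf32>
%Bt = linalg.transpose ins(%B : tensor<16x8xf32>)
                       outs(%Bt_init : tensor<8x16xf32>) permutation = [1, 0]
%0 = linalg.matmul ins(%A, %Bt : tensor<4x8xf32>, tensor<8x16xf32>)
                   outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>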

What you propose can also be represented as a similar chain of pack/transpose ops feeding a plain matmul or generic.

There is no need to create a new indexing map or a new op.

adam-smnk commented 2 months ago

There is no need to create a new indexing map or a new op.

In principle I agree, and that's the direction we're following. However, the presence of these ops upstream suggests not everybody follows the same approach. The idea to at least unify these existing op variants has been floating around for a while. Perhaps this could be a way to replace all these linalg.(batch_)matmul ops with sth simpler, as I think simply removing them won't be welcomed.

adam-smnk commented 2 months ago

However, there is great benefit in following @rengolin's approach.

Using more "fundamental" operations like linalg.matmul, linalg.generic, tensor.(un)pack etc. allows to leverage existing tooling and transforms. Many upstream patterns do not even consider these specialized ops like linalg.batch_matmul_transpose_b. This presents really large cost of integrating any new operations into the existing ecosystem. Long term, it should be much cheaper and easier to add a few DAG matchers (downstream) than to pay the cost of introducing new operation.

rengolin commented 2 months ago

The idea to at least unify these existing op variants has been floating around for a while. Perhaps this could be a way to replace all these linalg.(batch_)matmul ops with sth simpler, as I think simply removing them won't be welcomed.

To remove the variations we need to add a pattern match system that allows one to match a DAG of ops and replace them with another DAG of ops. Without this, current downstream implementations that rely on those ops being unique will break. With this, the migration cost becomes acceptable and we can convince them with a PR.

The main items for this to happen are:

  1. A named op DAG matcher (similar to our generic matcher) but for DAGs.
  2. A DAG-to-DAG replacement, which in itself is already needed when replacing more than just one op with another op (@kurapov-peter this is what we discussed). This is not trivial, as it needs to account for all inputs/outputs and how to select them for the new op.
  3. A sequence of upstream matchers that replace the specialized named ops.
  4. A matchAndRewrite function (from an OpRewritePattern) that can match a DAG, not just a single op.
  5. Replace all existing specialized named op rewrite patterns upstream with their DAG versions.

Note: This is only necessary for downstream implementations that are already using them. You can simply not rely on them and pattern match against DAGs locally and not have to care about this problem.

ZhennanQin commented 2 months ago

To remove the variations we need to add a pattern match system that allows one to match a DAG of ops and replace them with another DAG of ops.

We have another request based on your proposal: mark a pattern as atomic so that no other ops can be inserted in between. The reason we want to introduce more and more op variants is that we want an atomic op that won't be broken apart. Take linalg.matmul_transpose_b as an example: linalg.matmul(A, linalg.pack(B), C) will be affected by the const-folding pass, which may fold linalg.pack(B) away. But our hardware prefers to execute linalg.matmul(A, linalg.pack(B), C) as a whole, so that would be a problem. Of course you can give that transpose op an attribute to avoid folding on it, but as more and more passes are added, this becomes hard to track. We need a mechanism to mark a pattern as atomic: for most transformation passes, you then need to pattern match the whole atomic DAG instead of any subgraph of it.

rengolin commented 2 months ago

An attribute marking something as atomic will need to link to the atom, so it must have some kind of identifier. For example, with two matmul ops and some element-wise ops in between: which matmul are the element-wise ops an atom of?

The discussion on the round table at EuroLLVM was that a grouping op would be helpful here. However, after speaking with people afterwards, we agree that a catch-all grouping op would not be ideal, because we'd have to give it contrasting semantics.

We need one grouping that can help us fuse to match patterns / kernels, another that can help us tile complex patterns (such as softmax), another that can help us separate code between devices (of the same type or heterogeneous), etc.

For now, just having the DAG patterns and not running any canonicalizations in between should be enough. Once we start crossing canonical boundaries or whole-graph transformations (such as one-shot bufferize, vectorize, etc), we'll need a grouping mechanism of some form.