LongshengDu opened 2 months ago
Looks pretty interesting 👍 The explosion of Linalg operations like convolution and matmul flavors is definitely something worth addressing.
The main question for me right now is: what is the advantage of this custom m/n/k_packing mapping vs simply adding indexing_maps like in `linalg.generic`?
IMO, the proposed mapping is a bit confusing, mostly because maps are per dimension but their indexing is with respect to operands. The same information is equally captured with `affine_map`s, on which I can also use existing affine utilities to perform more generic analysis.
I have not checked the downstream PoC yet, but I think the biggest strength of such an op would be to provide helpers that can answer common questions like whether outer/inner dims are transposed, maybe some information about packing, etc.
Ideally, it should be trivial for a user to determine what type of matmul it is, for example `mmt4d`, by asking the `packed_matmul`'s API a few simple questions.
The whole named ops design banks on a simple premise: do not create more ops for different behaviour, learn how to parse the DAG of def-use chains that represent a pattern.
Linalg, however, has added more and more named ops, going against that design principle. For example, `linalg.matmul_transpose_a` and `linalg.matmul_transpose_b`. These can easily be represented as the following chains:

```
linalg.matmul(linalg.transpose(A), B, C)
linalg.matmul(A, linalg.transpose(B), C)
```
What you propose can also be represented as:

```
linalg.matmul(A, linalg.pack(B), C)
```

with the pack being a VNNI transpose and the second operand having a higher rank (`rank(B) = rank(A) + 1`). There is no need to create a new indexing map or a new op.
> There is no need to create a new indexing map or a new op.
In principle I agree, and that's the direction we're following. However, the presence of these ops upstream suggests not everybody follows the same approach.
The idea to at least unify these existing op variants has been floating around for a while. Perhaps this could be a way to replace all these `linalg.(batch_)matmul` ops with something simpler, as I think simply removing them won't be welcomed.
However, there is great benefit in following @rengolin's approach.
Using more "fundamental" operations like `linalg.matmul`, `linalg.generic`, `tensor.(un)pack`, etc. allows us to leverage existing tooling and transforms. Many upstream patterns do not even consider specialized ops like `linalg.batch_matmul_transpose_b`. This makes the cost of integrating any new operation into the existing ecosystem really large.
Long term, it should be much cheaper and easier to add a few DAG matchers (downstream) than to pay the cost of introducing a new operation.
> The idea to at least unify these existing op variants has been floating around for a while. Perhaps this could be a way to replace all these `linalg.(batch_)matmul` ops with something simpler, as I think simply removing them won't be welcomed.
To remove the variations we need to add a pattern match system that allows one to match a DAG of ops and replace them with another DAG of ops. Without this, current downstream implementations that rely on those ops being unique will break. With this, the migration cost becomes acceptable and we can convince them with a PR.
The main items for this to happen are:

- A `matchAndRewrite` function (from an `OpRewritePattern`) that can match a DAG, not just a single op.

Note: this is only necessary for downstream implementations that are already using these ops. You can simply not rely on them, pattern match against DAGs locally, and not have to care about this problem.
> To remove the variations we need to add a pattern match system that allows one to match a DAG of ops and replace them with another DAG of ops.
We have another request based on your proposal: mark a pattern as atomic, which disallows inserting other ops in between. The reason we want to introduce more and more op variants is that we want each to be an atomic op that won't be broken apart. Take `linalg.matmul_transpose_b` as an example: `linalg.matmul(A, linalg.pack(B), C)` will be impacted by a const-folding pass, which may fold the `linalg.pack(B)` away. But our hardware prefers to execute `linalg.matmul(A, linalg.pack(B), C)` as a whole, so that would be a problem. Of course, you can give that transpose op an attribute to avoid folding on it, but as more and more passes are added, it's hard to track this. We need a mechanism to mark a pattern atomic: for most transformation passes, you would need to pattern match the whole atomic DAG instead of any subgraph of it.
An attribute marking something as atomic will need to link with the atom, so it must have some kind of identifier. For example, consider two matmul ops with some element-wise ops in between: to which matmul do those element-wise ops belong?
The discussion on the round table at EuroLLVM was that a grouping op would be helpful here. However, after speaking with people afterwards, we agree that a catch-all grouping op would not be ideal, because we'd have to give it contrasting semantics.
We need one grouping that can help us fuse to match patterns / kernels, another that can help us tile complex patterns (such as softmax), another that can help us separate code between devices (of the same type or heterogeneous), etc.
For now, just having the DAG patterns and not running any canonicalizations in between should be enough. Once we start crossing canonical boundaries or whole-graph transformations (such as one-shot bufferize, vectorize, etc), we'll need a grouping mechanism of some form.
Motivation
The current linalg dialect only supports fixed input/output shapes for structured named matmul ops; if we want to define different packings for a matmul op in the existing framework, we have to add each one of them separately. Notably, there are a lot of packing methods for matmuls, and many of them are hardware-related (BFMMLA, VNNI); even some simple packings require more than a few ops to cover. For example, `linalg.mmt4d` covers (MNmn += Mkmk * NKnk), but for (MNmn += Mkmk * NKkn) there is no named op. Fundamentally, we also want to avoid adding a separate op for every combination.
Thus, this post presents a new, flexible `linalg.packed_matmul` op that uses a `linalg.packing_map` attr to describe different matmul packings, helping developers use named ops to represent matmuls to their needs.
Background
For a simple matmul with input/output shape (M,N) += (M,K) * (K,N), with dims A(0,1), B(0,1), C(0,1):

- M is mapped as A(0) -> C(0)
- N is mapped as B(1) -> C(1)
- K is mapped as A(1) -> B(0)
Three loop iterators are required to carry out the calculation:
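As an illustration (not from the original post), the three iterators map onto a naive loop nest; a minimal Python sketch:

```python
def matmul(A, B, C):
    # Naive (M,N) += (M,K) * (K,N) with three loop iterators.
    M, K, N = len(A), len(B), len(B[0])
    for m in range(M):          # parallel:  A(0) -> C(0)
        for n in range(N):      # parallel:  B(1) -> C(1)
            for k in range(K):  # reduction: A(1) -> B(0)
                C[m][n] += A[m][k] * B[k][n]
    return C

A = [[1, 2], [3, 4]]   # (M,K) = (2,2)
B = [[5, 6], [7, 8]]   # (K,N) = (2,2)
C = [[0, 0], [0, 0]]
matmul(A, B, C)
# C == [[19, 22], [43, 50]]
```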
For a packed matmul, it is known that the 3 dimension kinds M, N, K are present inside the input/output shapes, and once the mapping between the 3 kinds of dims is determined, the calculation is fully specified.
For example, a VNNI packed matmul can be represented as C(M,N,M0,N0) += A(M,K,M0,K0*K1) * B(N,K,K0,N0,K1). So for dims A(0,1,2,3), B(0,1,2,3,4), C(0,1,2,3):

- M is mapped as A(0) -> C(0); A(2) -> C(2)
- N is mapped as B(0) -> C(1); B(3) -> C(3)
- K is mapped as A(1) -> B(1); A(3) -> B(2, 4)
In this case, seven loop iterators are required to carry out the calculation, which can be represented as:
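To make the seven iterators concrete, here is a hedged Python sketch (the shapes and the VNNI factor K1 = 2 are illustrative; the tiny example at the bottom stores a plain (2,4) x (4,2) matmul in the packed layout):

```python
def packed_matmul_vnni(A, B, C, K1=2):
    # C(M,N,M0,N0) += A(M,K,M0,K0*K1) * B(N,K,K0,N0,K1)
    # Seven loop iterators: M, N, M0, N0 parallel; K, K0, K1 reduction.
    M, K, M0 = len(A), len(A[0]), len(A[0][0])
    N, K0, N0 = len(B), len(B[0][0]), len(B[0][0][0])
    for m in range(M):
        for n in range(N):
            for m0 in range(M0):
                for n0 in range(N0):
                    for k in range(K):
                        for k0 in range(K0):
                            for k1 in range(K1):
                                C[m][n][m0][n0] += (A[m][k][m0][k0 * K1 + k1]
                                                    * B[n][k][k0][n0][k1])
    return C

# Tiny example: M = N = K = 1 and M0 = N0 = K0 = K1 = 2.
Aflat = [[1, 2, 3, 4], [5, 6, 7, 8]]
Bflat = [[1, 0], [0, 1], [1, 1], [2, 0]]
A = [[Aflat]]  # A(M,K,M0,K0*K1)
B = [[[[[Bflat[k0 * 2 + k1][n0] for k1 in range(2)]
        for n0 in range(2)] for k0 in range(2)]]]  # B(N,K,K0,N0,K1)
C = [[[[0, 0], [0, 0]]]]  # C(M,N,M0,N0)
packed_matmul_vnni(A, B, C)
# C[0][0] now holds the plain product of Aflat and Bflat: [[12, 5], [28, 13]]
```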
The linalg.packing_map Attr
E.g. in A[a,b] -> B[x,y,z], if dim [a] corresponds to dim [x] and dim [b] corresponds to packed dims [y,z], we can express this as `linalg.packing_map<[a] -> [x]>`, `linalg.packing_map<[b] -> [y,z]>`, where the dims mapping order is A -> B.

`linalg.packing_map` is defined as an attr; it requires 2 int64 array params to represent a mapping between 2 sets of sorted indices. The verifier must check that one of them contains only 1 index, since multi-dims to multi-dims mapping is not allowed. This defines a 1->N index set mapping: src is the single index, dst is the multi-dims index list. Some helpers are provided to get the mapping order (first<-second or first->second) and the mapping src/dst indices.

The linalg.packed_matmul Op
`linalg.packed_matmul` is defined as a structured named op with the LinalgOp interface. It requires 2 inputs (A, B) and 1 init (C); the body computes u5(u1(c) + u2(u3(a) * u4(b))). It also requires 3 attrs: m_packing, n_packing, k_packing. Each of them is a list of `linalg.packing_map` attrs mapping one kind of matrix dimension; each attr in the list represents a mapping between 2 sets of indices from matrices A, B, C. We define the mapping order as: m_packing A->C, n_packing B->C, k_packing A->B.

For the VNNI packed matmul example above, we can express it as:
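A sketch of what the op could look like for the VNNI example (the assembly syntax and tensor shapes here are hypothetical illustrations; only the attribute structure follows the proposal):

```mlir
// Hypothetical syntax; shapes chosen as M=4, K=8, M0=16, K0=16, K1=2,
// N=2, N0=4, so A is (M,K,M0,K0*K1), B is (N,K,K0,N0,K1), C is (M,N,M0,N0).
%0 = linalg.packed_matmul
       m_packing = [#linalg.packing_map<[0] -> [0]>,    // A(0) -> C(0)
                    #linalg.packing_map<[2] -> [2]>]    // A(2) -> C(2)
       n_packing = [#linalg.packing_map<[0] -> [1]>,    // B(0) -> C(1)
                    #linalg.packing_map<[3] -> [3]>]    // B(3) -> C(3)
       k_packing = [#linalg.packing_map<[1] -> [1]>,    // A(1) -> B(1)
                    #linalg.packing_map<[3] -> [2, 4]>] // A(3) -> B(2, 4)
       ins(%A, %B : tensor<4x8x16x32xbf16>, tensor<2x8x16x4x2xbf16>)
       outs(%C : tensor<4x2x16x4xf32>) -> tensor<4x2x16x4xf32>
```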
How to verify
Since the mapping is explicit, these are the criteria to verify this op:
How to getIteratorTypesArray
m_packing and n_packing represented iterations are considered `parallel`, and k_packing represented iterations are considered `reduction`.

Take the above example:

- M is mapped as A(0) -> C(0); A(2) -> C(2)
- N is mapped as B(0) -> C(1); B(3) -> C(3)
- K is mapped as A(1) -> B(1); A(3) -> B(2, 4)
Loop iterator types:

- d0: C(0): parallel
- d1: C(2): parallel
- d2: C(1): parallel
- d3: C(3): parallel
- d4: B(1): reduction
- d5: B(2): reduction
- d6: B(4): reduction
How to getIndexingMaps
Each packing_map represents how symbols are added to the indexing maps. For a packing_map dst, the AffineExprs for its indices are the AffineSymbols representing the iterators; for a packing_map src, the AffineExpr for its index is a compound expr computed from the dst AffineSymbols and the dim sizes.
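For the VNNI example, assuming an inner VNNI factor K1 = 2, the resulting maps could look like this (illustrative, following the iterator numbering above):

```mlir
// d0 = M, d1 = M0, d2 = N, d3 = N0 (parallel); d4 = K, d5 = K0, d6 = K1 (reduction).
#mapA = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d4, d1, d5 * 2 + d6)>
#mapB = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d2, d4, d5, d3, d6)>
#mapC = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d2, d1, d3)>
```

The src side of `A(3) -> B(2, 4)` becomes the compound expr `d5 * 2 + d6`, where 2 is the size of the K1 dim.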
The generalized Op
This op offers a higher level of representation for matmuls that should be expressed as named ops, while it can be lowered to `linalg.generic` using -linalg-generalize-named-ops.
Take the above VNNI packed matmul example:
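A hedged sketch of what the generalized form might look like (element types, shapes, and the exact body are illustrative; the compound `d5 * 2 + d6` expression assumes a VNNI factor of 2):

```mlir
#mapA = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d4, d1, d5 * 2 + d6)>
#mapB = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d2, d4, d5, d3, d6)>
#mapC = affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d2, d1, d3)>

%0 = linalg.generic
       {indexing_maps = [#mapA, #mapB, #mapC],
        iterator_types = ["parallel", "parallel", "parallel", "parallel",
                          "reduction", "reduction", "reduction"]}
       ins(%A, %B : tensor<4x8x16x32xbf16>, tensor<2x8x16x4x2xbf16>)
       outs(%C : tensor<4x2x16x4xf32>) {
  ^bb0(%a: bf16, %b: bf16, %c: f32):
    %ae = arith.extf %a : bf16 to f32
    %be = arith.extf %b : bf16 to f32
    %mul = arith.mulf %ae, %be : f32
    %acc = arith.addf %c, %mul : f32
    linalg.yield %acc : f32
} -> tensor<4x2x16x4xf32>
```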
Conclusion
Contributions to this design are welcome; code has been implemented locally to test it. But before a formal PR, I think a lot of decisions should be discussed first: should the mapping order (e.g. A->C, B->C) be explicit on the attr? How to better accommodate batch matmul and batch-reduce matmul in the future? How to better deal with dynamic dims? Etc.