iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Conversion of einsum-like operations into matmul-like operations. #13528

Open MaheshRavishankar opened 1 year ago

MaheshRavishankar commented 1 year ago

For getting reasonable performance on the current code-generation paths, einsum-like operations need to be converted into named matmul-like operations.

Since other front-ends might need the same transformation, it may be worth doing it on Linalg itself. This issue documents some thoughts on how that could be done at the Linalg level. The description below considers lowering einsum-like operations into batch matmul, which is the most general of the matmul-like operations.

As an example, consider the following einsum-like operation:

#map_lhs = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7) -> (d0, d1, d5, d3, d2, d7)>
#map_rhs = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7) -> (d0, d4, d3, d6, d2, d7)>
#map_out = affine_map<(d0, d1, d2, d3, d4, d5, d6, d7) -> (d0, d1, d2, d4, d5, d6)>
%0 = linalg.generic {
    indexing_maps = [#map_lhs, #map_rhs, #map_out],
    iterator_types = ["parallel", "parallel", "parallel", "reduction", "parallel", "parallel", "parallel", "reduction"]}
    ins(%lhs, %rhs : tensor<?x?x?x?x?x?xf32>, tensor<?x?x?x?x?x?xf32>)
    outs(%init : tensor<?x?x?x?x?x?xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %1 = arith.mulf %b0, %b1 : f32
    %2 = arith.addf %b2, %1 : f32
    linalg.yield %2 : f32
  } -> tensor<?x?x?x?x?x?xf32>

The first thing to do is characterize the dimensions. Batch matmul has four types of dimensions: the batch dimensions, the M dimensions (rows of the LHS and result), the N dimensions (columns of the RHS and result), and the K dimensions (the reduction).

Multiple dimensions of the original op collapse into these dimensions of the final batch matmul. The characterization of the dimensions of the original op can be done using these rules:

- A parallel dimension that appears in the indexing maps of the LHS, the RHS, and the result is a batch dimension.
- A parallel dimension that appears in the indexing maps of the LHS and the result, but not the RHS, is an M dimension.
- A parallel dimension that appears in the indexing maps of the RHS and the result, but not the LHS, is an N dimension.
- A reduction dimension that appears in the indexing maps of the LHS and the RHS, but not the result, is a K dimension.

For the example above this gives batch = {d0, d2}, M = {d1, d5}, N = {d4, d6}, and K = {d3, d7}.
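As a sketch only (the function and names below are hypothetical illustrations, not the actual MLIR/IREE implementation), the classification rules can be expressed over the result tuples of the indexing maps, with loop dimensions referred to by index:

```python
# Hypothetical sketch: classify each loop dimension of an einsum-like
# linalg.generic for lowering to batch_matmul. lhs/rhs/out are the result
# tuples of the indexing maps; reduction dims are those absent from `out`.
def classify_dims(lhs, rhs, out):
    lhs_s, rhs_s, out_s = set(lhs), set(rhs), set(out)
    dims = {"batch": [], "m": [], "n": [], "k": []}
    for d in sorted(lhs_s | rhs_s | out_s):
        if d in lhs_s and d in rhs_s and d in out_s:
            dims["batch"].append(d)  # in LHS, RHS, and result
        elif d in lhs_s and d in out_s:
            dims["m"].append(d)      # in LHS and result only
        elif d in rhs_s and d in out_s:
            dims["n"].append(d)      # in RHS and result only
        elif d in lhs_s and d in rhs_s:
            dims["k"].append(d)      # reduction: in LHS and RHS only
        else:
            raise ValueError(f"d{d} does not fit any matmul dimension kind")
    return dims

# Indexing maps from the example above.
lhs = (0, 1, 5, 3, 2, 7)
rhs = (0, 4, 3, 6, 2, 7)
out = (0, 1, 2, 4, 5, 6)
print(classify_dims(lhs, rhs, out))
# {'batch': [0, 2], 'm': [1, 5], 'n': [4, 6], 'k': [3, 7]}
```

Running this on the example recovers the classification stated above: d0 and d2 are batch dimensions, d1 and d5 are M, d4 and d6 are N, and d3 and d7 are K.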

A further constraint is that the body of the operation has to be a multiply + add.

A further generalization of these constraints is needed to handle the case where one or more operands is broadcast. To handle those, the checks for the dimensions need to happen in this order

The key is

Once we have a classification of the dimensions, we need to introduce transposes to get the dimensions of each operand into the following order:

- LHS: (batch dimensions, M dimensions, K dimensions)
- RHS: (batch dimensions, K dimensions, N dimensions)
- Result: (batch dimensions, M dimensions, N dimensions)

Reshapes then collapse each of the sets above into a single dimension, yielding the 3-D operands of a batch matmul.
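To illustrate the transpose step (again a hypothetical sketch, not the actual implementation), the permutation for each operand can be derived from the classification; after transposing, each dimension group occupies contiguous positions and can be collapsed by a reshape:

```python
# Hypothetical sketch: compute the transpose permutation that brings an
# operand's dimensions into batch_matmul order (LHS: batch, M, K;
# RHS: batch, K, N). `operand_map` is the operand's indexing-map result
# tuple; `ordered_groups` lists the loop-dimension groups in target order.
def transpose_perm(operand_map, ordered_groups):
    # Loop dims in target order, restricted to those the operand carries.
    target = [d for group in ordered_groups for d in group if d in operand_map]
    # Position of each target loop dim within the operand's current map.
    return [operand_map.index(d) for d in target]

batch, m, n, k = [0, 2], [1, 5], [4, 6], [3, 7]
lhs = (0, 1, 5, 3, 2, 7)
rhs = (0, 4, 3, 6, 2, 7)

# LHS transposed to (d0, d2 | d1, d5 | d3, d7):
print(transpose_perm(lhs, [batch, m, k]))  # [0, 4, 1, 2, 3, 5]
# RHS transposed to (d0, d2 | d3, d7 | d4, d6):
print(transpose_perm(rhs, [batch, k, n]))  # [0, 4, 2, 5, 1, 3]
```

After these transposes, collapsing the index groups (0, 1), (2, 3), and (4, 5) of each operand produces the (B, M, K) x (B, K, N) -> (B, M, N) shape that linalg.batch_matmul expects.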

Adding this pattern to Linalg (it could be done in MLIR) will allow deduplicating the patterns added in #13519 and #13468.

MaheshRavishankar commented 1 year ago

@allieculp this is the description for @NatashaKnk to make progress on the einsum -> batch_matmul conversion. Please add this to the appropriate sprints.

MaheshRavishankar commented 1 year ago

cc @silvasean and @rsuderman to verify my logic above.

silvasean commented 1 year ago

This seems right to me.

I wonder if part of this issue should be to generalize all linalg.matmul/linalg.batch_matmul ops to linalg.generic, let them fuse with the reshapes/transposes around them (and possibly other ops, like reductions), and then finally canonicalize them back into linalg.matmul/linalg.batch_matmul. Do you think that is useful or in scope? I feel it could give us more performance stability across the different ways users can write the same thing. Cases like https://github.com/openxla/iree/issues/12214 have an open-coded broadcast + batch_matmul that could benefit from being handled the same way as the corresponding einsum, had the user written it that way (in that case the broadcast + batch_matmul is equivalent to a regular matmul with a larger LHS preserved dimension).