chelini opened this issue 1 year ago
This IR looks similar to what we generate from `mlir-gen`, but the shapes are weird (5x5), where is that from?
The MHA softmax is a little different; we need both styles covered.
The other option is to use linalg fusion on tensors, but limit the pass to the softmax operations only, to avoid having to split the body of the generic later on before mapping to TPPs.
Softmax in libxsmm is lowered as an equation, and just calling the kernels one after another is very close to optimal. I would not create complicated machinery that is specific to certain complex patterns unless the benefit was very large and there was no other way.
Softmax will eventually be lowered as an equation, which is the right way long term, so we can live with most of the performance now and the rest later.
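For reference, the kernel sequence amounts to the standard numerically stable softmax, roughly one primitive per step (reduce-max, subtract, exp, reduce-add, divide):

$$
m = \max_j x_j, \qquad e_i = \exp(x_i - m), \qquad s = \sum_j e_j, \qquad \operatorname{softmax}(x)_i = \frac{e_i}{s}
$$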
Yes, calling the kernels one after the other would be the plan. Still, we must either fuse along 64 and 8 to extract 2-D tensors, or materialize the two outermost dimensions for each linalg op and replace the body with a TPP operation (rough sketch below). Do you have an example of the IR generated by `mlir-gen`? 5 is an arbitrary number for the sequence length; it does not matter in this context.
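For concreteness, here is a rough sketch (illustrative only, not IR produced by our pipeline) of softmax on one extracted 2-D [T, T] slice, decomposed into one linalg op per step so that each body can be replaced by a single TPP. It assumes T = 5 and recent upstream op spellings (`tensor.empty`, `arith.maximumf`); older MLIR versions name these differently:

```mlir
#map_id  = affine_map<(d0, d1) -> (d0, d1)>
#map_row = affine_map<(d0, d1) -> (d0)>

func.func @softmax_2d(%arg0: tensor<5x5xf32>) -> tensor<5x5xf32> {
  // -FLT_MAX and 0.0 as reduction identities.
  %cst_min  = arith.constant -3.40282347E+38 : f32
  %cst_zero = arith.constant 0.000000e+00 : f32
  %red_init   = tensor.empty() : tensor<5xf32>
  %ewise_init = tensor.empty() : tensor<5x5xf32>

  // 1. Row-wise max (reduction along d1).
  %max_acc = linalg.fill ins(%cst_min : f32) outs(%red_init : tensor<5xf32>) -> tensor<5xf32>
  %max = linalg.generic
      {indexing_maps = [#map_id, #map_row],
       iterator_types = ["parallel", "reduction"]}
      ins(%arg0 : tensor<5x5xf32>) outs(%max_acc : tensor<5xf32>) {
  ^bb0(%in: f32, %acc: f32):
    %mx = arith.maximumf %in, %acc : f32
    linalg.yield %mx : f32
  } -> tensor<5xf32>

  // 2. Subtract the row max (binary sub with broadcast along d1).
  %sub = linalg.generic
      {indexing_maps = [#map_id, #map_row, #map_id],
       iterator_types = ["parallel", "parallel"]}
      ins(%arg0, %max : tensor<5x5xf32>, tensor<5xf32>) outs(%ewise_init : tensor<5x5xf32>) {
  ^bb0(%in: f32, %m: f32, %out: f32):
    %sb = arith.subf %in, %m : f32
    linalg.yield %sb : f32
  } -> tensor<5x5xf32>

  // 3. Exponentiate (unary exp).
  %exp = linalg.generic
      {indexing_maps = [#map_id, #map_id],
       iterator_types = ["parallel", "parallel"]}
      ins(%sub : tensor<5x5xf32>) outs(%ewise_init : tensor<5x5xf32>) {
  ^bb0(%in: f32, %out: f32):
    %ex = math.exp %in : f32
    linalg.yield %ex : f32
  } -> tensor<5x5xf32>

  // 4. Row-wise sum (reduction along d1).
  %sum_acc = linalg.fill ins(%cst_zero : f32) outs(%red_init : tensor<5xf32>) -> tensor<5xf32>
  %sum = linalg.generic
      {indexing_maps = [#map_id, #map_row],
       iterator_types = ["parallel", "reduction"]}
      ins(%exp : tensor<5x5xf32>) outs(%sum_acc : tensor<5xf32>) {
  ^bb0(%in: f32, %acc: f32):
    %ad = arith.addf %in, %acc : f32
    linalg.yield %ad : f32
  } -> tensor<5xf32>

  // 5. Normalize (binary div with broadcast along d1).
  %res = linalg.generic
      {indexing_maps = [#map_id, #map_row, #map_id],
       iterator_types = ["parallel", "parallel"]}
      ins(%exp, %sum : tensor<5x5xf32>, tensor<5xf32>) outs(%ewise_init : tensor<5x5xf32>) {
  ^bb0(%in: f32, %s: f32, %out: f32):
    %dv = arith.divf %in, %s : f32
    linalg.yield %dv : f32
  } -> tensor<5x5xf32>

  return %res : tensor<5x5xf32>
}
```

Each generic body then contains a single scalar op, which should make the 1:1 replacement of the body with a TPP straightforward.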
> Do you have an example of the IR generated by `mlir-gen`?

Yup, just run `mlir-gen` and you'll see.
Also, just to be clear, this is really low priority. Finding the right shapes for MHA and finally getting TPP on tensors in the main pipeline are still the most important tasks right now.
To enable running softmax with TPPs we need more operations:
The IR below shows a Softmax example in Linalg, extracted from a self-attention layer. The lowering is: TF dialect -> StableHLO -> Linalg IR. To lower from the TF dialect to StableHLO we use `tf-opt`, while from StableHLO to Linalg we use the IREE compiler and print after `iree-stablehlo-to-iree-input`. The dimensions of `arg0` are `[B, heads, T, T]`, where B is the batch dimension, heads is the number of heads, and T is the sequence length. Related to #414.
TBD: