The i8mm lowering for some `vector.contract` ops is currently functionally correct, but there is room for improvement performance-wise. Looking at the generated asm for an mmt4d with 2x2x8 innermost tile sizes, we get:
```
1470: 6e180483 mov v3.d[1], v4.d[0]
1474: 4e006204 tbl v4.16b, { v16.16b, v17.16b, v18.16b, v19.16b }, v0.16b
1478: 4e84a462 smmla v2.4s, v3.16b, v4.16b
147c: 6e024041 ext v1.16b, v2.16b, v2.16b, #0x8
```
What catches my attention is the `mov` instruction (especially the lane indexing from `1` to `0`), the `tbl`, and the `ext` instructions. This may not seem like a big deal, but the problem is really exacerbated when using larger tile sizes: we observed large sequences of `mov` and `ext` instructions all over the place.
We should investigate what is going on and try to fix the problem. My suspicion is that this [zero initialization and insertion](https://github.com/llvm/llvm-project/blob/aafed3408e7269c42f974189198a47eb6dd2fc84/mlir/lib/Dialect/ArmNeon/Transforms/LowerContractionToSMMLAPattern.cpp#L178-L185) for `vecmat` cases might be behind some of these instructions. We should check whether using `llvm.undef` instead fixes part of the problem.
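To make the `llvm.undef` idea concrete, here is a rough Python model of why the padding value should not matter functionally. It assumes the `vecmat` path widens a 1x8 LHS and a 1x2 accumulator to the 2x8 / 2x2 shapes that `smmla` expects and then keeps only row 0 of the result; the helper names `smmla` and `vecmat_row` are made up for illustration and are not taken from the pattern's code. Integer wraparound is not modelled.

```python
def smmla(acc, a, b):
    """Reference model of SMMLA on flat row-major lists.

    acc: 2x2 int32 accumulator (4 elements)
    a:   2x8 int8 LHS tile (16 elements)
    b:   2x8 int8 RHS tile (16 elements), used as its transpose (8x2)
    """
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(a[8 * i + k] * b[8 * j + k] for k in range(8))
    return out


def vecmat_row(acc_row, vec, b, pad):
    """Hypothetical vecmat lowering: pad the missing row with `pad`, keep row 0."""
    a = list(vec) + [pad] * 8          # widen 1x8 LHS to 2x8
    acc = list(acc_row) + [pad, pad]   # widen 1x2 accumulator to 2x2
    return smmla(acc, a, b)[:2]        # row 1 of the result is discarded


vec = list(range(1, 9))
rhs = [1] * 8 + [2] * 8
# The padding value only affects the discarded row, so any value works.
print(vecmat_row([10, 20], vec, rhs, pad=0) == vecmat_row([10, 20], vec, rhs, pad=77))
```

Under that assumption, row 0 of the result never depends on the padded row, so an undefined value would be as valid as zero there. Whether it actually removes the extra `mov`/`ext` instructions still has to be checked on the generated asm.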