almavi closed this 2 weeks ago
Short answer: Yes, we think so.
Long answer: Our focus is on enabling our software stack for M4 (incl. SME). Like many other developers, we have 512-bit SVE code ready to go; it dates back to A64FX. We tested only the accumulation in the Z registers because, had it performed well, a slight tweak to our GEMM code would have given us a free speedup.
But: Given the poor performance, we are limiting our attention to MOPA and Neon. Please feel free to submit a PR if you'd like to add another micro to the repo. We'll return to the SME vector capabilities once we are done with the GEMMs. FYI: We are actively working on this. Right now it's the mundane work of encoding the necessary SME instructions, detecting M4, and getting our JITters to run on iOS. I understand that our small homepage does not show the full scope of our project. Hopefully this will become clearer once the project is further along.
It is likely that maximum FMLA (not outer product) performance is achieved using the multi-register FMA instructions that target the ZA array, instead of Z registers. For instance:
https://developer.arm.com/documentation/ddi0602/2024-03/SME-Instructions/FMLA--multiple-and-single-vector---Multi-vector-floating-point-fused-multiply-add-by-vector-
Most likely, the four-register version of FMLA would offer the best performance, especially compared with regular SVE FMLA instructions targeting Z registers.
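To make the comparison concrete, here is a rough Python sketch of how much FP32 work one instruction of each kind covers. It assumes a 512-bit streaming vector length (the figure mentioned earlier in the thread). The function names and the `kind` labels are made up for illustration. It only counts per-instruction work and says nothing about issue rate or latency, so it is not a performance prediction. The small `fmla_za_vgx4` model follows the multiple-and-single-vector form linked above (four ZA vectors each accumulate an elementwise product with one shared Z vector). Fused rounding is not modelled.

```python
SVL_BITS = 512                 # assumption: streaming vector length in bits
FP32_BITS = 32
LANES = SVL_BITS // FP32_BITS  # FP32 elements per vector


def flops_per_instruction(kind):
    """FP32 operations (one multiply + one add per lane) covered by one instruction.

    The `kind` labels are hypothetical names used only in this sketch.
    """
    if kind == "fmla_z":        # regular SVE FMLA, one Z register accumulator
        return 2 * LANES
    if kind == "fmla_za_vgx2":  # multi-vector FMLA into ZA, two-vector group
        return 2 * LANES * 2
    if kind == "fmla_za_vgx4":  # multi-vector FMLA into ZA, four-vector group
        return 2 * LANES * 4
    if kind == "fmopa":         # outer product into a LANES x LANES ZA tile
        return 2 * LANES * LANES
    raise ValueError(f"unknown instruction kind: {kind}")


def fmla_za_vgx4(za_group, zn_group, zm):
    """Reference model: za_group[i][lane] += zn_group[i][lane] * zm[lane].

    za_group and zn_group each hold four vectors; zm is a single vector.
    Plain Python floats are used, so the single rounding of a real fused
    multiply-add is not reproduced.
    """
    return [
        [acc + a * b for acc, a, b in zip(za, zn, zm)]
        for za, zn in zip(za_group, zn_group)
    ]


if __name__ == "__main__":
    for kind in ("fmla_z", "fmla_za_vgx2", "fmla_za_vgx4", "fmopa"):
        print(kind, flops_per_instruction(kind))
```

Under these assumptions, one four-register FMLA covers four times the work of a Z-register FMLA, while an outer product still covers considerably more per instruction. Whether that translates into throughput depends on how the hardware issues each instruction, which has to be measured.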