Use multi-register FMLA targeting ZA

almavi commented 3 months ago

It is likely that maximum FMLA (not outer-product) performance is achieved with the multi-register FMLA instructions that target the ZA array instead of the Z registers. For instance:

https://developer.arm.com/documentation/ddi0602/2024-03/SME-Instructions/FMLA--multiple-and-single-vector---Multi-vector-floating-point-fused-multiply-add-by-vector-

Most likely, the four-register version of FMLA would offer the best performance, especially compared with regular SVE FMLA instructions targeting Z registers.
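To make the comparison concrete, here is a minimal scalar sketch of what the four-register form appears to do, based on my reading of the linked instruction page. It is a semantic model only, not SME code. The function names `fmla_za_vgx4` and `fmla_z` are hypothetical, the 512-bit streaming vector length is an assumption, and Python floats stand in for fp32 lanes without true fused rounding.

```python
# Scalar model of FMLA (multiple and single vector) with a group of four
# vectors, compared with a regular FMLA into a Z register.
# Assumptions: 512-bit streaming vector length, fp32 elements.

SVL_BITS = 512                # assumed streaming vector length
LANES = SVL_BITS // 32        # fp32 lanes per vector (16)
ZA_VECTORS = SVL_BITS // 8    # number of ZA array vectors (64)


def fmla_za_vgx4(za, wv, offs, zn_group, zm):
    """Model of: FMLA ZA.S[wv, offs, VGx4], {Zn1.S-Zn4.S}, Zm.S

    Each of the four source vectors is multiplied lane by lane with the
    single vector zm and accumulated into its own ZA vector.
    Returns the number of multiply-adds performed.
    """
    nreg = 4
    vstride = ZA_VECTORS // nreg      # distance between the four targets
    vec = (wv + offs) % vstride       # first target vector
    for r in range(nreg):
        row = za[vec]
        zn = zn_group[r]
        for e in range(LANES):
            row[e] += zn[e] * zm[e]
        vec += vstride
    return nreg * LANES


def fmla_z(zda, zn, zm):
    """Model of a regular FMLA accumulating into one Z register."""
    for e in range(LANES):
        zda[e] += zn[e] * zm[e]
    return LANES


if __name__ == "__main__":
    za = [[0.0] * LANES for _ in range(ZA_VECTORS)]
    zn_group = [[float(r + 1)] * LANES for r in range(4)]
    zm = [2.0] * LANES
    per_za_instr = fmla_za_vgx4(za, wv=3, offs=1, zn_group=zn_group, zm=zm)
    per_z_instr = fmla_z([0.0] * LANES, zn_group[0], zm)
    print(per_za_instr, per_z_instr)
```

Under these assumptions, one four-register instruction carries four times the multiply-adds of a single Z-register FMLA. Whether that translates into higher throughput on real hardware is exactly what would need to be measured.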

breuera commented 3 months ago

Short answer: Yes, we think so.

Long answer: Our focus is on enabling our software stack for M4 (incl. SME). Like many other developers, we have 512-bit SVE code ready to go. This goes back to A64FX. We only tested accumulation in the Z registers because, had it performed well, a slight tweak to our GEMM code would have given us a free speedup.

But: Given the poor performance, we are limiting our attention to MOPA and Neon. Please feel free to submit a PR if you'd like to add another micro to the repo. We'll return to the SME vector capabilities once we are done with the GEMMs. FYI: We are actively working on this. Right now it's the mundane task of encoding the necessary SME instructions, detecting M4, and getting our JITters to run on iOS. I understand that our small homepage does not show the full scope of our project. Hopefully this will become clearer once the project is further along.

breuera commented 2 weeks ago

We have updated the SME homepage today. It now includes results for FMLA with ZA as the destination.

scalable-analyses / sme

Use multi-register FMLA targeting ZA #3