The VFMAS instruction is used far less often than VFMA, and when it is applicable it will often be used with a constant scalar, e.g. in a Newton-Raphson inverse square root approximation step:
x = x * (1.5 - 0.5 * x * x) // Can use VMUL ; VFMAS ; VMUL
MVE gives us two VFMA forms (V * V + V, V * S + V), and the VFMAS form V * V + S. A key difference is which input register is modified - for VFMA it's the addend, while VFMAS writes back to one of the multiplicands.
Clang can generate VFMAS either from the vfmasq intrinsic, or from a float32x4_t * float32x4_t + float32_t expression, but only if the scalar is not a known constant. If the scalar is constant, it is always loaded into a vector register and the all-vector VFMA is used, even though this inevitably means accompanying every VFMA with a VMOV, as the constant addend gets overwritten.
I've had no success in generating a VFMAS instruction for a constant scalar, so my Newton-Raphson iterations are VMUL; VMOV; VFMA; VMUL.
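To make the register difference concrete, here is a rough host-side C model of the two sequences. It is only a sketch under stated assumptions: q_reg, the model_* helpers, nr_step_vfmas, nr_step_vfma and neg_half_a are hypothetical names, not MVE types or intrinsics; the step is assumed to be the full x * (1.5 - 0.5 * a * x * x) with -0.5 * a precomputed in a vector; and plain multiply-add is used, so fused rounding is not modelled.

```c
/* Hypothetical 4-lane stand-in for an MVE Q register (not a real MVE type). */
typedef struct { float lane[4]; } q_reg;

/* Models VMUL.F32 Qd, Qn, Qm: Qd = Qn * Qm */
static void model_vmul(q_reg *qd, const q_reg *qn, const q_reg *qm)
{
    for (int i = 0; i < 4; i++)
        qd->lane[i] = qn->lane[i] * qm->lane[i];
}

/* Models VFMA.F32 Qda, Qn, Qm: Qda = Qda + Qn * Qm (the addend is overwritten) */
static void model_vfma(q_reg *qda, const q_reg *qn, const q_reg *qm)
{
    for (int i = 0; i < 4; i++)
        qda->lane[i] = qda->lane[i] + qn->lane[i] * qm->lane[i];
}

/* Models VFMAS.F32 Qda, Qn, Rm: Qda = Qda * Qn + Rm (a multiplicand is overwritten) */
static void model_vfmas(q_reg *qda, const q_reg *qn, float rm)
{
    for (int i = 0; i < 4; i++)
        qda->lane[i] = qda->lane[i] * qn->lane[i] + rm;
}

/* VMUL ; VFMAS ; VMUL: the scalar 1.5 survives, so no copy is needed. */
static void nr_step_vfmas(q_reg *x, const q_reg *neg_half_a)
{
    q_reg t;
    model_vmul(&t, x, x);               /* t = x * x                */
    model_vfmas(&t, neg_half_a, 1.5f);  /* t = t * (-0.5 * a) + 1.5 */
    model_vmul(x, x, &t);               /* x = x * t                */
}

/* VMUL ; VMOV ; VFMA ; VMUL: the constant vector must be copied first,
 * because VFMA would otherwise destroy it. */
static void nr_step_vfma(q_reg *x, const q_reg *neg_half_a, const q_reg *k1_5)
{
    q_reg t, acc;
    model_vmul(&t, x, x);               /* t = x * x                  */
    acc = *k1_5;                        /* VMOV: copy of the constant */
    model_vfma(&acc, &t, neg_half_a);   /* acc = 1.5 + t * (-0.5 * a) */
    model_vmul(x, x, &acc);             /* x = x * acc                */
}
```

Both functions compute the same step; the only difference in the model is the extra copy that the VFMA form needs to keep the constant alive across iterations.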
Non-constant scalar:
#include <arm_mve.h>

float32x4_t func3(float32x4_t a, float32x4_t b, float32_t c)
{
    a = vfmasq(a, b, c);
    a = vfmasq(a, b, c);
    return vfmasq(a, b, c);
}
The issue appears to be quite specific to VFMAS. VMLAS.I32 Qda,Qn,Rm and VMUL.F32 Qd,Qn,Rm are generated as you'd expect with constant or variable scalars.
Constant scalar:
More examples at https://godbolt.org/z/cc5navr54