llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

ARM MVE - VFMAS instruction never generated if the scalar is constant #61218

Open kjbracey opened 1 year ago

kjbracey commented 1 year ago

The VFMAS instruction is used far less often than VFMA, and when it does apply, the scalar is often a constant, e.g. in a Newton-Raphson inverse square root approximation step:

x = x * (1.5 - 0.5 * x * x)            // Can use VMUL ; VFMAS ; VMUL
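
As a rough sketch of that decomposition, here is one lane of the step written in plain C. The function name `nr_step_lane` is made up for illustration, and the instruction named on each line is the intended mapping, not something a compiler is guaranteed to emit.

```c
/* One lane of x = x * (1.5 - 0.5 * x * x), split into the three
 * operations named in the comment above.
 * Hypothetical helper for illustration, not part of any API. */
float nr_step_lane(float x)
{
    float t = x * -0.5f;   /* VMUL:  vector * scalar                    */
    t = t * x + 1.5f;      /* VFMAS: vector * vector + scalar; the real */
                           /* instruction is fused, plain C may not be  */
    return x * t;          /* VMUL:  vector * vector                    */
}
```

The constant 1.5 only ever appears as the scalar addend, which is the operand VFMAS leaves untouched.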

MVE provides two VFMA forms (V * V + V and V * S + V) and one VFMAS form (V * V + S). A key difference is which input register is overwritten: for VFMA it is the addend, whereas VFMAS writes back to one of the multiplicands.
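
To make the operand roles concrete, the three forms can be modelled on a four-lane struct in portable C. This is only a sketch: `q_reg` and the `model_*` names are invented here and are not `arm_mve.h` types or intrinsics, and the real instructions are fused where this arithmetic may not be.

```c
/* Four-lane stand-in for an MVE Q register (illustrative type only). */
typedef struct { float lane[4]; } q_reg;

/* VFMA Qda, Qn, Qm : Qda = Qda + Qn * Qm  (the addend is overwritten) */
void model_vfma_vv(q_reg *qda, const q_reg *qn, const q_reg *qm)
{
    for (int i = 0; i < 4; i++)
        qda->lane[i] += qn->lane[i] * qm->lane[i];
}

/* VFMA Qda, Qn, Rm : Qda = Qda + Qn * Rm  (the addend is overwritten) */
void model_vfma_vs(q_reg *qda, const q_reg *qn, float rm)
{
    for (int i = 0; i < 4; i++)
        qda->lane[i] += qn->lane[i] * rm;
}

/* VFMAS Qda, Qn, Rm : Qda = Qda * Qn + Rm  (a multiplicand is overwritten,
 * the scalar addend in Rm survives) */
void model_vfmas(q_reg *qda, const q_reg *qn, float rm)
{
    for (int i = 0; i < 4; i++)
        qda->lane[i] = qda->lane[i] * qn->lane[i] + rm;
}
```

With a constant addend, the VFMAS shape keeps the constant in Rm across repeated steps, whereas either VFMA shape destroys the register holding it and needs a fresh copy each time.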

Clang can generate VFMAS either from the vfmasq intrinsic or from a float32x4_t * float32x4_t + float32_t expression, but only if the scalar is not a known constant. If the scalar is constant, it is always loaded into a vector register and the all-vector VFMA is used, even though this inevitably means accompanying every VFMA with a VMOV, as the constant addend gets overwritten.
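
For reference, the expression form looks roughly like the following. This is a host-side analogue using the GCC/Clang `vector_size` extension as a stand-in for `float32x4_t`, so that it builds without an MVE target; the type name `v4f` and both function names are assumptions made for the example.

```c
/* GCC/Clang vector extension used as a stand-in for float32x4_t. */
typedef float v4f __attribute__((vector_size(16)));

/* vector * vector + scalar, with the scalar only known at run time */
v4f fma_expr_variable(v4f a, v4f b, float c)
{
    return a * b + c;      /* the scalar is broadcast across the lanes */
}

/* the same shape with a literal scalar */
v4f fma_expr_constant(v4f a, v4f b)
{
    return a * b + 1.5f;
}
```

On an MVE target, the first shape is the one reported above to select VFMAS and the second is the one that falls back to VFMA with a vector constant.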

I've had no success in generating a VFMAS instruction for a constant scalar, so my Newton-Raphson iterations are VMUL; VMOV; VFMA; VMUL.

Non-constant scalar:

float32x4_t func3(float32x4_t a, float32x4_t b, float32_t c)
{
    a = vfmasq(a,b,c);
    a = vfmasq(a,b,c);
    return vfmasq(a,b,c);
}

func3:
        vmov    r0, s8
        vfmas.f32       q0, q1, r0
        vfmas.f32       q0, q1, r0
        vfmas.f32       q0, q1, r0
        bx      lr

Constant scalar:

float32x4_t func1(float32x4_t a, float32x4_t b)
{
    a = vfmasq(a,b,1.5f);
    a = vfmasq(a,b,1.5f);
    return vfmasq(a,b,1.5f);
}

func1:
        vmov.f32        q2, #1.500000e+00
        vmov    q3, q2
        vfma.f32        q3, q0, q1
        vmov    q0, q2
        vfma.f32        q0, q3, q1
        vfma.f32        q2, q0, q1
        vmov    q0, q2
        bx      lr

More examples at https://godbolt.org/z/cc5navr54

The issue appears to be quite specific to VFMAS. VMLAS.I32 Qda,Qn,Rm and VMUL.F32 Qd,Qn,Rm are generated as you'd expect with constant or variable scalars.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-arm