FMA thoughts - Githubissues

Thanks for the well documented and elegantly coded repository. The documentation correctly notes that FMA is a single operation for ARM, but not for SSE.

One option would be to use AVX FMA3 instructions provided with modern AMD/Intel CPUs. The FreePascal code below illustrates this. I can understand why you might not want to do this - SSE is available with every x86_64 CPU, while AVX requires a more recent computer (and therefore one would want to detect this feature).

Any plans to extend this repository to combine other AVX instructions? On one hand, the 128-bit SSE is perfect for your 4-component singles, but I would have thought the 256-bits would allow you to tackle double precision values.

program avxTst;
{$mode delphi}   
uses sysutils;

//https://stackoverflow.com/questions/41468871/how-to-detect-avx2-in-delphi
{$ASMMODE INTEL}
function IsAVX2supported: boolean;
asm
    // Save EBX
    {$IFDEF CPUx86}
      push ebx
    {$ELSE CPUx64}
      mov r10, rbx
    {$ENDIF}
    //Check CPUID.0
    xor eax, eax
    cpuid //modifies EAX,EBX,ECX,EDX
    cmp al, 7 // do we have a CPUID leaf 7 ?
    jge @Leaf7
      xor eax, eax
      jmp @Exit
    @Leaf7:
      //Check CPUID.7
      mov eax, 7h
      xor ecx, ecx
      cpuid
      bt ebx, 5 //AVX2: CPUID.(EAX=07H, ECX=0H):EBX.AVX2[bit 5]=1
      setc al
   @Exit:
   // Restore EBX
   {$IFDEF CPUx86}
     pop ebx
   {$ELSE CPUx64}
     mov rbx, r10
   {$ENDIF}
end;

type
TVec4 = packed record 
    x,y,z,w: single;
end;

function v(x,y,z,w:single): TVec4;
begin
    result.x := x;
    result.y := y;
    result.z := z;
    result.w := w;
end;

{$ASMMODE INTEL}
function FMA(const A, B, C: TVec4): TVec4;
//result := (A * B) + C
//https://neslib.github.io/FastMath/Neslib.FastMath.html#FMA
//https://en.wikipedia.org/wiki/FMA_instruction_set#Excerpt_from_FMA3
begin
    asm
        movups    xmm0, [C]
        movups    xmm1, [B]
        movups    xmm2, [A]
        vfmadd231ps xmm0, xmm1, xmm2 //a = b·c + a  
        movhlps xmm1, xmm0
        vzeroupper      
    end;
end;

procedure tst;
var
    a,b,c,d: TVec4;
begin
    a := v(1,2,3,4);
    b := v(5,10,10,10);
    c := v(22,1,1,1);
    d := FMA(a,b,c);
    writeln(format('xyzw: %g %g %g %g', [d.x, d.y, d.z, d.w])); 
end;

begin
    if not IsAVX2supported then begin
        writeln('AVX2 is not supported');
        exit;
    end;
    tst;
end.

Glad you like the repo. I currently don't have plans to support AVX or later instruction sets, or to support double-precision math. Some functionality in FastMath may benefit from AVX instructions, but in general, the small gain in speed is not worth the extra effort IMHO.

It's a lot of work designing and writing assembly routines for 4 different architectures alone (x86, x64, NEON and ARM64). Supporting newer instruction sets makes that even harder. Also, I settled on an instruction set (SSE2) that virtually all Intel/AMD-based computers support nowadays. Like you said, we would have to perform CPU capability checks to add support for AVX and other instruction sets. That would also mean we would have to dynamically dispatch routines to different versions based on CPU caps. Since most routines are pretty small, we have to take the additional overhead of this dynamic dispatch into account. Or maybe use conditional defines to force an instruction set (but then the app will crash on computers that don't support it).

I may add that in the future to support wider registers and double-precision floating-point math.

In the meantime, I may add AVX support for specific use cases, like the FMA example you provided. I will look into this when I have some time...

neslib / FastMath

FMA thoughts #2