neurolabusc closed this issue 4 years ago
Glad you like the repo. I currently don't have plans to support AVX or later instruction sets, or to support double-precision math. Some functionality in FastMath may benefit from AVX instructions, but in general, the small gain in speed is not worth the extra effort IMHO.
Designing and writing assembly routines for 4 different architectures alone (x86, x64, 32-bit ARM with NEON, and ARM64) is already a lot of work, and supporting newer instruction sets makes that even harder. Also, I settled on an instruction set (SSE2) that virtually all Intel/AMD-based computers support nowadays. Like you said, we would have to perform CPU capability checks to add support for AVX and other instruction sets, and then dynamically dispatch each routine to the right version based on those capabilities. Since most routines are pretty small, the overhead of that dynamic dispatch has to be taken into account. The alternative would be conditional defines that force an instruction set at compile time, but then the app will crash on computers that don't support it.
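To make the capability check and dispatch concrete, here is a rough C sketch (not code from FastMath, which is written in Pascal and assembly). It assumes GCC or Clang on x86-64, where `__builtin_cpu_supports` and the `target` function attribute are available; the names `mul_add4_fn`, `mul_add4_sse`, `mul_add4_fma` and `select_mul_add4` are hypothetical and only serve the illustration. On other compilers or architectures it falls back to plain scalar code.

```c
/* Hypothetical routine: out[i] = a[i] * b[i] + c[i] for 4 floats. */
typedef void (*mul_add4_fn)(const float *a, const float *b,
                            const float *c, float *out);

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>

/* Baseline path: SSE has no fused multiply-add, so multiply, then add. */
static void mul_add4_sse(const float *a, const float *b,
                         const float *c, float *out)
{
    __m128 prod = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    _mm_storeu_ps(out, _mm_add_ps(prod, _mm_loadu_ps(c)));
}

/* FMA3 path: the target attribute enables FMA for this function only,
   so the rest of the program can still be built for the SSE2 baseline. */
__attribute__((target("fma")))
static void mul_add4_fma(const float *a, const float *b,
                         const float *c, float *out)
{
    _mm_storeu_ps(out, _mm_fmadd_ps(_mm_loadu_ps(a), _mm_loadu_ps(b),
                                    _mm_loadu_ps(c)));
}

/* Runtime capability check: pick the FMA version only if the CPU has it. */
static mul_add4_fn select_mul_add4(void)
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") ? mul_add4_fma : mul_add4_sse;
}
#else
/* Portable fallback for other compilers and architectures. */
static void mul_add4_scalar(const float *a, const float *b,
                            const float *c, float *out)
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * b[i] + c[i];
}

static mul_add4_fn select_mul_add4(void)
{
    return mul_add4_scalar;
}
#endif
```

A caller would run `select_mul_add4()` once at startup, store the returned pointer, and call through it afterwards. Every call then pays for an indirect call that cannot be inlined, which is the dispatch overhead that matters for routines this small.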
I may add that in the future to support wider registers and double-precision floating-point math.
In the meantime, I may add AVX support for specific use cases, like the FMA example you provided. I will look into this when I have some time...
Thanks for the well documented and elegantly coded repository. The documentation correctly notes that FMA is a single operation for ARM, but not for SSE.
One option would be to use the FMA3 instructions (part of the AVX family) available on modern AMD/Intel CPUs. The FreePascal code below illustrates this. I can understand why you might not want to do this: SSE2 is available on every x86_64 CPU, while AVX and FMA3 require a more recent computer (and therefore one would want to detect this feature at run time).
Any plans to extend this repository to use other AVX instructions? On one hand, 128-bit SSE is perfect for your 4-component single-precision vectors, but I would have thought the 256-bit AVX registers would allow you to handle 4-component double-precision values as well.