kimwalisch / primesieve

🚀 Fast prime number generator
BSD 2-Clause "Simplified" License
963 stars 123 forks source link

Add vectorized fillNextPrimes() algorithm for other CPU archtectures (e.g. arm64) #114

Open kimwalisch opened 2 years ago

kimwalisch commented 2 years ago

primesieve::iterator's performance depends heavily on the fillNextPrimes() method from PrimeGenerator.cpp. For x64 we have a vectorized AVX512 algorithm that is pretty optimal for this task. Once other CPU architectures (e.g. arm64) support 512-bit vector instructions like AVX512 we should port our AVX512 algorithm to these CPU architectures.

ARM has recently added (2021) the Scalable Vector Extension (SVE) to its CPUs. SVE is supposed to be a portable vector instruction set that works with different vector instructions widths. However for vectorizing our fillNextPrimes() method we need at least 512-bit vector instructions.

kimwalisch commented 9 months ago

Primesieve's AVX512 fillNextPrimes() algorithm is close to optimal for converting 1-bits from the sieve array into the corresponding bit indexes/positions (and next into prime numbers). The algorithm mainly relies on the VPCOMPRESSB and VPERMB instructions from AVX512.

According to ChatGPT the new ARM SVE instruction set has those instructions as well.

This means that it should be relatively easy to port Primesieve's AVX512 fillNextPrimes() algorithm to ARM SVE. But there is one big drawback, in ARM SVE the vector length is dynamic which adds another layer of complexity for implementing this algorithm.

kimwalisch commented 9 months ago

According to ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads, as of 2024 there are no ARM CPUs out yet that support SVE or SV2. And SVE2 will likely be limited to 128-bits for the next few years, on top of that the dynamic vector length of ARM SVE is a pain in the ass in many cases (including its use in primesieve). Maybe these issues explain why there is no wide adoption of ARM SVE yet...

It is best to wait a few years until ARM CPUs ship a vector extension of at least 256-bits (ideally we want 512-bits like in AVX512.). Ideally we want a fixed vector length instruction set, similar to ARM NEON or AVX512.