Open mratsim opened 2 months ago
Update following discussion with @Vindaar
Given that our numbers are actually small in size, just 381-bit in production, AVX512-IFMA is actually unnecessary:
Instead we should just use addcarry/subborrow/extended_multiplication and process 4, 8 or 16 field elements at once.
Actually there is no addcarry so the sequence will be similar to the following
So 2 addition, 2 comparison, 1 or per addcarry, this is too costly and unsaturated arithmetic will be necessary, so IFMA is back on the table.
It might be quite interesting to explore SIMD vectorization for elliptic curves and MSMs. This might significantly speed-up:
without needing a GPU. Ideally the same optimizations are written in "portable" intrinsics so that can be used with ARM Neon as well. Speedup can be done either horizontally, i.e. processing 8 points in parallel (8*64 = 512) or within some subroutines try to use as much parallelism as possible, i.e. on Fp2 or at the Elliptic curve level.
This probably requires full support for signed unsaturated arithmetic which is partially supported here:
CPU support:
While recent Intel CPUs (AlderLake and later) don't support AVX512, it might be that they support the 256-bit version of IFMA
Papers: