mratsim / constantine

Constantine: modular, high-performance, zero-dependency cryptography stack for verifiable computation, proof systems and blockchain protocols.
Other
389 stars 43 forks source link

SIMD Vectorization - Use Integer Fused Multiply-Add (AVX512) #427

Open mratsim opened 2 months ago

mratsim commented 2 months ago

It might be quite interesting to explore SIMD vectorization for elliptic curves and MSMs. This might significantly speed-up:

without needing a GPU. Ideally the same optimizations are written in "portable" intrinsics so that can be used with ARM Neon as well. Speedup can be done either horizontally, i.e. processing 8 points in parallel (8*64 = 512) or within some subroutines try to use as much parallelism as possible, i.e. on Fp2 or at the Elliptic curve level.

This probably requires full support for signed unsaturated arithmetic which is partially supported here:

CPU support:

While recent Intel CPUs (AlderLake and later) don't support AVX512, it might be that they support the 256-bit version of IFMA

Papers:

mratsim commented 1 month ago

Update following discussion with @Vindaar

Given that our numbers are actually small in size, just 381-bit in production, AVX512-IFMA is actually unnecessary:

Instead we should just use addcarry/subborrow/extended_multiplication and process 4, 8 or 16 field elements at once.

mratsim commented 1 month ago

Actually there is no addcarry so the sequence will be similar to the following

https://github.com/mratsim/constantine/blob/0354d5b25a801a14ec3b715c2f5ded3fed54f804/constantine/platforms/llvm/super_instructions.nim#L82-L93

So 2 addition, 2 comparison, 1 or per addcarry, this is too costly and unsaturated arithmetic will be necessary, so IFMA is back on the table.