SIMD Vectorization - Use Integer Fused Multiply-Add (AVX512)

mratsim commented 2 months ago

It might be quite interesting to explore SIMD vectorization for elliptic curves and MSMs. This might significantly speed up:

Verkle Trees
KZG
MSM

without needing a GPU. Ideally the same optimizations are written with "portable" intrinsics so that they can be used with ARM Neon as well. The speedup can be obtained either horizontally, i.e. processing 8 points in parallel (8 lanes * 64 bits = 512 bits), or by exploiting as much parallelism as possible within some subroutines, e.g. at the Fp2 or elliptic curve level.
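To make the "8 points in parallel" option concrete, here is a minimal Python model of the data layout it implies: 8 field elements are transposed so that each 512-bit register holds the same limb of 8 different elements, one per 64-bit lane. The names `pack_batch` and `unpack_batch` and the choice of 6 saturated 64-bit limbs are assumptions made for illustration; this is not Constantine code.

```python
# Hypothetical sketch: transpose 8 field elements (array-of-structures) into
# limb-major vectors (structure-of-arrays), one element per 64-bit lane.
LANES = 8        # 8 lanes * 64 bits = one 512-bit register
LIMB_BITS = 64   # saturated limbs, assumed for this sketch
NUM_LIMBS = 6    # 6 * 64 = 384 bits, enough for a 381-bit element
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x):
    """Split an integer into NUM_LIMBS little-endian 64-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(NUM_LIMBS)]


def pack_batch(elements):
    """Return NUM_LIMBS vectors; vectors[i][lane] is limb i of element `lane`."""
    assert len(elements) == LANES
    per_element = [to_limbs(x) for x in elements]
    return [[per_element[lane][i] for lane in range(LANES)]
            for i in range(NUM_LIMBS)]


def unpack_batch(vectors):
    """Inverse of pack_batch: rebuild the 8 integers from limb-major vectors."""
    return [sum(vectors[i][lane] << (LIMB_BITS * i) for i in range(NUM_LIMBS))
            for lane in range(LANES)]
```

With this layout, one vector instruction on `vectors[i]` performs the same limb operation for all 8 elements at once, which is why the per-element algorithm itself does not need to change.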

This probably requires full support for signed unsaturated arithmetic, which is partially supported here:

https://github.com/mratsim/constantine/blob/405ec70/constantine/platforms/abstractions.nim#L145-L317
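As a rough illustration of what "unsaturated" means here, the following Python sketch stores a 381-bit value in 52-bit limbs held in 64-bit words, so each word has 12 spare bits and additions can skip carry propagation until a later normalization step. It only covers the unsigned case; the signed variant mentioned above is not shown. The radix, limb count and function names are assumptions for this sketch and do not mirror the linked Constantine code.

```python
# Hypothetical sketch of unsaturated (radix 2^52) limbs stored in 64-bit words.
LIMB_BITS = 52
NUM_LIMBS = 8                       # 8 * 52 = 416 bits >= 381 bits
LIMB_MASK = (1 << LIMB_BITS) - 1
WORD_MASK = (1 << 64) - 1


def to_limbs(x):
    """Split an integer into NUM_LIMBS little-endian 52-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(NUM_LIMBS)]


def from_limbs(limbs):
    """Rebuild the integer; also valid when limbs exceed 52 bits."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def add_lazy(a, b):
    """Limb-wise addition with no carry propagation between limbs."""
    out = [x + y for x, y in zip(a, b)]
    # The spare bits must absorb the excess, otherwise a real 64-bit word wraps.
    assert all(limb <= WORD_MASK for limb in out)
    return out


def normalize(limbs):
    """Propagate the accumulated excess so every limb is below 2^52 again."""
    out, carry = [], 0
    for limb in limbs:
        t = limb + carry
        out.append(t & LIMB_MASK)
        carry = t >> LIMB_BITS
    assert carry == 0, "value no longer fits in NUM_LIMBS limbs"
    return out
```

The design trade-off is more limbs (8 instead of 6 for 381 bits) in exchange for additions that are independent per limb, which maps directly onto SIMD lanes.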

CPU support:

https://en.wikichip.org/wiki/x86/avx512_ifma

While recent Intel CPUs (Alder Lake and later) don't support AVX512, they might support the 256-bit version of IFMA.
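For readers unfamiliar with the extension, the two IFMA instructions (VPMADD52LUQ and VPMADD52HUQ, exposed in C as `_mm512_madd52lo_epu64` and `_mm512_madd52hi_epu64`) can be modelled lane by lane as below. This is a plain Python model of the documented semantics, written for illustration; the function names `madd52lo` and `madd52hi` are made up for this sketch.

```python
# Lane-wise model of the AVX512-IFMA multiply-add pair.
MASK52 = (1 << 52) - 1
MASK64 = (1 << 64) - 1


def madd52lo(acc, b, c):
    """Per lane: acc + low 52 bits of (b[51:0] * c[51:0]), wrapping at 64 bits."""
    return [(a + (((x & MASK52) * (y & MASK52)) & MASK52)) & MASK64
            for a, x, y in zip(acc, b, c)]


def madd52hi(acc, b, c):
    """Per lane: acc + bits 103..52 of (b[51:0] * c[51:0]), wrapping at 64 bits."""
    return [(a + (((x & MASK52) * (y & MASK52)) >> 52)) & MASK64
            for a, x, y in zip(acc, b, c)]
```

Only the low 52 bits of each multiplicand are used, which is why operands have to be kept in an unsaturated 52-bit limb representation before these instructions become useful.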

Papers:

https://cdrdv2.intel.com/v1/dl/getContent/812656?fileName=Intel_AVX512_Fast_Modular_Multiplication_Algorithm_Technology_Guide_812656v1.pdf (src: https://www.intel.com/content/www/us/en/content-details/812656/intel-avx-512-fast-modular-multiplication-technique-technology-guide.html)
https://eprint.iacr.org/2021/420.pdf Intel HEXL: Accelerating Homomorphic Encryption with Intel AVX512-IFMA52
https://ieeexplore.ieee.org/document/7563269 Accelerating Big Integer Arithmetic Using Intel IFMA Extensions
https://eprint.iacr.org/2018/335.pdf Fast modular squaring with AVX512IFMA
https://orbilu.uni.lu/bitstream/10993/52467/1/TCHES2022.pdf Highly Vectorized SIKE for AVX-512
https://www.usenix.org/system/files/sec24fall-prepub-604-zhang-jipeng.pdf Includes formal verification of AVX512-IFMA field and EC operations
https://github.com/dalek-cryptography/curve25519-dalek/blob/5b7082b/curve25519-dalek/docs/ifma-notes.md

mratsim commented 1 month ago

Update following discussion with @Vindaar

Given that our numbers are actually small, just 381 bits in production, AVX512-IFMA is actually unnecessary:

It mainly helps when multiplying a single large number, say 1024 bits, see https://github.com/vkrasnov/vpmadd/blob/master/vpmadd_mul1024.s#L204-L237
It requires dealing with unsaturated arithmetic

Instead we should just use addcarry/subborrow/extended_multiplication and process 4, 8 or 16 field elements at once.

mratsim commented 1 month ago

Actually there is no SIMD addcarry, so the sequence would be similar to the following:

https://github.com/mratsim/constantine/blob/0354d5b25a801a14ec3b715c2f5ded3fed54f804/constantine/platforms/llvm/super_instructions.nim#L82-L93

So that is 2 additions, 2 comparisons and 1 OR per addcarry. This is too costly, so unsaturated arithmetic will be necessary and IFMA is back on the table.
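The operation count can be seen in a small lane-wise model. The sketch below emulates an add-with-carry on 64-bit lanes using only wrapping additions, unsigned comparisons and an OR, which is the shape of sequence described above. It is an illustrative Python model with a hypothetical name (`addcarry_lanes`), not a transcription of the linked code.

```python
# Hypothetical model of add-with-carry emulated on saturated 64-bit lanes.
MASK64 = (1 << 64) - 1


def addcarry_lanes(a, b, carry_in):
    """Return (sums, carry_out) per lane for a + b + carry_in on 64-bit lanes."""
    sums, carries = [], []
    for x, y, cin in zip(a, b, carry_in):
        t = (x + y) & MASK64           # addition 1 (wrapping)
        c1 = t < x                     # comparison 1: did a + b wrap?
        s = (t + cin) & MASK64         # addition 2 (wrapping)
        c2 = s < t                     # comparison 2: did adding the carry wrap?
        carries.append(int(c1 | c2))   # OR: at most one of the two can be set
        sums.append(s)
    return sums, carries
```

Every limb of every multi-limb addition pays these 5 operations, and the carry chain still serializes the limbs, whereas an unsaturated representation replaces them with a single addition per limb plus an occasional normalization.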

mratsim / constantine

SIMD Vectorization - Use Integer Fused Multiply-Add (AVX512) #427