Open mratsim opened 2 years ago
Relevant:
https://eprint.iacr.org/2022/439.pdf - Efficient Multiplication of Somewhat Small Integers using Number-Theoretic Transforms
https://eprint.iacr.org/2021/1355.pdf - Curve448 on 32-bit ARM Cortex-M4
https://tches.iacr.org/index.php/TCHES/article/view/9295/8861 - Neon NTT: Faster Dilithium, Kyber, and Saber on Cortex-A72 and Apple M1
https://eprint.iacr.org/2021/561.pdf - Kyber on ARM64
https://eprint.iacr.org/2019/721.pdf - Optimized SIKE Round 2 on 64-bit ARM
https://github.com/Mbed-TLS/mbedtls/issues/5666 - Improve Montgomery multiplication strategy with UMAAL instruction for fused {C|D} <- A*B + C + D
https://github.com/Mbed-TLS/mbedtls/issues/5360 - Improve inline assembly for Cortex-M + DSP
https://eprint.iacr.org/2021/185.pdf is particularly interesting regarding general ARM CPUs and Apple CPUs:
Multiplications are about 3x slower than additions on the Raspberry Pi 4 but run at roughly the same speed on Apple CPUs.
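For reference, here is a minimal portable C sketch of the fused operation `{C|D} <- A*B + C + D` named in the Mbed TLS UMAAL issue listed above. The helper names `mul_acc2` and `mul_row_acc` are hypothetical and are not taken from Constantine or Mbed TLS. The point is that with 32-bit words the result always fits in two words, since (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so a multiplication row needs no separate carry handling. Whether a given compiler lowers this to a single `UMAAL` is an assumption that would need checking in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: portable model of {hi|lo} <- a*b + c + d.
 * The 64-bit sum cannot overflow:
 * (2^32-1)^2 + 2*(2^32-1) == 2^64 - 1. */
static inline void mul_acc2(uint32_t *hi, uint32_t *lo,
                            uint32_t a, uint32_t b,
                            uint32_t c, uint32_t d) {
    uint64_t t = (uint64_t)a * b + c + d;
    *lo = (uint32_t)t;
    *hi = (uint32_t)(t >> 32);
}

/* Hypothetical helper: one row of schoolbook multiplication,
 * r[0..n-1] += a[0..n-1] * b, with the final carry stored in r[n].
 * Assumes r has n+1 limbs and r[n] does not hold live data yet. */
static void mul_row_acc(uint32_t *r, const uint32_t *a, size_t n, uint32_t b) {
    assert(r != NULL && a != NULL);
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Arguments are read by value before hi/lo are written,
         * so reusing carry and r[i] as outputs is safe. */
        mul_acc2(&carry, &r[i], a[i], b, r[i], carry);
    }
    r[n] = carry;
}
```

Each loop iteration is exactly one fused multiply-accumulate, which is why an ARM code generator would want to expose this operation as a primitive.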
https://github.com/mratsim/constantine/pull/69 introduced an assembly code generator for x86 and x86-64 at https://github.com/mratsim/constantine/blob/7d29cb9/constantine/platforms/isa/macro_assembler_x86.nim
We need the same for ARM to be efficient on Raspberry Pi, phones, Apple Silicon, and resource-restricted devices.
Efficient multiplication on ARM:
Related papers:
https://eprint.iacr.org/2021/185.pdf