This PR merges several experiments to implement modular addition in pure LLVM IR. The goal is to generate multiplatform code from LLVM IR instead of writing an assembly backend for each target, especially for ARM and AMDGPU, which support addition-with-carry as well as SIMD, without having to do the vectorization myself.
432a91e3a935915c91e2946dd4144ee1cf10cb07 implements word-level (64-bit by 64-bit) modular addition with inlined adc/sbb "super-instructions" (along with mulExt/muladd1/muladd2/mulacc). This generates optimal code on x86-64 but not on ARM, due to https://github.com/llvm/llvm-project/issues/102062. In addition, the alloca needs to be specialized to addrspace(5) on AMDGPUs: https://github.com/llvm/llvm-project/issues/102058
0354d5b refactors the IR to add callable functions, linkage, calling conventions and boilerplate reductions
08b8671817014a43fb7f8fd5d389d1e024470173 introduces inline/alwaysinline functions, which fail to work around the bad codegen. Unfortunately, inlining breaks the fusion of the sub/icmp instructions into a single sub-with-borrow
a76cfd87261a0c9d357da65286e9e8f02bbb4258 partially works around inlining and instruction fusion by not using an inlining pass, only alwaysinline
b415418 materializes addcarry/subborrow as actual functions in LLVM IR instead of being just a sequence of instructions. Unfortunately, this doesn't optimize well on ARM: https://github.com/llvm/llvm-project/issues/102062
480ede5 uses the builtin llvm.uadd.with.overflow and llvm.usub.with.overflow intrinsics to try to work around the bad codegen. While this is portable to x86 and ARM for primes up to 256 bits, it produces 33% extra instructions for prime fields beyond that threshold: https://github.com/llvm/llvm-project/issues/103717
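For readers unfamiliar with the carry-chain pattern these commits are trying to get out of LLVM, here is a minimal C sketch of word-level modular addition. It is not the code from this PR: the names `addcarry`, `subborrow`, `mod_add` and the `LIMBS` constant are hypothetical, the limb count is fixed at 4 (256 bits) purely for illustration, and it assumes both inputs are already reduced below the modulus. It relies on the GCC/Clang `__builtin_add_overflow` / `__builtin_sub_overflow` builtins, which Clang typically lowers to the `llvm.uadd.with.overflow` / `llvm.usub.with.overflow` intrinsics mentioned above.

```c
#include <stdint.h>

#define LIMBS 4 /* hypothetical: 4 x 64-bit words = 256 bits */

/* Add with carry: returns a + b + cin (mod 2^64), writes the carry-out (0 or 1). */
static inline uint64_t addcarry(uint64_t a, uint64_t b, uint64_t cin, uint64_t *cout) {
    uint64_t t, r;
    unsigned c1 = __builtin_add_overflow(a, b, &t);
    unsigned c2 = __builtin_add_overflow(t, cin, &r);
    *cout = (uint64_t)(c1 | c2); /* at most one of the two additions can overflow */
    return r;
}

/* Subtract with borrow: returns a - b - bin (mod 2^64), writes the borrow-out (0 or 1). */
static inline uint64_t subborrow(uint64_t a, uint64_t b, uint64_t bin, uint64_t *bout) {
    uint64_t t, r;
    unsigned b1 = __builtin_sub_overflow(a, b, &t);
    unsigned b2 = __builtin_sub_overflow(t, bin, &r);
    *bout = (uint64_t)(b1 | b2);
    return r;
}

/* r = (a + b) mod p, assuming a < p and b < p. Limbs are little-endian. */
static void mod_add(uint64_t r[LIMBS], const uint64_t a[LIMBS],
                    const uint64_t b[LIMBS], const uint64_t p[LIMBS]) {
    uint64_t sum[LIMBS], diff[LIMBS];
    uint64_t carry = 0, borrow = 0;

    /* Carry chain: ideally one add followed by adc instructions. */
    for (int i = 0; i < LIMBS; i++)
        sum[i] = addcarry(a[i], b[i], carry, &carry);

    /* Borrow chain: ideally one sub followed by sbb instructions. */
    for (int i = 0; i < LIMBS; i++)
        diff[i] = subborrow(sum[i], p[i], borrow, &borrow);

    /* Keep the reduced value if the addition overflowed 2^256,
       or if sum >= p (the subtraction did not borrow). */
    uint64_t use_diff = carry | (borrow ^ 1);
    uint64_t mask = (uint64_t)0 - use_diff; /* all ones or all zeros */

    /* Branchless select between diff and sum. */
    for (int i = 0; i < LIMBS; i++)
        r[i] = (diff[i] & mask) | (sum[i] & ~mask);
}
```

Compiling a function like this with `clang -O2 -S` for each target and counting the instructions in the two chains is one way to observe the kind of codegen differences tracked in the linked LLVM issues.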
That was to try to address:
Unfortunately, compilers are still inefficient at translating modular addition into optimal code. See https://github.com/mratsim/constantine/issues/357:
Description of experiments: