Open 54aefcd4-c07d-4252-8441-723563c8826f opened 5 years ago
Even that's not enough. We would also need to use jrcxz for the loop control and an lea for the index adjustment, since neither of those modifies the flags. And we would need to keep flags alive across basic block boundaries, which I don't think we usually do.
Hmm. Indeed. LLVM should also emit a proper mulx here for adcx to make sense.
The X86 backend isn't currently set up to model the C flag and O flag separately. We model all of the flags as one register. Because of this we can't interleave the flag dependencies. We would need to do something about that before it makes sense to implement _addcarryx_u64 as anything other than plain adc.
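To illustrate what interleaving the flag dependencies means, here is a minimal portable Rust sketch of one multiply-accumulate row with two independent carry chains. It only models the arithmetic and says nothing about what LLVM emits. The names `adc` and `mul_add_row` are made up for this example, the booleans `cf` and `of` stand in for the C and O flags, and the `u128` product stands in for mulx.

```rust
/// Add with carry-in, returning (sum, carry_out).
/// Hypothetical helper: it models the arithmetic of adc/adcx/adox,
/// not which hardware flag carries the bit.
fn adc(a: u64, b: u64, carry: bool) -> (u64, bool) {
    let (s, c1) = a.overflowing_add(b);
    let (s, c2) = s.overflowing_add(carry as u64);
    // At most one of the two additions can overflow.
    (s, c1 | c2)
}

/// Computes acc += a * b for little-endian limbs, where `b` is a single limb
/// and `acc` has exactly `a.len() + 1` limbs. Returns the carry out of the
/// top limb of `acc`.
fn mul_add_row(acc: &mut [u64], a: &[u64], b: u64) -> bool {
    assert_eq!(acc.len(), a.len() + 1);
    let mut cf = false; // chain for the low halves (the role of adcx / C flag)
    let mut of = false; // chain for the high halves (the role of adox / O flag)
    for i in 0..a.len() {
        // Full 64x64 -> 128 bit product, split into low and high halves.
        let p = (a[i] as u128) * (b as u128);
        let (lo, hi) = (p as u64, (p >> 64) as u64);
        // The two additions below depend on different carries, so neither
        // has to wait for the other chain.
        let (s, c) = adc(acc[i], lo, cf);
        acc[i] = s;
        cf = c;
        let (s, c) = adc(acc[i + 1], hi, of);
        acc[i + 1] = s;
        of = c;
    }
    // Fold the pending low-chain carry into the top limb.
    let n = a.len();
    let (s, c) = adc(acc[n], 0, cf);
    acc[n] = s;
    // Both remaining carries have the same weight, and at most one is set.
    c | of
}
```

With a single flags register, as the backend models it today, the two `adc` calls in the loop body would have to be serialized through one carry, which is why plain adc is the only sensible lowering for now.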
Is it possible to have custom legalization of e.g. mul i256 that uses mulx, adcx and adox?
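For reference, this is roughly the computation such a legalization would have to produce, written as a portable Rust sketch. It assumes little-endian 64-bit limbs and a truncating result, as `mul i256` has. The name `mul_u256_wrapping` is made up for this example, and it uses a single carry per row rather than separate adcx/adox chains.

```rust
/// Truncating 256-bit multiply on four little-endian u64 limbs.
/// Hypothetical helper: the result is the product modulo 2^256.
fn mul_u256_wrapping(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    let mut out = [0u64; 4];
    for i in 0..4 {
        let mut carry: u64 = 0;
        // Only limbs with i + j < 4 contribute to the truncated result.
        for j in 0..(4 - i) {
            // (2^64 - 1)^2 + 2 * (2^64 - 1) = 2^128 - 1, so this sum
            // cannot overflow u128.
            let t = (a[j] as u128) * (b[i] as u128)
                + (out[i + j] as u128)
                + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        // The carry out of limb 3 is dropped: that is the truncation.
    }
    out
}
```

Each inner step is one widening multiply (the job of mulx) followed by additions with carry, which is where adcx and adox would come in if the two carries were tracked separately.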
Extended Description
Multi-precision (MP) multiply is documented by Intel as one of the main use cases of the addcarryx intrinsics (https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf).
We implement this in Rust as:
which produces the following LLVM-IR after optimizations (https://rust.godbolt.org/z/EJFEHB):
which gets compiled down to the following machine code:
Implementing this operation using inline assembly with the expected machine code output:
and benchmarking both (https://github.com/rust-lang-nursery/stdsimd/issues/666#issuecomment-485065551) shows a significant performance difference: 390ns/iteration for the inline assembly version vs 507ns/iteration for the one using llvm.x86.addcarryx.u64.
It appears that LLVM always replaces llvm.x86.addcarryx.u64 with a polyfill based on llvm.x86.addcarry.u64 and then fails to emit adcx instructions.