https://godbolt.org/z/vovWPxexE
### Expected output
Firstly, the expected "canonical" output, which we got both for `_BitInt(512)` additions and for `__builtin_addcll`, is one `adc` per loop iteration; that is the theoretical optimum here (the full assembly is in the Godbolt link above).
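For context, here is a minimal sketch of what the `__builtin_addcll` variant of such a carry-propagating loop looks like; the function and variable names (`add_adc`, `a`, `b`, `out`, `n`) are illustrative assumptions, not taken from the linked reproducer:

```c
#include <stddef.h>

// Hedged sketch: chained 64-bit additions with the carry threaded through
// __builtin_addcll. Each loop iteration is expected to lower to a single
// adc on x86-64.
unsigned long long add_adc(const unsigned long long *a,
                           const unsigned long long *b,
                           unsigned long long *out, size_t n) {
    unsigned long long carry = 0;
    for (size_t i = 0; i < n; ++i)
        out[i] = __builtin_addcll(a[i], b[i], carry, &carry);
    return carry;
}
```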
### Actual output for 128-bit integers
When attempting to get the same codegen with 128-bit addition, we don't get the optimum:
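Below is a minimal sketch of the variant in question; the names mirror the illustrative ones above, and the actual reproducer is in the Godbolt link:

```c
#include <stddef.h>

// Hedged sketch: the carry is widened to unsigned __int128, added to the
// widened a[i] and b[i], and the high 64 bits become the next carry.
unsigned long long add_u128(const unsigned long long *a,
                            const unsigned long long *b,
                            unsigned long long *out, size_t n) {
    unsigned long long carry = 0;
    for (size_t i = 0; i < n; ++i) {
        unsigned __int128 sum = (unsigned __int128)carry + a[i] + b[i];
        out[i] = (unsigned long long)sum;
        carry = (unsigned long long)(sum >> 64);
    }
    return carry;
}
```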
This compiles to assembly in which LLVM does emit `adc`; however, it emits two per loop iteration. Basically, `u128(carry) + a[i] + b[i]` turns into:
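Roughly the following, rendered here as C for illustration; this is a hedged reconstruction of the shape of the generated code, not the compiler's literal output:

```c
// Hypothetical rendering of one lowered loop iteration: two 64-bit
// additions, with the carry held in a general-purpose register rather
// than living in the CPU carry flag across the loop.
static inline unsigned long long
add_step(unsigned long long ai, unsigned long long bi,
         unsigned long long *carry) {
    unsigned long long t = ai + *carry;  // first addition
    unsigned long long c1 = t < ai;      // carry-out of the first add
    unsigned long long s = t + bi;       // second addition
    unsigned long long c2 = s < t;       // carry-out of the second add
    *carry = c1 | c2;                    // at most one of c1, c2 is set
    return s;                            // becomes out[i]
}
```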
Well, it doesn't quite translate into exactly that, and the real code is a bit interleaved, but overall we need two additions per iteration, and instead of keeping the carry in the CPU's carry flag at all times, it is "spilled" into a register.
### Relevant passes
Prior to x86 instruction selection, the `adc` variant still uses two additions per iteration, expressed as a pair of `uadd.with.overflow` calls; x86-isel then combines these into a single `adc`.
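A C-level approximation of that pre-isel shape, using the `__builtin_uaddll_overflow` builtin as a stand-in for the `llvm.uadd.with.overflow` intrinsic (an illustrative sketch, not the actual IR):

```c
#include <stdbool.h>

// Hypothetical analogue of the pre-isel pattern: two overflow-checked
// adds whose carry-outs are OR'd together to form the next carry. This
// is the shape that x86 instruction selection can fuse into one adc.
static inline unsigned long long
add_overflow_pair(unsigned long long ai, unsigned long long bi,
                  unsigned long long *carry) {
    unsigned long long t, s;
    bool c1 = __builtin_uaddll_overflow(ai, *carry, &t);
    bool c2 = __builtin_uaddll_overflow(t, bi, &s);
    *carry = (unsigned long long)(c1 | c2);
    return s;
}
```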
The 128-bit variant, by contrast, still contains a genuine `i128` addition per iteration prior to instruction selection: roughly, the carry and both operands are zero-extended to `i128`, added, and the high half is shifted down to form the next carry.
Presumably, the `i128` addition would already need to be broken up into separate `uadd.with.overflow` calls prior to instruction selection to get the optimal codegen.