llvmbot opened 6 years ago
We aren't currently working on moving away from the old carry-setting/using nodes to the new ones, but we have no fundamental issue with doing so. I am not sure this statement is much help, though.
Adding some PPC folks who might know what's best to do to improve PPC support and reduce the branching.
Do we need to extend the ISD::ADDCARRY/ISD::SUBCARRY combines (where possible) to better support ISD::ADDE/ISD::SUBE, or can we move PPC away from ISD::ADDC/ISD::SUBC + ISD::ADDE/ISD::SUBE?
rG257acbf6aee9 solves much of the poor codegen generically in DAGCombine
x86/arm code works as expected
ppc struggles as the target still needs to be updated to use ADDCARRY/SUBCARRY
Andy - maybe open a PPC-specific bug?
Current Codegen: https://godbolt.org/z/cVSccY
I have also run into this. Here is the code I am using:
#include <stdint.h>

struct UInt96
{
    uint32_t d2, d1, d0;
};

void ADD96_v1(const UInt96& s, UInt96& d)
{
    uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
    d.d0 = sum; sum >>= 32;
    sum += uint64_t(d.d1) + uint64_t(s.d1);
    d.d1 = sum; sum >>= 32;
    sum += uint64_t(d.d2) + uint64_t(s.d2);
    d.d2 = sum;
}

void ADD96_v2(const UInt96& s, UInt96& d)
{
    uint32_t carry;
    d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
    d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
    d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
}
ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it using __builtin_addc. While the second version works, I am surprised that it is less efficient than the first: on the Godbolt Compiler Explorer, the generated ARM code contains three seemingly unnecessary extra instructions; worse, the PPC code utilises branch operations!
My compiler options are:
-target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2
and:
-target powerpc-unknown-linux-gnu -mcpu=603e -O2
It seems that the __builtin_addc* functions don't really have an advantage; is that true?
Extended Description
clang has __builtin_addc* functions, which are supposed to emit hardware add-with-carry instructions. However, there is no corresponding intrinsic on the LLVM side, so clang emits a sequence of instructions that is only recognized and folded into a single hardware instruction in two cases:
This means that any carry chains longer than 2 result in inefficient code:
Compiles to:
I suppose we're going to need a new target-independent generic intrinsic, say { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1) (and a corresponding one for subtraction as well) and map it to ISD::ADDE / ISD::ADDCARRY.