
Missed optimization: inefficient codegen for __builtin_addc #35591

Open llvmbot opened 6 years ago

llvmbot commented 6 years ago
Bugzilla Link: 36243
Version: trunk
OS: All
Reporter: LLVM Bugzilla Contributor
CC: @davezarzycki, @hfinkel, @RKSimon, @nemanjai, @rotateright
Fixed by commit(s): rG257acbf6aee9

Extended Description

clang has __builtin_addc* functions, which are supposed to emit hardware add-with-carry instructions. However, there is no corresponding intrinsic on the LLVM side, so clang emits a sequence of instructions that is only recognized and folded into a single hardware instruction in two cases.

This means that any carry chain longer than 2 results in inefficient code:

void add3(
    unsigned long long *restrict a,
    unsigned long long *restrict b,
    unsigned long long *restrict c
) {
    unsigned long long cf = 0;
    c[0] = __builtin_addcll(a[0], b[0], cf, &cf);
    c[1] = __builtin_addcll(a[1], b[1], cf, &cf);
    c[2] = __builtin_addcll(a[2], b[2], cf, &cf);
}

Compiles to:

add3:
        .cfi_startproc
# BB#0:
        movq    (%rdi), %rax
        movq    (%rsi), %r8
        leaq    (%rax,%r8), %rcx
        movq    %rcx, (%rdx)
        movq    8(%rdi), %rcx
        addq    8(%rsi), %rcx
        setb    %r9b
        addq    %r8, %rax
        adcq    $0, %rcx
        setb    %al
        orb     %r9b, %al
        movzbl  %al, %eax
        movq    %rcx, 8(%rdx)
        movq    16(%rsi), %rcx
        addq    16(%rdi), %rcx
        addq    %rax, %rcx
        movq    %rcx, 16(%rdx)
        retq
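
For comparison, a hand-written x86-64 sequence using a single adc chain (a sketch of the codegen one would hope for here, not actual compiler output; the add3_ideal label is purely illustrative) looks roughly like this. Note that mov does not modify the flags, so the carry produced by each addq/adcq survives the intervening loads and stores:

add3_ideal:                     # hand-written sketch, not compiler output
        movq    (%rdi), %rax    # a[0]
        addq    (%rsi), %rax    # a[0] + b[0], sets CF
        movq    %rax, (%rdx)    # c[0]
        movq    8(%rdi), %rax   # a[1]
        adcq    8(%rsi), %rax   # a[1] + b[1] + CF, updates CF
        movq    %rax, 8(%rdx)   # c[1]
        movq    16(%rdi), %rax  # a[2]
        adcq    16(%rsi), %rax  # a[2] + b[2] + CF
        movq    %rax, 16(%rdx)  # c[2]
        retq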

I suppose we're going to need a new target-independent generic intrinsic, say { iX, i1 } @llvm.uadd.with.overflow.carry.iX(iX, iX, i1) (and a corresponding one for subtraction), and map it to ISD::ADDE / ISD::ADDCARRY.

nemanjai commented 4 years ago

We aren't currently working on moving from the old carry-setting/using nodes to the new ones, but we have no fundamental issue with doing so. I am not sure this statement is of any help, though.

RKSimon commented 4 years ago

Adding some PPC guys who might know what's best to do to improve PPC support and reduce the branching.

Do we need to extend the ISD::ADD/SUBCARRY combines (where possible) to better support ISD::ADD/SUBE or can we move PPC away from ISD::ADD/SUBC + ISD::ADD/SUBE?

RKSimon commented 4 years ago

rG257acbf6aee9 fixes much of the poor codegen generically in DAGCombine.

x86/ARM code now works as expected.

PPC still struggles, as the target needs to be updated to use ADDCARRY/SUBCARRY.

Andy - maybe open a PPC-specific bug?

RKSimon commented 4 years ago

Current Codegen: https://godbolt.org/z/cVSccY

davezarzycki commented 4 years ago

FYI – https://reviews.llvm.org/D70079

llvmbot commented 6 years ago

I have also run into this. Here is the code I am using:

#include <stdint.h>

struct UInt96
  {
  uint32_t d2, d1, d0;
  };

void ADD96_v1(const UInt96& s, UInt96& d)
  {
  uint64_t sum = uint64_t(d.d0) + uint64_t(s.d0);
  d.d0 = sum; sum >>= 32;
  sum += uint64_t(d.d1) + uint64_t(s.d1);
  d.d1 = sum; sum >>= 32;
  sum += uint64_t(d.d2) + uint64_t(s.d2);
  d.d2 = sum;
  }

void ADD96_v2(const UInt96& s, UInt96& d)
  {
  uint32_t carry;

  d.d0 = __builtin_addc(d.d0, s.d0, 0, &carry);
  d.d1 = __builtin_addc(d.d1, s.d1, carry, &carry);
  d.d2 = __builtin_addc(d.d2, s.d2, carry, &carry);
  }

ADD96_v1 was my original function, and ADD96_v2 is my attempt to rewrite it using __builtin_addc. While the second version works, I am surprised that it is less optimal than the first: according to the Godbolt Compiler Explorer, the generated ARM code contains three additional, seemingly unnecessary instructions; worse, the PPC code uses branch instructions!

My compiler options are:

-target arm-unknown-linux-gnueabihf -mcpu=cortex-a9 -O2

and:

-target powerpc-unknown-linux-gnu -mcpu=603e -O2

It seems that the __builtin_addc* functions don't really have an advantage; is that true?