llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Avoid generating branch in branchless code #40455

Open davidbolvansky opened 5 years ago

davidbolvansky commented 5 years ago
Bugzilla Link 41110
Version trunk
OS Linux
CC @adibiagio, @topperc, @RKSimon, @rotateright

Extended Description

#include <stddef.h>

struct edge {
    int from;
    int to;
};

int edge_to_rank(struct edge e, size_t n) {
    return e.from < e.to ? e.to - 1 : e.from - 1 + n / 2;
}

Clang -O3:

edge_to_rank(edge, unsigned long): # @edge_to_rank(edge, unsigned long)
  mov rax, rdi
  movabs rcx, -4294967296
  mov rdx, rdi
  shr rdx, 32
  cmp eax, edx
  jge .LBB0_2
  add rax, rcx
  sar rax, 32
  ret
.LBB0_2:
  shl rax, 32
  add rax, rcx
  sar rax, 32
  shr rsi
  add rax, rsi
  ret

GCC -O3:

edge_to_rank(edge, unsigned long):
  mov rax, rdi
  sar rax, 32
  cmp edi, eax
  jge .L2
  dec eax
  ret
.L2:
  shr rsi
  lea eax, [rdi-1+rsi]
  ret

ICC -O3:

edge_to_rank(edge, unsigned long):
        mov       rax, rdi                                      #9.21
        movsxd    rdx, edi                                      #9.48
        shr       rsi, 1                                        #9.56
        shr       rax, 32                                       #9.21
        movsxd    rcx, eax                                      #9.35
        dec       rcx                                           #9.35
        cmp       edi, eax                                      #9.21
        lea       r8, QWORD PTR [-1+rdx+rsi]                    #9.56
        cmovl     r8, rcx                                       #9.21
        mov       eax, r8d                                      #9.21
        ret     

I think Clang trunk generates suboptimal code. GCC's output is small and nice (though perhaps the three-operand LEA should not be used), but it still has a branch...

ICC generates the code I would expect - branchless using cmov.

https://godbolt.org/z/z6uzsn

davidbolvansky commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#41350

rotateright commented 5 years ago

> Any advice on how to avoid / break the infinite loop?

I don't see a way to do this in general (DAGCombiner or X86ISelLowering) because reversing the order would break a lot of existing folds. It might be possible to add a late and limited match for this pattern to X86ISelDAGToDAG.cpp (X86DAGToDAGISel class).

davidbolvansky commented 5 years ago

  // (add (shl X, C2), C1) -> shl (add X, C1 >> C2), C2
  // iff (C1 & ((1<<C2)-1)) == 0
  ConstantSDNode *N1C = isConstOrConstSplat(N1);
  if (N0.getOpcode() == ISD::SHL && N1C &&
      TLI.isDesirableToCommuteWithShift(N0.getNode(), Level)) {
    ConstantSDNode *ShiftAmt = isConstOrConstSplat(N0.getOperand(1));
    unsigned BitWidth = N1C->getAPIntValue().getBitWidth();
    if (ShiftAmt && ((N1C->getAPIntValue() &
         (APInt::getSignMask(ShiftAmt->getSExtValue()).zextOrTrunc(BitWidth) -
          1)) == 0)) {
      SDValue Shift = DAG.FoldConstantArithmetic(ISD::SRA, DL, VT, N1.getNode(),
                                                 N0.getOperand(1).getNode());
      SDValue Add = DAG.getNode(ISD::ADD, DL, VT, N0.getOperand(0), Shift);
      AddToWorklist(Add.getNode());
      return DAG.getNode(ISD::SHL, DL, VT, Add, N0.getOperand(1));
    }
  }

With

define i64 @ok1(i64 %x)  {
  %shl = shl i64 %x, 32
  %add = add i64 %shl, -4294967296
  ret i64 %add
}

Clang produces:

    leal    -1(%rdi), %eax
    shlq    $32, %rax

Sadly, with:

define i64 @ok2(i64 %x)  {
  %shl = shl i64 %x, 5
  %add = add i64 %shl, 32
  ret i64 %add
}

Clang/llc goes into an infinite loop (each pattern keeps reversing the other).

Any advice on how to avoid / break the infinite loop?

rotateright commented 5 years ago

Probably doesn't do anything for the motivating problem, but this 64-bit constant op:

  movabs rcx, -4294967296
  shl rax, 32
  add rax, rcx

...is bogus. We should be moving the 'add' before the 'shl' late in combining to get a 32-bit constant:

  leal -1(%rdi), %eax
  shlq $32, %rax

That likely requires carving out an exception using the isDesirableToCommuteWithShift() hook. Otherwise, we'll have opposing transforms.

https://rise4fun.com/Alive/f9a

topperc commented 5 years ago

I think we had decent IR at one point that should have gotten us close to the gcc output, but then canEvaluateSExtd got a little overexcited and turned everything into 64-bit, inserting the shl 32 and sar 32 to do so.

I think canEvaluateSExtd decided that because it found a truncate from i64 to i32 earlier in the computation, it was a good idea to promote everything to i64, even though the truncate had other users and we had to use shifts to still do the sign extension in the i64 value.

We might also need to teach InstCombine's visitTruncate to simplify (trunc (phi (sext X), Y)) to (phi X, (trunc Y)).

RKSimon commented 5 years ago

Yeah, we shouldn't be repeating code like that, although whether we end up with a branch or a cmov probably affects how/when we merge.

We have been finding cases where cmov isn't great if we have access to prof metadata that hints that it's not 50:50, so gcc