davidbolvansky opened 5 years ago
mentioned in issue llvm/llvm-bugzilla-archive#41350
Any advice on how to avoid / break the infinite loop?
I don't see a way to do this in general (DAGCombiner or X86ISelLowering) because reversing the order would break a lot of existing folds. It might be possible to add a late and limited match for this pattern to X86ISelDAGToDAG.cpp (X86DAGToDAGISel class).
// (add (shl X, C2), C1) -> (shl (add X, C1 >> C2), C2)
// iff (C1 & ((1 << C2) - 1)) == 0
ConstantSDNode *N1C = isConstOrConstSplat(N1);
if (N0.getOpcode() == ISD::SHL && N1C &&
    TLI.isDesirableToCommuteWithShift(N0.getNode(), Level)) {
  ConstantSDNode *ShiftAmt = isConstOrConstSplat(N0.getOperand(1));
  unsigned BitWidth = N1C->getAPIntValue().getBitWidth();
  // The low C2 bits of C1 must be zero so that C1 >> C2 is exact.
  if (ShiftAmt &&
      (N1C->getAPIntValue() &
       APInt::getLowBitsSet(BitWidth, ShiftAmt->getZExtValue())) == 0) {
    // C1 >> C2, done arithmetically so negative constants stay negative.
    SDValue Shift = DAG.FoldConstantArithmetic(ISD::SRA, DL, VT, N1.getNode(),
                                               N0.getOperand(1).getNode());
    SDValue Add = DAG.getNode(ISD::ADD, DL, VT, N0.getOperand(0), Shift);
    AddToWorklist(Add.getNode());
    return DAG.getNode(ISD::SHL, DL, VT, Add, N0.getOperand(1));
  }
}
With:
define i64 @ok1(i64 %x) {
  %shl = shl i64 %x, 32
  %add = add i64 %shl, -4294967296
  ret i64 %add
}
Clang produces:
leal -1(%rdi), %eax
shlq $32, %rax
Sadly, with:
define i64 @ok2(i64 %x) {
  %add = shl i64 %x, 5
  %shl = add i64 %add, 32
  ret i64 %shl
}
Clang/llc goes into an infinite loop (each pattern keeps reversing the other).
Any advice on how to avoid / break the infinite loop?
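For illustration (my reconstruction of the cycle, not compiler output): the proposed fold and DAGCombiner's existing (shl (add x, c1), c2) -> (add (shl x, c2), c1 << c2) fold keep undoing each other on @ok2:

// proposed fold: (add (shl X, 5), 32) -> (shl (add X, 1), 5)
// existing fold: (shl (add X, 1), 5)  -> (add (shl X, 5), 32)
// ...and around again forever, unless one direction is gated off
// (e.g. via the isDesirableToCommuteWithShift() hook mentioned below).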
Probably doesn't do anything for the motivating problem, but this 64-bit constant op:
movabs rcx, -4294967296
shl rax, 32
add rax, rcx
...is bogus. We should be moving the 'add' before the 'shl' late in combining to get a 32-bit constant:
leal -1(%rdi), %eax
shlq $32, %rax
That likely requires carving out an exception using the isDesirableToCommuteWithShift() hook. Otherwise, we'll have opposing transforms.
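To make that concrete, here is a rough sketch of what such an exception might look like, assuming an X86TargetLowering override of the hook (illustrative only; the simm32 profitability check and exact placement are my guesses, not an actual patch):

// Hypothetical X86 override of the generic hook. Return false (i.e. refuse to
// commute the add out of the shift) when the add constant is a simm32 but the
// shifted constant would not be, since the commuted form would need a movabs.
bool X86TargetLowering::isDesirableToCommuteWithShift(
    const SDNode *N, CombineLevel Level) const {
  assert((N->getOpcode() == ISD::SHL || N->getOpcode() == ISD::SRA ||
          N->getOpcode() == ISD::SRL) &&
         "Expected shift op");
  SDValue Add = N->getOperand(0);
  if (Add.getOpcode() != ISD::ADD)
    return true;
  auto *AddC = dyn_cast<ConstantSDNode>(Add.getOperand(1));
  auto *ShiftC = dyn_cast<ConstantSDNode>(N->getOperand(1));
  if (!AddC || !ShiftC)
    return true;
  // Keep (shl (add X, C1), C2) if C1 is cheap to materialize but C1 << C2
  // is not.
  unsigned ShAmt = ShiftC->getZExtValue();
  const APInt &C1 = AddC->getAPIntValue();
  if (C1.isSignedIntN(32) && !C1.shl(ShAmt).isSignedIntN(32))
    return false;
  return true;
}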
I think we had decent IR at one point that should have gotten us close to the GCC output, but then canEvaluateSExtd got a little overexcited and turned everything into 64-bit, inserting the shl 32 / sar 32 pair to do so.
I think canEvaluateSExtd decided that, because it found a truncate from i64 to i32 earlier in the computation, it was a good idea to promote everything to i64, even though the truncate had other users and we still had to use shifts to do the sign extension on the i64 value.
We might also need to teach InstCombine's visitTrunc to simplify (trunc (phi (sext X), Y)) to (phi X, (trunc Y)).
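For reference, a rough sketch of that trunc-of-phi idea as a standalone helper (the name narrowTruncOfPhi, its placement, and the profitability check are all hypothetical; a real patch would live in visitTrunc and go through InstCombine's worklist/replacement machinery):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Narrow (trunc (phi (sext X), Y, ...)) to (phi X, (trunc Y), ...):
// sext arms become free, other arms get a trunc in their incoming block.
static PHINode *narrowTruncOfPhi(TruncInst &Trunc, IRBuilder<> &Builder) {
  auto *Phi = dyn_cast<PHINode>(Trunc.getOperand(0));
  if (!Phi || !Phi->hasOneUse())
    return nullptr;
  Type *DestTy = Trunc.getType();

  // Require at least one incoming (sext X) where X already has the narrow
  // type, so the transform removes an extension instead of just moving code.
  bool HasFreeArm = false;
  for (Value *In : Phi->incoming_values())
    if (auto *SExt = dyn_cast<SExtInst>(In))
      if (SExt->getOperand(0)->getType() == DestTy)
        HasFreeArm = true;
  if (!HasFreeArm)
    return nullptr;

  PHINode *NarrowPhi = PHINode::Create(DestTy, Phi->getNumIncomingValues(),
                                       Phi->getName() + ".narrow", Phi);
  for (unsigned i = 0, e = Phi->getNumIncomingValues(); i != e; ++i) {
    Value *In = Phi->getIncomingValue(i);
    BasicBlock *BB = Phi->getIncomingBlock(i);
    auto *SExt = dyn_cast<SExtInst>(In);
    if (SExt && SExt->getOperand(0)->getType() == DestTy) {
      NarrowPhi->addIncoming(SExt->getOperand(0), BB);
    } else {
      // Truncate the other incoming value at the end of its incoming block.
      Builder.SetInsertPoint(BB->getTerminator());
      NarrowPhi->addIncoming(Builder.CreateTrunc(In, DestTy), BB);
    }
  }
  return NarrowPhi; // Caller replaces all uses of Trunc with this phi.
}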
Yeah, we shouldn't be repeating code like that, although whether we end up with a branch or a cmov probably affects how/when we merge.
We have been finding cases where cmov isn't great if we have access to profile metadata that hints that it's not 50:50, so gcc
Extended Description
Clang -O3:
GCC -O3:
ICC -O3:
I think Clang trunk generates suboptimal code. GCC's output is small and nice (though maybe the three-operand LEA should not be used), but it uses a branch...
ICC generates the code I would expect: branchless, using cmov.
https://godbolt.org/z/z6uzsn