Open weiguozhi opened 8 years ago
Currently looking at some approaches that would reduce CR-logical instruction use that will likely improve this test case.
Carrot,
the fact that we have a function that returns a bool value is a property of the reduced testcase, or is that true in the original code as well?
Yes, the original function returns bool value.
Carrot,
the fact that we have a function that returns a bool value is a property of the reduced testcase, or is that true in the original code as well?
The early patch for dag combine changes: https://reviews.llvm.org/D25221
This patch can be extended, but I post it to get some feedback before extending it.
https://reviews.llvm.org/D25200
patch for InstCombine changes. I will shortly post another patch for changes in PPC dag combine.
OK. I'll work on this to address our problem. For anything left, I will open a new bug report.
Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.
I was just looking at FoldAndOfICmps(); surprised we don't have this fold already. Is there a separate bug report for this? If not, should I file one?
Thanks Sanjay for looking. If you plan to make the change, I can open a separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree with the overall solution. If they don't we can open a bugzilla item for instcombine, so it is not forgotten.
I think we're going to need multiple fixes to get this right in instcombine, so I appreciate any bug reports that you'd like to open.
Here are a few possibilities for the patterns that we are missing:
bool andcmps1(int x, int y) { return ((x & 16) != 0) && ((y & 8) == 0); // logical-and }
bool andcmps2(int x, int y) { return ((x & 16) != 0) & ((y & 8) == 0); // bitwise-and }
bool andcmps3(int x, int y) { return ((x & 32768) != 0) && ((y & 8) == 0); // special case constant }
$ ./clang -O1 bittests.c -S -o - -emit-llvm ; ModuleID = 'bittests.c' source_filename = "bittests.c" target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx10.11.0"
define zeroext i1 @andcmps1(i32 %x, i32 %y) local_unnamed_addr #0 { entry: %and = and i32 %x, 16 %and1 = and i32 %y, 8 %cmp2 = icmp eq i32 %and1, 0 %not.cmp = icmp ne i32 %and, 0 %0 = and i1 %not.cmp, %cmp2 ret i1 %0 }
define zeroext i1 @andcmps2(i32 %x, i32 %y) local_unnamed_addr #0 { entry: %and = lshr i32 %x, 4 %and1 = lshr i32 %y, 3 %and1.lobit = and i32 %and1, 1 %0 = xor i32 %and1.lobit, 1 %and4 = and i32 %0, %and %tobool = icmp ne i32 %and4, 0 ret i1 %tobool }
define zeroext i1 @andcmps3(i32 %x, i32 %y) local_unnamed_addr #0 { entry: %0 = trunc i32 %x to i16 %cmp = icmp slt i16 %0, 0 %and1 = and i32 %y, 8 %cmp2 = icmp eq i32 %and1, 0 %1 = and i1 %cmp, %cmp2 ret i1 %1 }
Hal also agreed to this. I will start working on patches for this.
Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.
I was just looking at FoldAndOfICmps(); surprised we don't have this fold already. Is there a separate bug report for this? If not, should I file one?
Thanks Sanjay for looking. If you plan to make the change, I can open a separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree with the overall solution. If they don't we can open a bugzilla item for instcombine, so it is not forgotten.
Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.
I was just looking at FoldAndOfICmps(); surprised we don't have this fold already. Is there a separate bug report for this? If not, should I file one?
Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.
For the problem in comment 3, I am thinking of the following solution. I post it here to get some feedback
The IR currently generated for the testcase is the following:
define zeroext i1 @Z3fooRK3PB2S1(%class.PB2 nocapture readonly dereferenceable(16) %s_a, %class.PB2 nocapture readonly dereferenceable(16) %s_b) local_unnamed_addr #0 { entry: %arrayidx.i6 = bitcast %class.PB2 %s_a to i32 %0 = load i32, i32 %arrayidx.i6, align 8, !tbaa !1 %and.i = and i32 %0, 8 %arrayidx.i37 = bitcast %class.PB2 %s_b to i32 %1 = load i32, i32 %arrayidx.i37, align 8, !tbaa !1 %and.i4 = and i32 %1, 8 %cmp.i5 = icmp ne i32 %and.i4, 0 %cmptmp = icmp eq i32 %and.i, 0 %cmp = and i1 %cmptmp, %cmp.i5 ret i1 %cmp }
I think it is reasonable to expect InstCombine to recognize that it can remove all icmp instruction and replace them with the following:
%cmp = icmp ult i32 %and.i, %and.i4
This is not enough for us though. During lowering for this code we generate a setcc for which we generate an isel. The next step is to look at the following during lowering:
1) The user of setcc is not a branch. This value is returned to the caller. So we need the result of comparison in GPR. 2) This is a setcc with a ult condition code, applied to two i32 values. So we can instead subtract the values, and look at the sign bit (so subtract and shift). This is what GCC does when they encounter a < operation during translation of gimple to RTL. (I talked to the gcc developer who implemented this).
Does this make sense?
Here the problem is not related to cr instruction. Instead we are generating the following two instructions:
c: fe e8 63 54 rlwinm r3,r3,29,3,31 10: e0 07 63 78 clrldi r3,r3,63
While gcc generates only one
c: e2 ef 63 78 rldicl r3,r3,61,63
Thanks you, Ehsan, I also noticed this pattern in another case, but I failed to simplify it.
You're welcome. This is a really good code testcase as it reveals multiple problems.
Here the problem is not related to cr instruction. Instead we are generating the following two instructions:
c: fe e8 63 54 rlwinm r3,r3,29,3,31 10: e0 07 63 78 clrldi r3,r3,63
While gcc generates only one
c: e2 ef 63 78 rldicl r3,r3,61,63
Thanks you, Ehsan, I also noticed this pattern in another case, but I failed to simplify it.
This sequence has a more serious problem:
bool foo(const PB2& s_a, const PB2& s_b) { return s_a.has_pb1() < s_b.has_pb1(); }
LLVM code:
0: 00 00 84 88 lbz r4,0(r4) 4: 00 00 63 88 lbz r3,0(r3) 8: fe e8 84 54 rlwinm r4,r4,29,3,31 c: 38 07 63 54 rlwinm r3,r3,0,28,28 10: 00 00 83 2c cmpwi cr1,r3,0 14: 01 00 84 70 andi. r4,r4,1 18: 01 00 60 38 li r3,1 1c: c2 09 86 4e crnand 4cr5+lt,4cr1+eq,gt 20: 1e 1d 60 7c isel r3,0,r3,20 24: 20 00 80 4e blr
GCC code:
0: 00 00 23 81 lwz r9,0(r3) 4: 00 00 64 80 lwz r3,0(r4) 8: e2 ef 29 79 rldicl r9,r9,61,63 c: e2 ef 63 78 rldicl r3,r3,61,63 10: 50 48 63 7c subf r3,r3,r9 14: e0 0f 63 78 rldicl r3,r3,1,63 18: 20 00 80 4e blr
https://reviews.llvm.org/D24924
for issue in comment 1
I have reduced the function foo even further, to the shortest code, in which I can find inefficiency. Once I fix the first issue, I will add more code to it, until I find the next issue to fix.
The smallest problematic code is this:
bool foo(const PB2& s_a, const PB2& s_b) {
return s_a.has_pb1() != s_b.has_pb1();
}
Here the problem is not related to cr instruction. Instead we are generating the following two instructions:
c: fe e8 63 54 rlwinm r3,r3,29,3,31 10: e0 07 63 78 clrldi r3,r3,63
While gcc generates only one
c: e2 ef 63 78 rldicl r3,r3,61,63
assigned to @nemanjai
Extended Description
The source code is simplified from a protocol buffer test case,
class PB1{ public: bool has_f1() const; int f1() const;
bool has_f2() const; unsigned long long f2() const;
bool has_f3() const; unsigned long long f3() const;
unsigned int _hasbits[1]; unsigned long long f2; unsigned long long f3; int f1_; };
class PB2{ public: bool has_pb1() const; const PB1& pb1() const;
unsigned int _hasbits[1]; PB1* pb1_; };
inline bool PB1::has_f1() const { return (_hasbits[0] & 0x00000004u) != 0; } inline int PB1::f1() const { return f1_; } inline bool PB1::has_f2() const { return (_hasbits[0] & 0x00000008u) != 0; } inline unsigned long long PB1::f2() const { return f2_; } inline bool PB1::has_f3() const { return (_hasbits[0] & 0x00000010u) != 0; } inline unsigned long long PB1::f3() const { return f3_; } inline bool PB2::has_pb1() const { return (_hasbits[0] & 0x00000008u) != 0; } inline const PB1& PB2::pb1() const { return *pb1_; }
bool foo(const PB2& s_a, const PB2& s_b) { if (s_a.has_pb1() != s_b.has_pb1()) { return s_a.has_pb1() < s_b.has_pb1(); } if (s_a.has_pb1()) { const PB1& v_a = s_a.pb1(); const PB1& v_b = s_b.pb1(); if (v_a.has_f2() != v_b.has_f2()) { return v_a.has_f2() < v_b.has_f2(); } if (v_a.has_f2() && (v_a.f2() != v_b.f2())) { return v_a.f2() < v_b.f2(); } if (v_a.has_f3() != v_b.has_f3()) { return v_a.has_f3() < v_b.has_f3(); } if (v_a.has_f3() && (v_a.f3() != v_b.f3())) { return v_a.f3() < v_b.f3(); } if (v_a.has_f1() != v_b.has_f1()) { return v_a.has_f1() < v_b.has_f1(); } if (v_a.has_f1() && (v_a.f1() != v_b.f1())) { return v_a.f1() < v_b.f1(); } } return false; }
The function foo contains many comparisons, when compiled with options -m64 -O2 -mvsx -mcpu=power8
LLVM generates code with many slow cr manipulation instructions, following is first part of them:
Z3fooRK3PB2S1: # @Z3fooRK3PB2S1 .Lfunc_begin0:
BB#0: # %entry
BB#1: # %if.then
.LBB0_2: # %if.end crxor 22, 22, 22 bc 4, 20, .LBB0_14
BB#3: # %if.then9
BB#4: # %if.then17
It causes 40% slower than gcc. It should be handled by PPCBoolRetToInt.cpp.