llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.41k stars 11.74k forks source link

[ppc] too many cr manipulation instructions are used for comparisons #29831

Open weiguozhi opened 8 years ago

weiguozhi commented 8 years ago
Bugzilla Link 30483
Version trunk
OS Linux
CC @hfinkel,@nemanjai,@rotateright

Extended Description

The source code is simplified from a protocol buffer test case,

class PB1{ public: bool has_f1() const; int f1() const;

bool has_f2() const; unsigned long long f2() const;

bool has_f3() const; unsigned long long f3() const;

unsigned int _hasbits[1]; unsigned long long f2; unsigned long long f3; int f1_; };

class PB2{ public: bool has_pb1() const; const PB1& pb1() const;

unsigned int _hasbits[1]; PB1* pb1_; };

inline bool PB1::has_f1() const { return (_hasbits[0] & 0x00000004u) != 0; } inline int PB1::f1() const { return f1_; } inline bool PB1::has_f2() const { return (_hasbits[0] & 0x00000008u) != 0; } inline unsigned long long PB1::f2() const { return f2_; } inline bool PB1::has_f3() const { return (_hasbits[0] & 0x00000010u) != 0; } inline unsigned long long PB1::f3() const { return f3_; } inline bool PB2::has_pb1() const { return (_hasbits[0] & 0x00000008u) != 0; } inline const PB1& PB2::pb1() const { return *pb1_; }

bool foo(const PB2& s_a, const PB2& s_b) { if (s_a.has_pb1() != s_b.has_pb1()) { return s_a.has_pb1() < s_b.has_pb1(); } if (s_a.has_pb1()) { const PB1& v_a = s_a.pb1(); const PB1& v_b = s_b.pb1(); if (v_a.has_f2() != v_b.has_f2()) { return v_a.has_f2() < v_b.has_f2(); } if (v_a.has_f2() && (v_a.f2() != v_b.f2())) { return v_a.f2() < v_b.f2(); } if (v_a.has_f3() != v_b.has_f3()) { return v_a.has_f3() < v_b.has_f3(); } if (v_a.has_f3() && (v_a.f3() != v_b.f3())) { return v_a.f3() < v_b.f3(); } if (v_a.has_f1() != v_b.has_f1()) { return v_a.has_f1() < v_b.has_f1(); } if (v_a.has_f1() && (v_a.f1() != v_b.f1())) { return v_a.f1() < v_b.f1(); } } return false; }

The function foo contains many comparisons, when compiled with options -m64 -O2 -mvsx -mcpu=power8

LLVM generates code with many slow cr manipulation instructions, following is first part of them:

Z3fooRK3PB2S1: # @​Z3fooRK3PB2S1 .Lfunc_begin0:

BB#0: # %entry

lbz 5, 0(3)
lbz 6, 0(4)
srwi 7, 5, 3
xor 5, 5, 6
srwi 6, 6, 3
andi. 7, 7, 1
srwi 5, 5, 3
crmove   20, 1
andi. 6, 6, 1
crmove   21, 1
andi. 5, 5, 1
bc 4, 1, .LBB0_2

BB#1: # %if.then

crandc 22, 21, 20
b .LBB0_14

.LBB0_2: # %if.end crxor 22, 22, 22 bc 4, 20, .LBB0_14

BB#3: # %if.then9

ld 3, 8(3)
ld 4, 8(4)
lwz 6, 0(3)
lwz 5, 0(4)
rlwinm 8, 6, 29, 31, 31
rlwinm 7, 6, 0, 28, 28
andi. 8, 8, 1
xor 7, 7, 5
srwi 8, 5, 3
srwi 7, 7, 3
crmove   20, 1
andi. 8, 8, 1
crmove   21, 1
andi. 7, 7, 1
bc 4, 1, .LBB0_5

BB#4: # %if.then17

crandc 22, 21, 20
b .LBB0_14
    ...

It causes 40% slower than gcc. It should be handled by PPCBoolRetToInt.cpp.

nemanjai commented 7 years ago

Currently looking at some approaches that would reduce CR-logical instruction use that will likely improve this test case.

weiguozhi commented 8 years ago

Carrot,

the fact that we have a function that returns a bool value is a property of the reduced testcase, or is that true in the original code as well?

Yes, the original function returns bool value.

llvmbot commented 8 years ago

Carrot,

the fact that we have a function that returns a bool value is a property of the reduced testcase, or is that true in the original code as well?

llvmbot commented 8 years ago

The early patch for dag combine changes: https://reviews.llvm.org/D25221

This patch can be extended, but I post it to get some feedback before extending it.

llvmbot commented 8 years ago

https://reviews.llvm.org/D25200

patch for InstCombine changes. I will shortly post another patch for changes in PPC dag combine.

llvmbot commented 8 years ago

OK. I'll work on this to address our problem. For anything left, I will open a new bug report.

rotateright commented 8 years ago

Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.

I was just looking at FoldAndOfICmps(); surprised we don't have this fold already. Is there a separate bug report for this? If not, should I file one?

Thanks Sanjay for looking. If you plan to make the change, I can open a separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree with the overall solution. If they don't we can open a bugzilla item for instcombine, so it is not forgotten.

I think we're going to need multiple fixes to get this right in instcombine, so I appreciate any bug reports that you'd like to open.

Here are a few possibilities for the patterns that we are missing:

bool andcmps1(int x, int y) { return ((x & 16) != 0) && ((y & 8) == 0); // logical-and }

bool andcmps2(int x, int y) { return ((x & 16) != 0) & ((y & 8) == 0); // bitwise-and }

bool andcmps3(int x, int y) { return ((x & 32768) != 0) && ((y & 8) == 0); // special case constant }

$ ./clang -O1 bittests.c -S -o - -emit-llvm ; ModuleID = 'bittests.c' source_filename = "bittests.c" target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-apple-macosx10.11.0"

define zeroext i1 @​andcmps1(i32 %x, i32 %y) local_unnamed_addr #​0 { entry: %and = and i32 %x, 16 %and1 = and i32 %y, 8 %cmp2 = icmp eq i32 %and1, 0 %not.cmp = icmp ne i32 %and, 0 %0 = and i1 %not.cmp, %cmp2 ret i1 %0 }

define zeroext i1 @​andcmps2(i32 %x, i32 %y) local_unnamed_addr #​0 { entry: %and = lshr i32 %x, 4 %and1 = lshr i32 %y, 3 %and1.lobit = and i32 %and1, 1 %0 = xor i32 %and1.lobit, 1 %and4 = and i32 %0, %and %tobool = icmp ne i32 %and4, 0 ret i1 %tobool }

define zeroext i1 @​andcmps3(i32 %x, i32 %y) local_unnamed_addr #​0 { entry: %0 = trunc i32 %x to i16 %cmp = icmp slt i16 %0, 0 %and1 = and i32 %y, 8 %cmp2 = icmp eq i32 %and1, 0 %1 = and i1 %cmp, %cmp2 ret i1 %1 }

llvmbot commented 8 years ago

Hal also agreed to this. I will start working on patches for this.

llvmbot commented 8 years ago

Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.

I was just looking at FoldAndOfICmps(); surprised we don't have this fold already. Is there a separate bug report for this? If not, should I file one?

Thanks Sanjay for looking. If you plan to make the change, I can open a separate bugzilla item for this. If not, let's wait to see if Hal/Kit agree with the overall solution. If they don't we can open a bugzilla item for instcombine, so it is not forgotten.

rotateright commented 8 years ago

Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.

I was just looking at FoldAndOfICmps(); surprised we don't have this fold already. Is there a separate bug report for this? If not, should I file one?

llvmbot commented 8 years ago

Talked to David Majnemer on IRC about the InstCombine part. He agreed to the change. If the setcc part is also looks fine I will post patches to fix this pattern.

llvmbot commented 8 years ago

For the problem in comment 3, I am thinking of the following solution. I post it here to get some feedback

The IR currently generated for the testcase is the following:

define zeroext i1 @​Z3fooRK3PB2S1(%class.PB2 nocapture readonly dereferenceable(16) %s_a, %class.PB2 nocapture readonly dereferenceable(16) %s_b) local_unnamed_addr #​0 { entry: %arrayidx.i6 = bitcast %class.PB2 %s_a to i32 %0 = load i32, i32 %arrayidx.i6, align 8, !tbaa !​1 %and.i = and i32 %0, 8 %arrayidx.i37 = bitcast %class.PB2 %s_b to i32 %1 = load i32, i32 %arrayidx.i37, align 8, !tbaa !​1 %and.i4 = and i32 %1, 8 %cmp.i5 = icmp ne i32 %and.i4, 0 %cmptmp = icmp eq i32 %and.i, 0 %cmp = and i1 %cmptmp, %cmp.i5 ret i1 %cmp }

I think it is reasonable to expect InstCombine to recognize that it can remove all icmp instruction and replace them with the following:

%cmp = icmp ult i32 %and.i, %and.i4

This is not enough for us though. During lowering for this code we generate a setcc for which we generate an isel. The next step is to look at the following during lowering:

1) The user of setcc is not a branch. This value is returned to the caller. So we need the result of comparison in GPR. 2) This is a setcc with a ult condition code, applied to two i32 values. So we can instead subtract the values, and look at the sign bit (so subtract and shift). This is what GCC does when they encounter a < operation during translation of gimple to RTL. (I talked to the gcc developer who implemented this).

Does this make sense?

llvmbot commented 8 years ago

Here the problem is not related to cr instruction. Instead we are generating the following two instructions:

c: fe e8 63 54 rlwinm r3,r3,29,3,31 10: e0 07 63 78 clrldi r3,r3,63

While gcc generates only one

c: e2 ef 63 78 rldicl r3,r3,61,63

Thanks you, Ehsan, I also noticed this pattern in another case, but I failed to simplify it.

You're welcome. This is a really good code testcase as it reveals multiple problems.

weiguozhi commented 8 years ago

Here the problem is not related to cr instruction. Instead we are generating the following two instructions:

c: fe e8 63 54 rlwinm r3,r3,29,3,31 10: e0 07 63 78 clrldi r3,r3,63

While gcc generates only one

c: e2 ef 63 78 rldicl r3,r3,61,63

Thanks you, Ehsan, I also noticed this pattern in another case, but I failed to simplify it.

llvmbot commented 8 years ago

This sequence has a more serious problem:

bool foo(const PB2& s_a, const PB2& s_b) { return s_a.has_pb1() < s_b.has_pb1(); }

LLVM code:

0: 00 00 84 88 lbz r4,0(r4) 4: 00 00 63 88 lbz r3,0(r3) 8: fe e8 84 54 rlwinm r4,r4,29,3,31 c: 38 07 63 54 rlwinm r3,r3,0,28,28 10: 00 00 83 2c cmpwi cr1,r3,0 14: 01 00 84 70 andi. r4,r4,1 18: 01 00 60 38 li r3,1 1c: c2 09 86 4e crnand 4cr5+lt,4cr1+eq,gt 20: 1e 1d 60 7c isel r3,0,r3,20 24: 20 00 80 4e blr

GCC code:

0: 00 00 23 81 lwz r9,0(r3) 4: 00 00 64 80 lwz r3,0(r4) 8: e2 ef 29 79 rldicl r9,r9,61,63 c: e2 ef 63 78 rldicl r3,r3,61,63 10: 50 48 63 7c subf r3,r3,r9 14: e0 0f 63 78 rldicl r3,r3,1,63 18: 20 00 80 4e blr

llvmbot commented 8 years ago

https://reviews.llvm.org/D24924

for issue in comment 1

llvmbot commented 8 years ago

I have reduced the function foo even further, to the shortest code, in which I can find inefficiency. Once I fix the first issue, I will add more code to it, until I find the next issue to fix.

The smallest problematic code is this:

bool foo(const PB2& s_a, const PB2& s_b) {

return s_a.has_pb1() != s_b.has_pb1();

}

Here the problem is not related to cr instruction. Instead we are generating the following two instructions:

c: fe e8 63 54 rlwinm r3,r3,29,3,31 10: e0 07 63 78 clrldi r3,r3,63

While gcc generates only one

c: e2 ef 63 78 rldicl r3,r3,61,63

weiguozhi commented 8 years ago

assigned to @nemanjai