[ppc] redundant extsw and sldi instructions

weiguozhi commented 9 years ago


Bugzilla Link	25581
Version	trunk
OS	Linux
CC	@echristo,@hfinkel

Extended Description

The following is a hot function from 471.omnetpp

void cMessageHeap::shiftup(int from)                                                                             
{
    // restores heap structure (in a sub-heap)                                                                   
    int i,j;
    cMessage *temp;                                                                                              

    i=from;
    while ((j=2*i) <= n)                                                                                         
    {                                                                                                            
        if (j<n && (*h[j] > *h[j+1]))   //direction                                                              
            j++;
        if (*h[i] > *h[j])  //is change necessary?
        {
            temp=h[j];                                                                                           
            (h[j]=h[i])->heapindex=j;
            (h[i]=temp)->heapindex=i;                                                                            
            i=j;
        }
        else
            break;
    }
}

When compiled with options -fno-strict-aliasing -O2 -m64 -mvsx -mcpu=power8 I got:

_ZN12cMessageHeap7shiftupEi:            # @&#8203;_ZN12cMessageHeap7shiftupEi
.Lfunc_begin14:
# BB#0:                                 # %entry
        lwz 7, 64(3)
        slwi 8, 4, 1 
        cmpw     8, 7 
        bgtlr 0
# BB#1:                                 # %while.body.lr.ph
        crxor 20, 20, 20
        li 5, 144
        creqv 21, 21, 21
        .align  4
.LBB14_2:                               # %while.body
                                        # =>This Inner Loop Header: Depth=1
        mr 6, 4 
        cmpw     8, 7 
        bge 0, .LBB14_9
# BB#3:                                 # %land.lhs.true
                                        #   in Loop: Header=BB14_2 Depth=1
        ori 4, 8, 1                  // A
        extsw 9, 8                   // B
        ld 7, 56(3)
        cror 1, 20, 20
        extsw 10, 4                  // C
        sldi 9, 9, 3                 // D
        sldi 11, 10, 3               // E
        ldx 10, 7, 9 
        ldx 9, 7, 11
        lxsdx 0, 10, 5
        lxsdx 1, 9, 5 
        xscmpudp 1, 0, 1 
        blt      1, .LBB14_8
        ...

In the source code note that j comes from (i*2), so the lowest bit is 0, and then we have following conclusion:

1) j+1 equals to j|1, so instruction A actually computes j+1

2) extsw(j+1) equals to extsw(j) + 1

3) (j+1) << 3 equals to j<<3 + 8

So instructions ABCDE can be replaced with shorter code sequence

    extsw 9, 8
    sldi 9, 9, 3
    addi 11, 9, 8

llvmbot commented 7 years ago

To be clear this is mostly that the code isn't great since it still has the two extsw/sldi with the addition now of some isel, rather than the GEP offset pass not working.

If you read bug description and comment 3, GEP offset removes other pairs of extsw/sldi. Bug description is about a very specific problem and did not extend to any potential improvement in the code.

echristo commented 7 years ago

To be clear this is mostly that the code isn't great since it still has the two extsw/sldi with the addition now of some isel, rather than the GEP offset pass not working.

echristo commented 7 years ago

Definitely doesn't look fixed:

10072360: a0 00 6c e9 ld 11, 160(12) 10072364: a0 00 be eb ld 29, 160(30) 10072368: 40 e8 2b 7c cmpld 11, 29 1007236c: 5e 30 67 7d isel 11, 7, 6, 1 10072370: 00 00 0b 28 cmplwi 11, 0 10072374: 9e 20 8a 7c isel 4, 10, 4, 2 10072378: b4 07 0a 7d extsw 10, 8 1007237c: b4 07 8b 7c extsw 11, 4 10072380: 24 1f 4a 79 sldi 10, 10, 3 10072384: 24 1f 6b 79 sldi 11, 11, 3 10072388: 2a 50 89 7d ldx 12, 9, 10 1007238c: 6a 58 69 7d ldux 11, 9, 11 10072390: 98 2c 0c 7c lxsdx 0, 12, 5 10072394: 98 2c 2b 7c lxsdx 1, 11, 5 10072398: 18 09 00 f0 xscmpudp 0, 0, 1 1007239c: 58 00 80 41 bt 0, .+88

just different. Investigating.

llvmbot commented 8 years ago

patch http://reviews.llvm.org/D18562 committed

llvmbot commented 8 years ago

I ran the test suite with the three transformations above enable at O3 vs. O2. The difference in compile time is negligible. Will submit the patch to make the change.

hfinkel commented 8 years ago

SeparateConstOffsetFromGEP is controlled in PPCPassConfig::addIRPasses(). Currently it requires O3. We can turn it on, with no impact on other targets. We need to discuss about all three passes in PPCPassConfig::addIRPasses() whether we want them to be turned on at O2 or not. (I think LICM should be safe to move to O2, I have more doubts about other two).

We have SeparateConstOffsetFromGEP, EarlyCSE (which is designed to be cheap), and LICM. All of these can be used at -O2. I have no problem with that. I put them at -O3 only because I was being conservative with SeparateConstOffsetFromGEP (which was a new optimization at the time). Now that we have a use case for them to be at -O2, please go ahead with enabling them at -O2.

Regarding developing a new widening transformations for induction variables in non-canonical loops, that certainly could be useful.

llvmbot commented 8 years ago

SeparateConstOffsetFromGEP is controlled in PPCPassConfig::addIRPasses(). Currently it requires O3. We can turn it on, with no impact on other targets. We need to discuss about all three passes in PPCPassConfig::addIRPasses() whether we want them to be turned on at O2 or not. (I think LICM should be safe to move to O2, I have more doubts about other two).

llvmbot commented 8 years ago

What happens at O3 is that a transformation called SeparateConstOffsetFromGEP in llc, commons some part of address for h[j] and h[j+1]. The optimization will work when we have GEPs with constant distance from each other.

This is some sort of commoning. Can't we turn that on at O2?

We can look at this from another angle as well. The code that we are currently discussing is corresponding to accesses to h[j] and h[j+1] in the first if statement.

There is a subsequent if statement that checks h[i] and h[j]. The optimization above does not work here because i and j do not have a constant distance from each other. So even in O3 code we have two extsw and two sldi for the second if statement.

We can think of an algorithm that can understand both i and j are indices to an array and promote them to 64 bit before the loop. This will remove all extsw instns inside the loop. I suspect that we already do this for canonicalized loops, so this will be useful only for loops that cannot be canonicalized or code that is not inside a loop.

There might be benefit in following both paths.

hfinkel commented 8 years ago

Is this a backend or a mid-end issue? Can you please provide some corresponding IR?

FYI: I can reproduce, but I'll also note that the problem does not appear when compiling with -O3.

hfinkel commented 8 years ago

Is this a backend or a mid-end issue? Can you please provide some corresponding IR?

llvm / llvm-project

[ppc] redundant extsw and sldi instructions #25955

Extended Description