Closed: lattner closed this issue 14 years ago.
Evan fixed this a long time ago; I now get this inner loop:
    .align 4,0x90
LBB1_1: ## bb
    cvtsi2ss (%esi,%edx,4), %xmm1
    incl %edx
    cmpl %ecx, %edx
    addss %xmm1, %xmm0
    jne LBB1_1 ## bb
Is this done?
Looking at this now.
After some coalescing, we come to this:
bb:
    28 %reg1026
Now we try to coalesce away the copy at 64, except that the reg1028 and reg1036 live intervals conflict. If the ADDSSrr at 56 is commuted and its uses are updated:
    56 %reg1036
Then the copy is already coalesced away.
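To make the two-address constraint concrete: x86 addss overwrites its destination operand, so whichever value is tied to the destination survives the add and the other dies. A minimal sketch with hypothetical registers (not the actual dump above), in AT&T syntax:

    # Accumulator in %xmm1, freshly computed element in %xmm0.
    # Non-commuted: the result clobbers %xmm0, so a copy must
    # move it back into the accumulator register:
    addss %xmm1, %xmm0      # %xmm0 = %xmm0 + %xmm1
    movaps %xmm0, %xmm1     # restore the accumulator
    # Commuted: the accumulator is the destination, and the copy dies:
    addss %xmm0, %xmm1      # %xmm1 = %xmm1 + %xmm0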
woot, thanks evan.
I see this type of codegen deficiency all the time. This is worth further investigation.
I enabled this by default: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20071224/056596.html
It is either a wash or a pretty significant win on the programs it helps. The only case it slows down is shootout/fib2, which appears to be an alignment issue or something similar. That said, fib2 suffers from the same "add not commuted" coalescing issue as the other programs. Its inner loop is now:
LBB1_2: # cond_false
    leal -2(%esi), %eax
    movl %eax, (%esp)
    call _fib
    addl %edi, %eax
    decl %esi
    cmpl $1, %esi
    movl %eax, %edi
    ja LBB1_2 # cond_false
The copy at the end of the loop would be gone if we had "addl %eax, %edi" in the body of the loop.
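For concreteness, a hand-edited sketch of the loop with the add commuted (not actual compiler output); the copy disappears because the accumulator %edi is now the tied destination:

LBB1_2: # cond_false
    leal -2(%esi), %eax
    movl %eax, (%esp)
    call _fib
    addl %eax, %edi         # commuted: accumulate into %edi directly
    decl %esi
    cmpl $1, %esi
    ja LBB1_2 # cond_false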
-Chris
Here's a further-reduced testcase:
unsigned NNTOT;
volatile float G;
void runcont (int *source) {
  int row = 0, neuron = 0;
  float thesum=0.0;
  do {
    thesum+=source[neuron];
  } while (++neuron<NNTOT);
  G=thesum;
}
It compiles to:
LBB1_1: # bb
    cvtsi2ss (%ecx,%edx,4), %xmm1
    incl %edx
    cmpl %eax, %edx
    addss %xmm0, %xmm1
    jae LBB1_3 # bb13
LBB1_2: # bb.bb_crit_edge
    movaps %xmm1, %xmm0
    jmp LBB1_1 # bb
This definitely requires commuting the addss to coalesce away the copy.
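For reference, here is a hand-edited sketch (not actual compiler output) of what the loop becomes once the addss is commuted and the copy coalesces; the critical-edge block vanishes along with the copy:

LBB1_1: # bb
    cvtsi2ss (%ecx,%edx,4), %xmm1
    incl %edx
    cmpl %eax, %edx
    addss %xmm1, %xmm0      # commuted: %xmm0 stays the accumulator
    jb LBB1_1 # bb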
This patch: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20071224/056589.html is a hack that avoids the split edge on single-BB loops. However, the copy-coalescing problem remains. With -backedge-hack, we now get:
LBB1_2: # bb17
    cvtsi2ss (%ecx,%esi,4), %xmm0
    mulss (%edx,%esi,4), %xmm0
    incl %esi
    cmpl %eax, %esi
    addss %xmm1, %xmm0
    movaps %xmm0, %xmm1
    jb LBB1_2 # bb17
instead of:
LBB1_2: # bb17
    cvtsi2ss (%ecx,%esi,4), %xmm0
    mulss (%edx,%esi,4), %xmm0
    incl %esi
    cmpl %eax, %esi
    addss %xmm1, %xmm0
    jae LBB1_5 # bb22
LBB1_3: # bb17.bb17_crit_edge
    movaps %xmm0, %xmm1
    jmp LBB1_2 # bb17
I think the coalescer could eliminate the copy if it commuted addss.
FWIW, GCC compiles the inner loop into:
L5:
    cvtsi2ss (%esi,%edx,4), %xmm0
    incl %ecx
    cmpl %ebx, %ecx
    mulss (%eax,%edx,4), %xmm0
    movl %ecx, %edx
    addss %xmm0, %xmm1
    jne L5
Extended Description
The significant (negative) performance delta on Freebench/neural on x86 is due to a backedge copy not getting coalesced. The backedge critical edge is then split to hold the copy, and the critical-edge block is placed very badly. This makes the code run significantly slower than the GCC code, which doesn't make this mistake. Here's a reduced testcase:
unsigned NNTOT;
float **Tmatrix;
volatile float G;
int runcont (signed int source[], signed int dest[]) {
  int row = 0, neuron;
// for(row=0; row<NNTOT; row++) {
    float thesum=0.0;
    for(neuron=0; neuron<NNTOT; neuron++)
      thesum+=Tmatrix[row][neuron]*source[neuron];
    G=thesum;
//}
}
If you enable the two commented-out lines, you'll get the following really bad code:
LBB1_4: # bb1
    cvtsi2ss (%ecx,%edi,4), %xmm0
    mulss (%esi,%edi,4), %xmm0
    incl %edi
    cmpl %eax, %edi
    addss %xmm1, %xmm0
    jb LBB1_8 # bb1.bb1_crit_edge
LBB1_5: # bb23
    ...
LBB1_8: # bb1.bb1_crit_edge
    movaps %xmm0, %xmm1
    jmp LBB1_4 # bb1
If the lines are left commented out, you get the same bad code, but the split-edge block happens to land in a better place by luck.
-Chris