Closed: lattner closed this issue 14 years ago.
Evan fixed this a long time ago; I now get this inner loop:
    .align 4,0x90
LBB1_1: ## bb
    cvtsi2ss (%esi,%edx,4), %xmm1
    incl %edx
    cmpl %ecx, %edx
    addss %xmm1, %xmm0
    jne LBB1_1 ## bb
Is this done?
Looking at this now.
After some coalescing, we come to this:
bb:
    28 %reg1026
Now we try to coalesce away the copy at 64, except that the reg1028 and reg1036 live intervals conflict. If the ADDSSrr at 56 is commuted and its uses are updated:
    56 %reg1036
Then the copy is already coalesced away.
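To make the two-address constraint concrete: x86 addss overwrites its destination operand, so whichever value is tied to the destination survives the add and the other dies. A minimal sketch with hypothetical registers (not the actual dump above), in AT&T syntax:

    # Accumulator in %xmm1, freshly computed element in %xmm0.
    # Non-commuted: the result clobbers %xmm0, so a copy must
    # move it back into the accumulator register:
    addss %xmm1, %xmm0      # %xmm0 = %xmm0 + %xmm1
    movaps %xmm0, %xmm1     # restore the accumulator
    # Commuted: the accumulator is the destination, and the copy dies:
    addss %xmm0, %xmm1      # %xmm1 = %xmm1 + %xmm0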
woot, thanks evan.
I see this type of codegen deficiency all the time. This is worth further investigation.
I enabled this by default: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20071224/056596.html
It is either a wash or a pretty significant win on the programs it helps. The only case it slows down is shootout/fib2, which appears to be an alignment issue or something similar. That said, fib2 suffers from the same "add not commuted" coalescing issue as the other programs. Its inner loop is now:
LBB1_2: # cond_false
    leal -2(%esi), %eax
    movl %eax, (%esp)
    call _fib
    addl %edi, %eax
    decl %esi
    cmpl $1, %esi
    movl %eax, %edi
    ja LBB1_2 # cond_false
The copy at the end of the loop would be gone if we had "addl %eax, %edi" in the body of the loop.
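For concreteness, a hand-edited sketch of the loop with the add commuted (not actual compiler output); the copy disappears because the accumulator %edi is now the tied destination:

LBB1_2: # cond_false
    leal -2(%esi), %eax
    movl %eax, (%esp)
    call _fib
    addl %eax, %edi         # commuted: accumulate into %edi directly
    decl %esi
    cmpl $1, %esi
    ja LBB1_2 # cond_false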
-Chris
Here's a further-reduced testcase:
unsigned NNTOT;
volatile float G;
void runcont (int *source) {
  int row = 0, neuron = 0;
  float thesum=0.0;
  do {
    thesum+=source[neuron];
  } while (++neuron<NNTOT);
  G=thesum;
}
It compiles to:
LBB1_1: # bb
    cvtsi2ss (%ecx,%edx,4), %xmm1
    incl %edx
    cmpl %eax, %edx
    addss %xmm0, %xmm1
    jae LBB1_3 # bb13
LBB1_2: # bb.bb_crit_edge
    movaps %xmm1, %xmm0
    jmp LBB1_1 # bb
This definitely requires commuting the addss to coalesce away the copy.
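For reference, here is a hand-edited sketch (not actual compiler output) of what the loop becomes once the addss is commuted and the copy coalesces; the critical-edge block vanishes along with the copy:

LBB1_1: # bb
    cvtsi2ss (%ecx,%edx,4), %xmm1
    incl %edx
    cmpl %eax, %edx
    addss %xmm1, %xmm0      # commuted: %xmm0 stays the accumulator
    jb LBB1_1 # bb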
This patch: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20071224/056589.html is a hack that avoids the split edge on single-BB loops. However, the copy-coalescing problem remains. With -backedge-hack, we now get:
LBB1_2: # bb17
    cvtsi2ss (%ecx,%esi,4), %xmm0
    mulss (%edx,%esi,4), %xmm0
    incl %esi
    cmpl %eax, %esi
    addss %xmm1, %xmm0
    movaps %xmm0, %xmm1
    jb LBB1_2 # bb17
instead of:
LBB1_2: # bb17
    cvtsi2ss (%ecx,%esi,4), %xmm0
    mulss (%edx,%esi,4), %xmm0
    incl %esi
    cmpl %eax, %esi
    addss %xmm1, %xmm0
    jae LBB1_5 # bb22
LBB1_3: # bb17.bb17_crit_edge
    movaps %xmm0, %xmm1
    jmp LBB1_2 # bb17
I think the coalescer could eliminate the copy if it commuted addss.
FWIW, GCC compiles the inner loop into:
L5:
    cvtsi2ss (%esi,%edx,4), %xmm0
    incl %ecx
    cmpl %ebx, %ecx
    mulss (%eax,%edx,4), %xmm0
    movl %ecx, %edx
    addss %xmm0, %xmm1
    jne L5
Extended Description
The significant (negative) performance delta on Freebench/neural on x86 is due to a backedge copy not getting coalesced. The backedge critical edge is then split to hold the copy, and the critical-edge block is placed very badly. This makes the code run significantly slower than the GCC code, which doesn't make this mistake. Here's a reduced testcase:
unsigned NNTOT;
float **Tmatrix;
volatile float G;
int runcont (signed int source[], signed int dest[]) {
  int row = 0, neuron;
// for(row=0; row<NNTOT; row++) {
    float thesum=0.0;
    for(neuron=0; neuron<NNTOT; neuron++)
      thesum+=Tmatrix[row][neuron]*source[neuron];
    G=thesum;
//}
}
If you enable the two commented-out lines, you'll get the following really bad code:
LBB1_4: # bb1
    cvtsi2ss (%ecx,%edi,4), %xmm0
    mulss (%esi,%edi,4), %xmm0
    incl %edi
    cmpl %eax, %edi
    addss %xmm1, %xmm0
    jb LBB1_8 # bb1.bb1_crit_edge
LBB1_5: # bb23
    ...
LBB1_8: # bb1.bb1_crit_edge
    movaps %xmm0, %xmm1
    jmp LBB1_4 # bb1
If the lines are left commented out, you get the same bad code, but the split-edge block happens to land in a better place by luck.
-Chris