llvmbot opened 14 years ago
Renamed the issue title because the original main issue now seems solved.
I have repeated this test with an almost-SVN LLVM 3.3. I have not taken timings again, so here I just show the asm produced. Below is a translation to the D language, containing just the interesting routine, the sparse matrix multiplication:
```d
void SparseCompRow_matmult(in size_t M, double* y, in double* val,
                           in size_t* row, in size_t* col, in double* x,
                           in size_t NUM_ITERATIONS) pure nothrow {
    foreach (immutable size_t reps; 0 .. NUM_ITERATIONS) {
        foreach (immutable size_t r; 0 .. M) {
            double sum = 0.0;
            foreach (immutable size_t i; row[r] .. row[r + 1])
                sum += x[col[i]] * val[i];
            y[r] = sum;
        }
    }
}
void main() {}
```
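For reference, a C sketch equivalent to the D routine above (my own transcription, not the exact SciMark2 source; SciMark2 proper uses `int` indices, while `size_t` is kept here to mirror the D signature):

```c
#include <stddef.h>

/* CSR (compressed sparse row) matrix-vector product. For each of the M
   rows, the nonzeros of row r live in val[row[r] .. row[r+1]-1], with
   their column indices in col[]. The outer reps loop only repeats the
   work for benchmarking; y[] is overwritten each repetition. */
void SparseCompRow_matmult(size_t M, double *y,
                           const double *val, const size_t *row,
                           const size_t *col, const double *x,
                           size_t NUM_ITERATIONS)
{
    for (size_t reps = 0; reps < NUM_ITERATIONS; reps++) {
        for (size_t r = 0; r < M; r++) {
            double sum = 0.0;
            for (size_t i = row[r]; i < row[r + 1]; i++)
                sum += x[col[i]] * val[i];  /* the inner loop shown in the asm */
            y[r] = sum;
        }
    }
}
```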
In the past LLVM used to produce this from the C code for the inner loop:

```asm
LBB3_6:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	addl	$4, %edi
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addl	$8, %ebx
	decl	%esi
	addsd	%xmm1, %xmm0
	jne	LBB3_6
```
Now ldc2 produces this code, accessing memory only three times, so the main problem of this bug report seems solved:

```asm
LBB0_5:
	movl	(%edi), %edx
	movsd	(%ecx,%edx,8), %xmm0
	mulsd	(%ebp), %xmm0
	addsd	%xmm1, %xmm0
	addl	$4, %edi
	addl	$8, %ebp
	decl	%esi
	movaps	%xmm0, %xmm1
	jne	LBB0_5
```
Today GCC 4.8.0 (`gcc -std=c99 -Ofast -msse3 -mfpmath=sse -march=native -S`) produces this code for the inner loop, which is better than the code produced by ldc2:

```asm
L41:
	movl	(%esi,%eax,4), %edx
	movsd	(%edi,%edx,8), %xmm0
	mulsd	(%ebx,%eax,8), %xmm0
	addl	$1, %eax
	cmpl	%ecx, %eax
	addsd	%xmm0, %xmm1
	jne	L41
```
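The structural difference between the two loops can be rendered in C (a hand-written illustration, not compiler output or decompilation): gcc keeps a single induction variable (`%eax`) and uses indexed addressing for all three memory operands, while the ldc2/llvm-gcc code bumps two separate pointers and a count-down counter every iteration:

```c
/* Shape of the gcc loop: one induction variable, indexed addressing.
   lo/hi play the role of row[r] and row[r+1]. */
double dot_indexed(const double *x, const double *val,
                   const int *col, int lo, int hi)
{
    double sum = 0.0;
    for (int i = lo; i < hi; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Shape of the llvm-gcc/ldc2 loop: the col and val pointers and a
   remaining-count are each updated per iteration (three increments
   instead of one). */
double dot_bumped(const double *x, const double *val,
                  const int *col, int n)
{
    double sum = 0.0;
    while (n-- > 0)
        sum += x[*col++] * *val++;
    return sum;
}
```

Both compute the same dot product; the difference is only in how many induction updates the loop carries, which is what the instruction counts above reflect.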
(In scientific code like this, even though the whole compiled program may be many thousands of asm instructions long, often the only part that matters is the inner loop of one or a few computational kernels; here that loop is just 7 asm instructions with gcc. Even a single bad instruction there can slow the whole program down a little or a lot. The rest of the program could be compiled with -Os or -Oz with little change in overall run time.)
The greedy register allocator is unable to find a good split for the argument load, so it rematerializes it inside the loop as well.
It's a remat'ed stack argument load. Before register allocation, the load is outside the loop.
Random comment: please turn on -asm-verbose in the llvm .s dumps. That will tell you if it's a spill or not (for example).
LLVM IR testcase: attached is an LLVM IR testcase. The loop is in %for.body22.
This looks like a live-range-splitting and/or spill weight heuristic case, but could also be handled by post-RA LICM. %ebp is needlessly reloaded inside the loop:
```asm
LBB0_7:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addsd	%xmm1, %xmm0
	addl	$4, %edi
	addl	$8, %ebx
	decl	%esi
	jne	LBB0_7
```
Following a suggestion by edwin on IRC, I changed the compilation flags a little so that gcc 4.5 also uses the SSE registers:
```
gcc -mfpmath=sse -O3 -s -fomit-frame-pointer -msse3 -march=core2
```
Timings, NLOOPS=400_000, seconds:
- gcc 4.5: 5.55
- llvm-gcc 2.7: 5.70
llvm-gcc 2.7:

```asm
LBB3_6:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	addl	$4, %edi
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addl	$8, %ebx
	decl	%esi
	addsd	%xmm1, %xmm0
	jne	LBB3_6
```
gcc 4.5:

```asm
L44:
	movl	(%esi,%eax,4), %edx
	movsd	(%edi,%edx,8), %xmm0
	mulsd	(%ebx,%eax,8), %xmm0
	incl	%eax
	addsd	%xmm0, %xmm1
	cmpl	%eax, %ecx
	jg	L44
```
This shows that the problem is not caused by x87 vs. SSE registers.
Extended Description
The SciMark2 benchmark shows two points where llvm-gcc 2.7 generates suboptimal asm code (they can be found by visual inspection of the asm and by comparing the timings with gcc 4.5).
The less significant of the two is in the "Sparse matmult" part of SciMark2 (the other, more important problem is left to another bug report). The nice thing about this one is that it is compact and easy to spot.
Attached is a reduced version of SciMark2 in C that performs only the Sparse matmult.
Timings, NLOOPS=400_000, seconds:
- gcc 4.5: 5.49
- llvm-gcc 2.7: 5.69
CPU Celeron 2.13 GHz, Windows Vista 32 bit.
In both cases the code was compiled with `-O3 -s -fomit-frame-pointer -msse3 -march=core2`.
The situation can also be seen by inspecting the asm generated by the two compilers:
gcc, inner loop of SparseCompRow_matmult():

```asm
L45:
	movl	(%esi,%eax,4), %edx
	fldl	(%edi,%edx,8)
	fmull	(%ebx,%eax,8)
	incl	%eax
	faddp	%st, %st(1)
	cmpl	%eax, %ecx
	jg	L45
```
llvm-gcc, inner loop of SparseCompRow_matmult():

```asm
LBB3_6:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	addl	$4, %edi
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addl	$8, %ebx
	decl	%esi
	addsd	%xmm1, %xmm0
	jne	LBB3_6
```
Here llvm-gcc performs four loads from memory instead of the necessary three. Normally one extra load is not an important difference, but inside the inner loop of a numerical FP kernel even a few extra asm instructions can be significant.