llvmbot opened 14 years ago
Renamed the issue title because the original main issue now seems solved.
I have repeated this test with an almost-SVN LLVM 3.3. I have not taken timings again, so here I just show the asm produced. Below is a translation to the D language, containing just the interesting routine, the sparse matrix multiplication:
```d
void SparseCompRow_matmult(in size_t M, double* y, in double* val,
                           in size_t* row, in size_t* col, in double* x,
                           in size_t NUM_ITERATIONS) pure nothrow {
    foreach (immutable size_t reps; 0 .. NUM_ITERATIONS) {
        foreach (immutable size_t r; 0 .. M) {
            double sum = 0.0;
            foreach (immutable size_t i; row[r] .. row[r + 1])
                sum += x[col[i]] * val[i];
            y[r] = sum;
        }
    }
}
void main() {}
```
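For reference, a C sketch equivalent to the D routine above (my own transcription, not the exact SciMark2 source; SciMark2 proper uses `int` indices, while `size_t` is kept here to mirror the D signature):

```c
#include <stddef.h>

/* CSR (compressed sparse row) matrix-vector product. For each of the M
   rows, the nonzeros of row r live in val[row[r] .. row[r+1]-1], with
   their column indices in col[]. The outer reps loop only repeats the
   work for benchmarking; y[] is overwritten each repetition. */
void SparseCompRow_matmult(size_t M, double *y,
                           const double *val, const size_t *row,
                           const size_t *col, const double *x,
                           size_t NUM_ITERATIONS)
{
    for (size_t reps = 0; reps < NUM_ITERATIONS; reps++) {
        for (size_t r = 0; r < M; r++) {
            double sum = 0.0;
            for (size_t i = row[r]; i < row[r + 1]; i++)
                sum += x[col[i]] * val[i];  /* the inner loop shown in the asm */
            y[r] = sum;
        }
    }
}
```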
In the past LLVM used to produce this from the C code for the inner loop:

```asm
LBB3_6:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	addl	$4, %edi
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addl	$8, %ebx
	decl	%esi
	addsd	%xmm1, %xmm0
	jne	LBB3_6
```
Now ldc2 produces this code, accessing memory only three times, so the main problem of this bug report seems solved:

```asm
LBB0_5:
	movl	(%edi), %edx
	movsd	(%ecx,%edx,8), %xmm0
	mulsd	(%ebp), %xmm0
	addsd	%xmm1, %xmm0
	addl	$4, %edi
	addl	$8, %ebp
	decl	%esi
	movaps	%xmm0, %xmm1
	jne	LBB0_5
```
Today GCC 4.8.0 (`gcc -std=c99 -Ofast -msse3 -mfpmath=sse -march=native -S`) produces this code for the inner loop, which is better than the code produced by ldc2:

```asm
L41:
	movl	(%esi,%eax,4), %edx
	movsd	(%edi,%edx,8), %xmm0
	mulsd	(%ebx,%eax,8), %xmm0
	addl	$1, %eax
	cmpl	%ecx, %eax
	addsd	%xmm0, %xmm1
	jne	L41
```
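The structural difference between the two loops can be rendered in C (a hand-written illustration, not compiler output or decompilation): gcc keeps a single induction variable (`%eax`) and uses indexed addressing for all three memory operands, while the ldc2/llvm-gcc code bumps two separate pointers and a count-down counter every iteration:

```c
/* Shape of the gcc loop: one induction variable, indexed addressing.
   lo/hi play the role of row[r] and row[r+1]. */
double dot_indexed(const double *x, const double *val,
                   const int *col, int lo, int hi)
{
    double sum = 0.0;
    for (int i = lo; i < hi; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Shape of the llvm-gcc/ldc2 loop: the col and val pointers and a
   remaining-count are each updated per iteration (three increments
   instead of one). */
double dot_bumped(const double *x, const double *val,
                  const int *col, int n)
{
    double sum = 0.0;
    while (n-- > 0)
        sum += x[*col++] * *val++;
    return sum;
}
```

Both compute the same dot product; the difference is only in how many induction updates the loop carries, which is what the instruction counts above reflect.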
(In scientific code like this, even though the whole compiled program may be many thousands of asm instructions long, often the only part that matters is the inner loop of one or a few computational kernels; here that loop is just 7 asm instructions with gcc. Even a single bad instruction there can slow the whole program down a little or a lot. The rest of the program could be compiled with -Os or -Oz with little change in overall run time.)
The greedy register allocator is unable to find a good split for the argument load, so it rematerializes it inside the loop as well.
It's a remat'ed stack argument load. Before register allocation, the load is outside the loop.
Random comment: please turn on -asm-verbose in the llvm .s dumps. That will tell you if it's a spill or not (for example).
LLVM IR testcase: attached is an LLVM IR testcase. The loop is in %for.body22.
This looks like a live-range-splitting and/or spill weight heuristic case, but could also be handled by post-RA LICM. %ebp is needlessly reloaded inside the loop:
```asm
LBB0_7:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addsd	%xmm1, %xmm0
	addl	$4, %edi
	addl	$8, %ebx
	decl	%esi
	jne	LBB0_7
```
Following a suggestion by edwin on IRC, I changed the compilation flags a little so that gcc 4.5 also uses the SSE registers:
```
gcc -mfpmath=sse -O3 -s -fomit-frame-pointer -msse3 -march=core2
```
Timings, NLOOPS=400_000, seconds:
- gcc 4.5: 5.55
- llvm-gcc 2.7: 5.70
llvm-gcc 2.7:

```asm
LBB3_6:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	addl	$4, %edi
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addl	$8, %ebx
	decl	%esi
	addsd	%xmm1, %xmm0
	jne	LBB3_6
```
gcc 4.5:

```asm
L44:
	movl	(%esi,%eax,4), %edx
	movsd	(%edi,%edx,8), %xmm0
	mulsd	(%ebx,%eax,8), %xmm0
	incl	%eax
	addsd	%xmm0, %xmm1
	cmpl	%eax, %ecx
	jg	L44
```
This shows that the problem is not caused by x87 vs. SSE registers.
Extended Description
The SciMark2 benchmark shows two points where llvm-gcc 2.7 generates suboptimal asm code (they can be found by visual inspection of the asm and by comparing the timings with gcc 4.5).
The less significant of the two is in the "Sparse matmult" part of SciMark2 (the other, more important problem is left to another bug report). The nice thing about this one is that it is compact and easy to spot.
Attached is a reduced version of SciMark2 in C that performs only the Sparse matmult.
Timings, NLOOPS=400_000, seconds:
- gcc 4.5: 5.49
- llvm-gcc 2.7: 5.69
CPU Celeron 2.13 GHz, Windows Vista 32 bit.
In both cases the code was compiled with `-O3 -s -fomit-frame-pointer -msse3 -march=core2`.
The situation can also be seen by inspecting the asm generated by the two compilers:
gcc, inner loop of SparseCompRow_matmult():

```asm
L45:
	movl	(%esi,%eax,4), %edx
	fldl	(%edi,%edx,8)
	fmull	(%ebx,%eax,8)
	incl	%eax
	faddp	%st, %st(1)
	cmpl	%eax, %ecx
	jg	L45
```
llvm-gcc, inner loop of SparseCompRow_matmult():

```asm
LBB3_6:
	movl	(%edi), %edx
	movl	52(%esp), %ebp
	addl	$4, %edi
	movsd	(%ebp,%edx,8), %xmm1
	mulsd	(%ebx), %xmm1
	addl	$8, %ebx
	decl	%esi
	addsd	%xmm1, %xmm0
	jne	LBB3_6
```
Here llvm-gcc performs four loads from memory instead of the necessary three. Normally one extra load is not an important difference, but inside the inner loop of a numerical FP kernel even a few extra asm instructions can be significant.