llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Partial loop unwinding with variable iteration count #5912

Open llvmbot opened 14 years ago

llvmbot commented 14 years ago
Bugzilla Link 5540
Version 2.6
OS Windows NT
Reporter LLVM Bugzilla Contributor
CC @asl,@lattner,@sunfishcode

Extended Description

See bug 5501 to understand where this comes from: http://llvm.org/bugs/show_bug.cgi?id=5501

After a discussion on IRC, it seems LLVM is not able to perform partial loop unwinding when the iteration count is a run-time variable. A simple example in C:

    #include <stdio.h>
    #include <stdlib.h>

    __attribute__((noinline)) int foo(int x) { return x * x; }

    int main() {
        int n = atoi("100"); // a run-time value
        int i, tot = 0;
        for (i = 0; i < n; i++)
            tot += foo(i);
        printf("%d\n", tot);
        return 0;
    }

But the Java HotSpot VM shows that this optimization can be useful. Generally an unroll factor of 2, 4 or 8 is enough (4 is typical; a much larger value helps only in special situations). The factor can be chosen heuristically from the number of asm instructions inside the loop.

A small example in C that shows how to perform the remainder (n modulo the unroll factor) first, followed by the 4x unrolled loop (as done by HotSpot):

    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        int n = atoi("20");  // a run-time value
        int unroll = 4;      // usually 2, 4 or 8; normally 4

        int rest = n % unroll;

        int i;
        for (i = 0; i < rest; i++)
            printf("%d\n", i);

        for (i = rest; i < n; i += 4) {
            printf("%d\n", i + 0);
            printf("%d\n", i + 1);
            printf("%d\n", i + 2);
            printf("%d\n", i + 3);
        }

        return 0;
    }

Run-time profiling can also help locate which loops benefit from such dynamic unrolling. In the meantime, annotations and compilation flags are blunt tools, but they are enough to request this optimization from the backend.

llvmbot commented 12 years ago

I have not yet tested it (because of a Clang bug I still have to report), but does -unroll-runtime solve this problem? How does the LU benchmark perform with -unroll-runtime?

https://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp?revision=146245&view=markup&sortby=date

llvmbot commented 14 years ago

I have found a situation where LDC doesn't perform partial loop unwinding even when the number of loop iterations is known at compile time. This is a small example benchmark:

    #include "stdio.h"
    #include "stdlib.h"

    // L is a prime number
    #define L 53

    double M1[L * L];
    double M2[L * L];
    double M3[L * L];

    void init_matrices() {
        int j;
        for (j = 0; j < L * L; j++) {  // loop A *****
            M1[j] = j + 1;
            M2[j] = 1.0 / (j + 1);
        }
    }

    void M1_mul_M2() {
        int i, j, k;
        for (i = 0; i < L; i++)
            for (j = 0; j < L; j++) {
                double sum = 0;
                for (k = 0; k < L; k++)  // loop B ****
                    sum += M1[L * i + k] * M2[L * k + j];
                M3[L * i + j] = sum;
            }
    }

    int main(int argc, char **argv) {
        int n = argc == 2 ? atoi(argv[1]) : 20000;
        init_matrices();

        double tot = 0.0;

        int i;
        for (i = 0; i < n; i++) {
            M1_mul_M2();
            tot += M3[i % (L * L)];
        }

        printf("tot:%f\n", tot);
        return 0;
    }

GCC is able to vectorize this (in loop B it performs two multiplications in parallel), which LLVM isn't yet able to do.

But loop B can also be partially unrolled: the compiler can unroll loop B 10 times and perform the leftover (three) iterations after the loop. The situation is similar to the one I've shown before, but here it's simpler for the compiler because everything is known at compile time.

llvmbot commented 14 years ago

Thank you for your explanation, that shows me how to act :-)

This optimization is present and often performed by Java HotSpot, so Sun engineers think it's not a waste of their time. And my experiments show they are right.

In the last C attachment I've shown a better example: a stripped-down version of the SciMark2 benchmark in C that performs only the LU benchmark and prints the MFlops. It uses conditional compilation: if you define DO_UNROLL it performs an optimization quite similar to the one done by HotSpot, otherwise it uses the original SciMark2 code, so you can compare the performance improvement. Timings and more info are at the top of the code.

Note that in this code I assume the loop count is NOT known at compile-time.

If you want I can show you another code example where this optimization is useful.

Note that, in general, the more code there is in the loop, the smaller the useful unroll factor. Here HotSpot unwinds 8 times (though I've seen that about 10 is optimal) because the inner loop contains just one line: Aii[jj] -= AiiJ * Aj[jj]; But in another example, with more code inside the loop, HotSpot unwinds only 4 times, because unwinding more would probably put the CPU code cache under too much pressure, reducing performance. Life is made of compromises.

I think this feature is not too hard to implement, and it can be useful if applied wisely. But there's a problem: a static compiler generally doesn't know which loops to partially unroll. HotSpot knows because Java code is usually profiled at run time, while C/C++/D code compiled by LLVM is not (unless LLVM profile-guided optimization is used). So this may require user annotations, profile-guided optimization, or smart compiler heuristics to decide where and what to partially unroll (how much to unroll is probably not hard to determine: just count how many instructions are inside the loop). Both the annotation route and the profile-guided optimization route seem doable and not too hard. The heuristics route looks harder to me.

lattner commented 14 years ago

That's nice, but why do you think this is beneficial?

llvmbot commented 14 years ago

Reduced SciMark2 that performs the LU benchmark only

lattner commented 14 years ago

bearophile, I saw your comment on IRC. The reason this is closed is that it is not a generally profitable optimization, at least not with the example you give here.

lattner commented 14 years ago

This bug has been marked as a duplicate of bug llvm/llvm-project#5873