Open llvmbot opened 12 years ago
I haven't look at it that closely. It's a generally known problem that codegen isn't using TBAA very aggressively though.
Hi Chris, thank you for having a look. Sorry if I am off target here, but is it not the case that tbaa info would tell us if two pointers can/cannot alias each other? I don't see how this helps. If I were to change the p from an int to a double (in which case presumably tbaa would not be able to help me), I still expect to get the optimal code. The point is that the whole sum() function call happens before the p[i] assignment.
Moreover changing the assignment from p[i] = 5; to *p = 5; does produce the optimal version of the code. So it has something to do with the loop variable.
The big failure here seems to be that we're not folding the loads (e.g. 56(%edx,%ebx,8)) into the addsd's. I think that this is because regalloc and isle are not able to prove that the integer store to p[i] can't alias the fp loads. Jakob, do you concur? Is there a straight-forward way to propagate TBAA info late enough to catch this?
Extended Description
I have run into the following strange llvm behavior. For the C program below, function sum() gets inlined in foo() but the code generated looks very suboptimal (the code is an extract from a larger program).
Below I show the 32-bit x86 assembly as produced by the demo page on the llvm home page ("Output A"). As you can see from the assembly, after sum() is inlined and the loop unrolled, the generated code loads all values of array v (aka &x[i]) into registers before adding any numbers up -- in the process it runs out of registers and starts spilling (in essense copying the doubles from one area of memory to another). After that, it proceeds to add the numbers up.
But why not add the numbers into 1 register directly? Clearly this is what the C code is doing -- nothing could have been more explicit. The really strange thing, is that in the assingment to p[i] is removed (line marked with "xxx..."), then the code produced is optimal and exactly what one expects. I show this result in "Output B" where you get a beatiful sequence of addsd into register xmm2.
It's all very strange and it points to some questionable decision making on the part of llvm. I tried different versions of the sum() function (elliminating the loop for example) but it does not help. Another observation is that the loop variable i (in foo) must be involved: if one does *p = 5 (instead of p[i] = 5), the problem also goes away.
Until a fix comes, if you have a suggestion on how to get around this problem please let me know.
Here is the code:
double sum( double* v, int v_siz ) { double sum = 0.0; int i = 0;
for (; i != v_siz; ++i) sum += v[i];
return sum; }
double foo(double x, int p, int k) { double s = 0.0; for (int i = 0; i != k;++i) { s += sum(&x[i], 18); p[i] = 5; // xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx } return s; }
====== Output A ======
foo: # @foo .Ltmp12: .cfi_startproc
BB#0:
.Ltmp13: .cfi_def_cfa_offset 8 pushl %edi .Ltmp14: .cfi_def_cfa_offset 12 pushl %esi .Ltmp15: .cfi_def_cfa_offset 16 subl $88, %esp .Ltmp16: .cfi_def_cfa_offset 104 .Ltmp17: .cfi_offset %esi, -16 .Ltmp18: .cfi_offset %edi, -12 .Ltmp19: .cfi_offset %ebx, -8 pxor %xmm0, %xmm0 movl 112(%esp), %eax testl %eax, %eax je .LBB1_3
BB#1:
.LBB1_2: # %.lr.ph.i
=>This Inner Loop Header: Depth=1
.LBB1_3: # %._crit_edge movsd %xmm0, 80(%esp) fldl 80(%esp) addl $88, %esp popl %esi popl %edi popl %ebx ret .Ltmp20: .size foo, .Ltmp20-foo .Ltmp21: .cfi_endproc .Leh_func_end1:
====== Output B ======
foo: # @foo .Ltmp11: .cfi_startproc
BB#0:
.Ltmp12: .cfi_def_cfa_offset 8 pushl %esi .Ltmp13: .cfi_def_cfa_offset 12 subl $12, %esp .Ltmp14: .cfi_def_cfa_offset 24 .Ltmp15: .cfi_offset %esi, -12 .Ltmp16: .cfi_offset %edi, -8 pxor %xmm0, %xmm0 movl 32(%esp), %eax testl %eax, %eax je .LBB1_3
BB#1:
.LBB1_2: # %.lr.ph.i
=>This Inner Loop Header: Depth=1
.LBB1_3: # %._crit_edge movsd %xmm0, (%esp) fldl (%esp) addl $12, %esp popl %esi popl %edi ret