Different spill behavior in for loop when passing elements of array into function

blackgeorge-boom commented 1 year ago

#include <stdio.h>

int fmul(int x, int y) { return x * y; }

int loop(int *a1, int *a2, int len)
{
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += fmul(a1[i], a2[i]);   // <---
    }
    return sum;
}

int main()
{
    int a1[8];
    int a2[8];
    int r1 = loop(a1, a2, 8);
    printf("%d", r1);
    return 0;
}

AArch64:

0000000000501040 loop:
...
  501050: 5f 04 00 71                   cmp w2, #0x1
  501054: 4b 02 00 54                   b.lt    #0x48 <loop+0x5c>
  501058: e8 03 02 2a                   mov w8, w2
  50105c: f4 03 1f aa                   mov x20, xzr
  501060: f3 03 1f 2a                   mov w19, wzr
  501064: e8 13 00 f9                   str x8, [sp, #0x20]
  501068: e0 87 00 a9                   stp x0, x1, [sp, #0x8]
  50106c: 88 f6 7e d3                   lsl x8, x20, #2
  501070: 00 68 68 b8                   ldr w0, [x0, x8]
  501074: 21 68 68 b8                   ldr w1, [x1, x8]
  501078: e1 03 03 29                   stp w1, w0, [sp, #0x18]   <--- extra spill of w1
  50107c: e9 ff ff 97                   bl  #-0x5c <fmul>
...

X86:

0000000000501040 <loop>:
...
  50104b:   85 d2                   test   edx,edx
  50104d:   0f 8e 54 00 00 00       jle    5010a7 <loop+0x67>
  501053:   89 d0                   mov    eax,edx
  501055:   48 89 45 d8             mov    QWORD PTR [rbp-0x28],rax
  501059:   31 c9                   xor    ecx,ecx
  50105b:   31 d2                   xor    edx,edx
  50105d:   48 89 75 c8             mov    QWORD PTR [rbp-0x38],rsi
  501061:   48 89 7d c0             mov    QWORD PTR [rbp-0x40],rdi
  501065:   89 55 e4                mov    DWORD PTR [rbp-0x1c],edx   <--- Extra spill of arg3 (edx)
  501068:   8b 3c 8f                mov    edi,DWORD PTR [rdi+rcx*4]
  50106b:   89 7d d4                mov    DWORD PTR [rbp-0x2c],edi
  50106e:   44 8b 3c 8e             mov    r15d,DWORD PTR [rsi+rcx*4]   <--- Use of CSR2 (r15d) instead of spilling
  501072:   44 89 fe                mov    esi,r15d
  501075:   48 89 cb                mov    rbx,rcx
  501078:   0f 1f 00                nop    DWORD PTR [rax]
  50107b:   e8 a0 ff ff ff          call   501020 <fmul>
...

blackgeorge-boom commented 1 year ago

Right after coalescing:

AArch64:

624B      %20:gpr32 = COPY killed $w8
640B      %26:gpr32 = nsw ADDWrr %20:gpr32, %26:gpr32

X86:

592B      %19:gr32temp = COPY killed $eax
624B      %19:gr32temp = nsw ADD32rr %19:gr32temp(tied-def 0), %21:gr32temp, implicit-def dead $eflags
...
720B      %21:gr32temp = COPY %19:gr32temp

blackgeorge-boom commented 1 year ago

On AArch64, %26 appears to hold sum and %20 the return value, whereas on X86, %19 holds the return value and %21 holds sum.
Because sum has to be initialized, %26 gets a larger live range on AArch64, which leads to an eviction and a different allocation.

Intervals:

AArch64:

%26 [144r,176B:3)[304r,336B:0)[336B,368r:4)[400B,640r:1)[640r,800B:2)  0@304r 1@400B-phi 2@640r 3@144r 4@336B-phi weight:1.626846e-01

X86:

%19 [112r,176B:0)[320B,352r:2)[592r,624r:3)[624r,768B:1)  0@112r 1@624r 2@320B-phi 3@592r weight:3.078813e-01
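As a rough check of the "larger live range" claim, the segment lengths in the two dumps can be added up. The sketch below is only an approximation: it drops the r/B slot suffixes and treats each segment as a half-open range of slot indexes. The names `segment`, `aarch64_vreg26`, `x86_vreg19` and `total_length` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One half-open live segment [start, end), in slot-index units. */
struct segment { unsigned start, end; };

/* Segments of %26 (AArch64) and %19 (X86), copied from the dumps above.
   The r/B slot suffixes are dropped, which is a simplification. */
static const struct segment aarch64_vreg26[] = {
    {144, 176}, {304, 336}, {336, 368}, {400, 640}, {640, 800},
};
static const struct segment x86_vreg19[] = {
    {112, 176}, {320, 352}, {592, 624}, {624, 768},
};

/* Sum of segment lengths: a rough proxy for how long the value is live. */
static unsigned total_length(const struct segment *segs, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        assert(segs[i].end >= segs[i].start);
        total += segs[i].end - segs[i].start;
    }
    return total;
}
```

`total_length(aarch64_vreg26, 5)` gives 496 and `total_length(x86_vreg19, 4)` gives 272, so by this measure the AArch64 interval covers almost twice as many slots.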

blackgeorge-boom commented 1 year ago

The above is probably caused by an earlier modification: https://github.com/blackgeorge-boom/llvm-project/pull/33

blackgeorge-boom commented 1 year ago

The problem is that $wzr on AArch64 can be copied to any register, whereas, after the above PR, MOV32r0 can only be copied to temp registers. We can hack the copies from $wzr to use temp registers, but the subsequent copies do not follow suit:

224B      %15:gpr32_and_gpr32temp = COPY $wzr
240B      %13:gpr32all = COPY %15:gpr32_and_gpr32temp
    ...
304B      %26:gpr32 = COPY %13:gpr32all

The coalescer then collapses the above into something like:

304B      %26:gpr32 = COPY $wzr

But this is not an issue with X86:

224B      %14:gr32temp = MOV32r0 implicit-def dead $eflags
240B      %12:gr32 = COPY %14:gr32temp
    ...
288B      %21:gr32 = COPY %12:gr32

turns into:

288B      %21:gr32temp = MOV32r0 implicit-def dead $eflags

This is due to rematerialization: MOV32r0 is re-emitted at the copy, and %21 ends up in gr32temp.
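The difference between the two outcomes can be sketched with a toy model. This is not LLVM's code. Register classes are modelled as bitmasks over eight imaginary registers, and `RC_ALL`, `RC_TEMP`, `coalesce_phys_copy` and `rematerialize` are hypothetical names chosen for the illustration. It assumes that coalescing a copy from a physical register leaves the destination class untouched, while rematerialization narrows the destination to what the re-emitted instruction's def operand allows.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register classes as bitmasks over 8 imaginary registers. */
typedef uint8_t regclass;
#define RC_ALL  ((regclass)0xFF)  /* stand-in for gr32 / gpr32 */
#define RC_TEMP ((regclass)0x0F)  /* stand-in for the temp subclass */

/* Coalescing a COPY from a physical register: the destination keeps its
   own class, so a temp-only constraint on intermediate copies is lost. */
static regclass coalesce_phys_copy(regclass dst)
{
    return dst;
}

/* Rematerializing the defining instruction at the copy: the destination
   is narrowed to what the instruction's def operand allows. */
static regclass rematerialize(regclass dst, regclass def_operand)
{
    regclass narrowed = dst & def_operand;
    assert(narrowed != 0);  /* the two classes must overlap */
    return narrowed;
}
```

In this model, `coalesce_phys_copy(RC_ALL)` stays `RC_ALL`, like %26 staying in gpr32. `rematerialize(RC_ALL, RC_TEMP)` yields `RC_TEMP`, like %21 ending up in gr32temp.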

blackgeorge-boom commented 1 year ago

We can fix that by modifying the coalescing process.

systems-nuts / unifico

Issue #256