systems-nuts / unifico

Compiler and build harness for heterogeneous-ISA binaries with the same stack layout.

`c-print-results`: Different spill order of arguments #291

Closed by blackgeorge-boom 10 months ago

blackgeorge-boom commented 11 months ago
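#include <stdio.h>

/* CLASS, NPBVERSION, COMPILETIME, CC, CLINK, C_LIB, C_INC, CFLAGS and
   CLINKFLAGS are macros defined outside this excerpt. */
void results(char *name, char class, int n1, int n2, int n3, int niter,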
                     double t, double mops, char *optype,
                     int passed_verification, char *npbversion,
                     char *compiletime, char *cc, char *clink, char *c_lib,
                     char *c_inc, char *cflags, char *clinkflags)
{
    printf("%c\n", class);
    if (n3 == 0) {
        n3++;
    }
    else
        printf("%4dx%4dx%4d\n", n1, n2, n3);
}

int main() {

    double timecounter = 0.0;

    results(
            "IS", CLASS, 1, 64, 0, 3, timecounter, 1.0,
            "keys ranked", 1, NPBVERSION, COMPILETIME, CC, CLINK,
            C_LIB, C_INC, CFLAGS, CLINKFLAGS);

    return 0;

}

The regalloc debug output shows that corresponding virtual registers are allocated differently on the two targets. The cause is a difference in live interval weights, calculated as UseDefFreq / (Size + 25*SlotIndex::InstrDist), which leads to different evictions. The relative order of the two weights differs between the targets (AArch64 %14 corresponds to X86 %18):

AArch64

%4: 4.38 / (400 + 400) = 4.38 * 1.25e-3 ≈ 5.47e-3
%14: 3.03 / (168 + 400) = 3.03 * 1.76e-3 ≈ 5.33e-3

X86

%4: 4.38 / (512 + 400) = 4.38 * 1.10e-3 ≈ 4.80e-3
%18: 3.03 / (216 + 400) = 3.03 * 1.62e-3 ≈ 4.92e-3

So %4 outweighs %14 on AArch64, while on X86 %18 outweighs %4.
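
The comparison above can be reproduced with a small sketch. It assumes SlotIndex::InstrDist is 16, which is consistent with the constant 400 (= 25 * 16) in the denominators; the function names `normalized_weight` and `first_is_heavier` are hypothetical and are not LLVM APIs.

```c
#include <stdio.h>

/* Assumption: SlotIndex::InstrDist is 16, matching the constant
 * 400 (= 25 * 16) that appears in the denominators above. */
#define INSTR_DIST 16.0

/* Weight as quoted in the issue: UseDefFreq / (Size + 25 * InstrDist). */
double normalized_weight(double use_def_freq, double size)
{
    return use_def_freq / (size + 25.0 * INSTR_DIST);
}

/* Hypothetical helper: prints the weights of two competing intervals
 * and returns 1 if the first one is heavier, 0 otherwise. */
int first_is_heavier(const char *target,
                     double freq_a, double size_a,
                     double freq_b, double size_b)
{
    double wa = normalized_weight(freq_a, size_a);
    double wb = normalized_weight(freq_b, size_b);
    printf("%s: %.3e vs %.3e\n", target, wa, wb);
    return wa > wb;
}
```

Calling `first_is_heavier("AArch64", 4.38, 400, 3.03, 168)` returns 1, while `first_is_heavier("X86", 4.38, 512, 3.03, 216)` returns 0: the same pair of intervals is ordered differently purely because the Size terms differ.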
blackgeorge-boom commented 10 months ago

The different live interval weights are probably caused by gaps in the slot numbering, left behind by different instruction removals in the passes that precede greedy.

AArch64:

16B   %4:gpr32 = COPY $w4
32B   %3:gpr32 = COPY $w3
48B   %2:gpr32 = COPY $w2
64B   %1:gpr32 = COPY $w1
80B   ADJCALLSTACKDOWN 0, 0, implicit-def dead $sp, implicit $sp
112B  $x0 = MOVaddr target-flags(aarch64-page) @main__str__c__, target-flags(aarch64-pageoff, aarch64-nc) @main__str__c__

X86:

0B  bb.0.entry:
      successors: %bb.2(0x30000000), %bb.1(0x50000000); %bb.2(37.50%), %bb.1(62.50%)
      liveins: $esi, $edx, $ecx, $r8d
16B   MOV32mr %stack.0, 1, $noreg, 0, $noreg, $r8d :: (store 4 into %stack.0)
32B   MOV32mr %stack.1, 1, $noreg, 0, $noreg, $ecx :: (store 4 into %stack.1)
48B   %23:gr32 = COPY $edx
64B   %1:gr32 = COPY $esi
80B   ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp
144B  $rdi = LEA64r $rip, 1, $noreg, @main__str__c__, $noreg

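The instruction after the call-stack adjustment is numbered 112B on AArch64 but 144B on X86, so it appears farther from its predecessor (80B) on X86, even though the two instructions are adjacent in both cases. This could be fixed by "packing" the slot indexes of the instructions right before greedy.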
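
As a rough illustration of what "packing" means here, the following toy model renumbers a list of slot indexes so that neighbouring instructions are exactly one instruction distance apart. It is only a sketch under assumptions: the stride of 16 is inferred from the dumps above, and `pack_slot_indexes` and `span` are hypothetical names, not LLVM functions and not the upstream implementation.

```c
#include <stddef.h>

/* Assumption: adjacent instructions are 16 slots apart, as in the
 * 16B, 32B, 48B, ... numbering of the dumps above. */
#define INSTR_DIST 16u

/* Renumber idx[0..n-1] (ascending) in place so that each entry is
 * exactly INSTR_DIST after its predecessor, keeping idx[0] fixed. */
void pack_slot_indexes(unsigned *idx, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        idx[i] = idx[i - 1] + INSTR_DIST;
}

/* Distance in slots between the first and the last instruction. */
unsigned span(const unsigned *idx, size_t n)
{
    return n ? idx[n - 1] - idx[0] : 0;
}
```

With the AArch64 indexes {16, 32, 48, 64, 80, 112} and the X86 indexes {16, 32, 48, 64, 80, 144}, `span` gives 96 and 128 before packing and 80 for both afterwards, so any size derived from these distances would no longer depend on which instructions each backend removed earlier.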

blackgeorge-boom commented 10 months ago

Work on this is already underway upstream in LLVM, and we will try to add it to our LLVM as well: https://github.com/llvm/llvm-project/pull/67038