llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.07k stars 11.99k forks source link

i386-on-i386 build is different from i386-on-amd64 build #99396

Closed cperciva closed 3 months ago

cperciva commented 3 months ago

On FreeBSD we see a couple binaries which build differently for the i386 target depending on whether they are running on an amd64 host or on an "i386" host (in fact, i386 binaries running on an amd64 host).

I have narrowed it down to the Machine Instruction Scheduler step (running with -mllvm -enable-misched=0 prevents the divergence between i386 and amd64 hosts); looking at the produced assembly code it looks like different choices are made (with very low probability -- a few times out of the entire OS build) in the register allocator. In my reproducer:

Test case: clang -fpic -O2 -S reduced.ll with the reduced.ll.txt file produces i386-on-i386.asm.txt on an i386 host and i386-on-amd64.asm.txt on an amd64 host.

Clang/LLVM versions:

amd64 host:
FreeBSD clang version 18.1.3 (https://github.com/llvm/llvm-project.git llvmorg-18.1.3-0-gc13b7485b879)
Target: x86_64-unknown-freebsd15.0
Thread model: posix

i386 host:
FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67)
Target: i386-unknown-freebsd15.0
Thread model: posix

The test case was reduced from contrib/libarchive/libarchive/archive_write_set_format_7zip.c (FreeBSD's included copy of libarchive).

llvmbot commented 3 months ago

@llvm/issue-subscribers-backend-x86

Author: Colin Percival (cperciva)

On FreeBSD we see a couple binaries which build differently for the i386 target depending on whether they are running on an amd64 host or on an "i386" host (in fact, i386 binaries running on an amd64 host). I have narrowed it down to the Machine Instruction Scheduler step (running with `-mllvm -enable-misched=0` prevents the divergence between i386 and amd64 hosts); looking at the produced assembly code it looks like different choices are made (with very low probability -- a few times out of the entire OS build) in the register allocator. In my reproducer: * esi is used when building on i386 where edx is used when building on amd64. * eax is used when building on i386 where esi is used when building on amd64. Test case: `clang -fpic -O2 -S reduced.ll` with the [reduced.ll.txt](https://github.com/user-attachments/files/16275043/reduced.ll.txt) file produces [i386-on-i386.asm.txt](https://github.com/user-attachments/files/16275041/i386-on-i386.asm.txt) on an i386 host and [i386-on-amd64.asm.txt](https://github.com/user-attachments/files/16275042/i386-on-amd64.asm.txt) on an amd64 host. Clang/LLVM versions: ``` amd64 host: FreeBSD clang version 18.1.3 (https://github.com/llvm/llvm-project.git llvmorg-18.1.3-0-gc13b7485b879) Target: x86_64-unknown-freebsd15.0 Thread model: posix i386 host: FreeBSD clang version 18.1.6 (https://github.com/llvm/llvm-project.git llvmorg-18.1.6-0-g1118c2e05e67) Target: i386-unknown-freebsd15.0 Thread model: posix ``` The test case was reduced from `contrib/libarchive/libarchive/archive_write_set_format_7zip.c` (FreeBSD's included copy of libarchive).
cperciva commented 3 months ago

@EugeneZelenko Not sure why you tagged this as "question" -- unless you think "cross-building should produce the same results as native builds" is a question?

zygoloid commented 3 months ago

Have you verified that this difference is due to the host rather than nondeterminism? (Are the results reliably reproducible?)

cperciva commented 3 months ago

Yes, completely deterministic.

DimitryAndric commented 3 months ago

FWIW this still reproduces with llvmorg-19-init-18129-gaf5352fe8e66:

--- out-amd64.s 2024-07-19 09:21:52.437611000 +0200
+++ out-i386.s  2024-07-19 09:21:53.107307000 +0200
@@ -24,45 +24,46 @@
 .Ltmp0:
        addl    $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %ebx
        movl    fn4@GOT(%ebx), %edi
-       movl    (%edi), %edx
+       movl    (%edi), %esi
        movl    c@GOT(%ebx), %eax
-       movl    (%eax), %esi
-       testl   %esi, %esi
+       movl    (%eax), %eax
+       testl   %eax, %eax
        jne     .LBB0_3
 # %bb.1:                                # %if.then.g
        cmpl    $0, 1
        js      .LBB0_2
 .LBB0_3:                                # %if.end3.g
-       testl   %edx, %edx
-       je      .LBB0_7
+       testl   %esi, %esi
+       je      .LBB0_8
+# %bb.4:                                # %while.body.g.preheader
+       movl    %eax, -16(%ebp)                 # 4-byte Spill
        .p2align        4, 0x90
-.LBB0_4:                                # %while.body.g
+.LBB0_5:                                # %while.body.g
                                         # =>This Inner Loop Header: Depth=1
-       movl    %edx, -16(%ebp)                 # 4-byte Spill
-       pushl   %edx
-       pushl   %edi
        pushl   %esi
+       pushl   %edi
+       pushl   %eax
        calll   fn3@PLT
        addl    $12, %esp
        testl   %eax, %eax
-       js      .LBB0_5
-# %bb.6:                                # %if.end8.g
-                                        #   in Loop: Header=BB0_4 Depth=1
+       js      .LBB0_6
+# %bb.7:                                # %if.end8.g
+                                        #   in Loop: Header=BB0_5 Depth=1
        addl    $4, %edi
-       movl    $0, 4(%esi)
-       movl    $0, (%esi)
-       movl    -16(%ebp), %edx                 # 4-byte Reload
-       incl    %edx
-       jne     .LBB0_4
-       jmp     .LBB0_7
-.LBB0_5:                                # %if.then6.g
+       movl    -16(%ebp), %eax                 # 4-byte Reload
+       movl    $0, 4(%eax)
+       movl    $0, (%eax)
+       incl    %esi
+       jne     .LBB0_5
+       jmp     .LBB0_8
+.LBB0_6:                                # %if.then6.g
        pushl   $0
        calll   fn2@PLT
        addl    $4, %esp
-       jmp     .LBB0_7
+       jmp     .LBB0_8
 .LBB0_2:                                # %if.then2.g
        movl    0, %eax
-.LBB0_7:                                # %f.exit
+.LBB0_8:                                # %f.exit
        xorl    %eax, %eax
        addl    $4, %esp
        popl    %esi

I tried running both the x86_64 and i386-hosted compiler binaries with -mllvm -misched-print-dags, but those printed the same output:

SU(0):   %6:gr32_nosp = MOVPC32r 0, implicit $esp, implicit $ssp
  # preds left       : 0
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 12
  Successors:
    SU(1): Data Latency=1 Reg=%6
    SU(1): Out  Latency=0
  Pressure Diff      : GR16 -2
  Single Issue       : false;
SU(1):   %6:gr32_nosp = ADD32ri %6:gr32_nosp(tied-def 0), target-flags(x86-got-absolute-address) &_GLOBAL_OFFSET_TABLE_, implicit-def dead $eflags
  # preds left       : 2
  # succs left       : 3
  # rdefs left       : 0
  Latency            : 1
  Depth              : 1
  Height             : 11
  Predecessors:
    SU(0): Data Latency=1 Reg=%6
    SU(0): Out  Latency=0
  Successors:
    SU(4): Data Latency=1 Reg=%6
    SU(2): Data Latency=1 Reg=%6
    SU(6): Out  Latency=0
  Pressure Diff      :
  Single Issue       : false;
SU(2):   %17:gr32 = MOV32rm %6:gr32_nosp, 1, $noreg, target-flags(x86-got) @fn4, $noreg :: (load (s32) from got)
  # preds left       : 1
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 5
  Depth              : 2
  Height             : 9
  Predecessors:
    SU(1): Data Latency=1 Reg=%6
  Successors:
    SU(3): Data Latency=5 Reg=%17
  Pressure Diff      : GR16 -2
  Single Issue       : false;
SU(3):   %18:gr32 = MOV32rm %17:gr32, 1, $noreg, 0, $noreg :: (load (s32) from @fn4)
  # preds left       : 1
  # succs left       : 2
  # rdefs left       : 0
  Latency            : 5
  Depth              : 7
  Height             : 4
  Predecessors:
    SU(2): Data Latency=5 Reg=%17
  Successors:
    ExitSU: Ord  Latency=4 Artificial
    SU(6): Ord  Latency=0 Artificial
  Pressure Diff      : GR16 -2
  Single Issue       : false;
SU(4):   %8:gr32 = MOV32rm %6:gr32_nosp, 1, $noreg, target-flags(x86-got) @c, $noreg :: (load (s32) from got)
  # preds left       : 1
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 5
  Depth              : 2
  Height             : 10
  Predecessors:
    SU(1): Data Latency=1 Reg=%6
  Successors:
    SU(5): Data Latency=5 Reg=%8
  Pressure Diff      : GR16 -2
  Single Issue       : false;
SU(5):   %1:gr32 = MOV32rm %8:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from @c)
  # preds left       : 1
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 5
  Depth              : 7
  Height             : 5
  Predecessors:
    SU(4): Data Latency=5 Reg=%8
  Successors:
    SU(6): Data Latency=5 Reg=%1
  Pressure Diff      :
  Single Issue       : false;
SU(6):   TEST32rr %1:gr32, %1:gr32, implicit-def $eflags
  # preds left       : 3
  # succs left       : 1
  # weak succs left  : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 12
  Height             : 0
  Predecessors:
    SU(5): Data Latency=5 Reg=%1
    SU(1): Out  Latency=0
    SU(3): Ord  Latency=0 Artificial
  Successors:
    ExitSU: Ord  Latency=0 Artificial
    ExitSU: Ord  Latency=0 Cluster
  Pressure Diff      :
  Single Issue       : false;
ExitSU:   JCC_1 %bb.3, 5, implicit killed $eflags
  # preds left       : 2
  # succs left       : 0
  # weak preds left  : 1
  # rdefs left       : 0
  Latency            : 0
  Depth              : 12
  Height             : 0
  Predecessors:
    SU(6): Ord  Latency=0 Artificial
    SU(3): Ord  Latency=4 Artificial
    SU(6): Ord  Latency=0 Cluster
SU(0):   %11:gr32 = COPY killed $eax
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 1
  Successors:
    SU(1): Data Latency=1 Reg=%11
  Single Issue       : false;
SU(1):   TEST32rr %11:gr32, %11:gr32, implicit-def $eflags
  # preds left       : 1
  # succs left       : 1
  # weak succs left  : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 1
  Height             : 0
  Predecessors:
    SU(0): Data Latency=1 Reg=%11
  Successors:
    ExitSU: Ord  Latency=0 Artificial
    ExitSU: Ord  Latency=0 Cluster
  Single Issue       : false;
ExitSU:   JCC_1 %bb.7, 9, implicit killed $eflags
  # preds left       : 1
  # succs left       : 0
  # weak preds left  : 1
  # rdefs left       : 0
  Latency            : 0
  Depth              : 1
  Height             : 0
  Predecessors:
    SU(1): Ord  Latency=0 Artificial
    SU(1): Ord  Latency=0 Cluster
SU(0):   %17:gr32 = ADD32ri %17:gr32(tied-def 0), 4, implicit-def dead $eflags
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 0
  Successors:
    SU(1): Out  Latency=0
  Pressure Diff      :
  Single Issue       : false;
SU(1):   %18:gr32 = INC32r %18:gr32(tied-def 0), implicit-def $eflags
  # preds left       : 3
  # succs left       : 1
  # weak succs left  : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 0
  Predecessors:
    SU(0): Out  Latency=0
    SU(2): Ord  Latency=0 Artificial
    SU(3): Ord  Latency=0 Artificial
  Successors:
    ExitSU: Ord  Latency=0 Artificial
    ExitSU: Ord  Latency=0 Cluster
  Pressure Diff      :
  Single Issue       : false;
SU(2):   MOV32mi %1:gr32, 1, $noreg, 4, $noreg, 0 :: (store (s32) into %ir.1 + 4)
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 0
  Successors:
    SU(1): Ord  Latency=0 Artificial
  Pressure Diff      :
  Single Issue       : false;
SU(3):   MOV32mi %1:gr32, 1, $noreg, 0, $noreg, 0 :: (store (s32) into %ir.1)
  # preds left       : 0
  # succs left       : 1
  # rdefs left       : 0
  Latency            : 1
  Depth              : 0
  Height             : 0
  Successors:
    SU(1): Ord  Latency=0 Artificial
  Pressure Diff      :
  Single Issue       : false;
ExitSU:   JCC_1 %bb.5, 5, implicit killed $eflags
  # preds left       : 1
  # succs left       : 0
  # weak preds left  : 1
  # rdefs left       : 0
  Latency            : 0
  Depth              : 0
  Height             : 0
  Predecessors:
    SU(1): Ord  Latency=0 Artificial
    SU(1): Ord  Latency=0 Cluster
cperciva commented 3 months ago

Difference visible in RegAllocBase::allocatePhysRegs in case this helps:

# diff -u amd64.out i386.out  
--- amd64.out    2024-07-21 08:07:50.522179000 +0000
+++ i386.out     2024-07-21 08:07:53.584991000 +0000
@@ -25,7 +25,7 @@

 selectOrSplit GR32:%23 [64r,240B:0)[288B,400B:0)[400B,524r:1)[892r,928B:2) 0@64r 1@400B-phi 2@892r  weight:5.352325e-02 w=5.352325e-02

-selectOrSplit GR32:%24 [524r,624B:0)[736B,768r:0)[768r,892r:1) 0@524r 1@768r  weight:7.938738e-02 w=7.938738e-02
+selectOrSplit GR32:%24 [524r,624B:0)[736B,768r:0)[768r,892r:1) 0@524r 1@768r  weight:7.938739e-02 w=7.938739e-02

 selectOrSplit GR32:%18 [80r,240B:0)[288B,400B:0)[400B,624B:2)[736B,848r:2)[848r,928B:1) 0@80r 1@848r 2@400B-phi  weight:5.647660e-02 w=5.647660e-02

@@ -33,12 +33,14 @@

 selectOrSplit GR32:%26 [80r,240B:0)[288B,400B:0)[400B,532r:1)[900r,928B:2) 0@80r 1@400B-phi 2@900r  weight:5.438390e-02 w=5.438390e-02

-selectOrSplit GR32:%27 [532r,624B:0)[736B,848r:0)[848r,900r:1) 0@532r 1@848r  weight:7.938738e-02 w=7.938738e-02
+selectOrSplit GR32:%27 [532r,624B:0)[736B,848r:0)[848r,900r:1) 0@532r 1@848r  weight:7.938739e-02 w=7.938739e-02

-selectOrSplit GR32:%27 [532r,624B:0)[736B,848r:0)[848r,900r:1) 0@532r 1@848r  weight:7.938738e-02 w=7.938738e-02
+selectOrSplit GR32:%21 [520r,624B:0)[736B,888r:0) 0@520r  weight:7.938738e-02 w=7.938738e-02

-selectOrSplit GR32:%28 [532r,624B:0)[736B,840r:0) 0@532r  weight:5.871526e-02 w=5.871526e-02
+selectOrSplit GR32:%21 [520r,624B:0)[736B,888r:0) 0@520r  weight:7.938738e-02 w=7.938738e-02

-selectOrSplit GR32:%29 [840r,848r:0)[848r,900r:1) 0@840r 1@848r  weight:7.427752e-02 w=7.427752e-02
+selectOrSplit GR32:%28 [520r,624B:0)[736B,808r:0) 0@520r  weight:6.075398e-02 w=6.075398e-02

-selectOrSplit GR32:%26 [80r,240B:0)[288B,400B:0)[400B,480r:1)[900r,928B:2) 0@80r 1@400B-phi 2@900r  weight:5.438390e-02 w=5.438390e-02
+selectOrSplit GR32:%29 [808r,888r:0) 0@808r  weight:7.118262e-02 w=7.118262e-02
+
+selectOrSplit GR32:%20 [112r,240B:0)[288B,400B:0)[400B,512r:1)[888r,928B:2) 0@112r 1@400B-phi 2@888r  weight:5.674269e-02 w=5.674269e-02
DimitryAndric commented 3 months ago

Hmm, 7.938738e-02 vs 7.938739e-02 ? Looks like some very slight floating point difference, so maybe this has to do with fpmath=sse versus fpmath=387 precision... :)

DimitryAndric commented 3 months ago

@cperciva I don't know which option you used to generate that output, but if I use -fsave-optimization-record the resulting yaml files also show a difference in floating values:

--- out-amd64.opt.yaml  2024-07-21 12:16:29.629665000 +0200
+++ out-i386.opt.yaml   2024-07-21 12:17:40.275171000 +0200
@@ -19,10 +19,6 @@
 Name:            LoopSpillReloadCopies
 Function:        fn4
 Args:
-  - NumSpills:       '1'
-  - String:          ' spills '
-  - TotalSpillsCost: '8.730159e+00'
-  - String:          ' total spills cost '
   - NumReloads:      '1'
   - String:          ' reloads '
   - TotalReloadsCost: '8.457341e+00'
@@ -36,7 +32,7 @@
 Args:
   - NumSpills:       '1'
   - String:          ' spills '
-  - TotalSpillsCost: '8.730159e+00'
+  - TotalSpillsCost: '5.371094e-01'
   - String:          ' total spills cost '
   - NumReloads:      '1'
   - String:          ' reloads '
@@ -127,6 +123,19 @@
 Function:        fn4
 Args:
   - String:          'BasicBlock: '
+  - BasicBlock:      while.body.g.preheader
+  - String:          "\n"
+  - String:          'movl     '
+  - String:          ': '
+  - INST_movl:       '1'
+  - String:          "\n"
+...
+--- !Analysis
+Pass:            asm-printer
+Name:            InstructionMix
+Function:        fn4
+Args:
+  - String:          'BasicBlock: '
   - BasicBlock:      while.body.g
   - String:          "\n"
   - String:          'pushl    '
@@ -144,10 +153,6 @@
   - String:          j
   - String:          ': '
   - INST_j:          '1'
-  - String:          "\n"
-  - String:          'movl     '
-  - String:          ': '
-  - INST_movl:       '1'
   - String:          "\n"
   - String:          'testl    '
   - String:          ': '
cperciva commented 3 months ago

@cperciva I don't know which option you used to generate that output

Sorry I should have clarified, that's from the line

    LLVM_DEBUG(dbgs() << "\nselectOrSplit "
                      << TRI->getRegClassName(MRI->getRegClass(VirtReg->reg()))
                      << ':' << *VirtReg << " w=" << VirtReg->weight() << '\n');

except that I edited out LLVM_DEBUG. I can't build a debug-enabled LLVM on i386 (the linker runs out of memory) so I resorted to editing the LLVM source code to enable debugging in specific locations.

cperciva commented 3 months ago

Hmm, 7.938738e-02 vs 7.938739e-02 ? Looks like some very slight floating point difference, so maybe this has to do with fpmath=sse versus fpmath=387 precision... :)

That was my hunch even before I saw those numbers, just based on "difference between i386 and amd64 which shows up very very rarely". But the FPU is supposed to be set to double precision (not long double) so this shouldn't be an issue.

cperciva commented 3 months ago

Confirmed, fpgetprec() is returning 2 (double precision) so the x87 should be computing the same results as we get with fpmath=sse.

DimitryAndric commented 3 months ago

I've been debugging this into VirtRegAuxInfo::weightCalcHelper(), where the float returned slightly differs between amd64 and i386. During the whole function, the TotalWeight variable is exactly the same for both architectures, but at the end it returns normalize(TotalWeight, LI.getSize(), NumInstr).

Interestingly normalize just calls normalizeSpillWeight, which looks simple enough:

  /// Normalize the spill weight of a live interval
  ///
  /// The spill weight of a live interval is computed as:
  ///
  ///   (sum(use freq) + sum(def freq)) / (K + size)
  ///
  /// @param UseDefFreq Expected number of executed use and def instructions
  ///                   per function call. Derived from block frequencies.
  /// @param Size       Size of live interval as returnexd by getSize()
  /// @param NumInstr   Number of instructions using this live interval
  static inline float normalizeSpillWeight(float UseDefFreq, unsigned Size,
                                           unsigned NumInstr) {
    // The constant 25 instructions is added to avoid depending too much on
    // accidental SlotIndex gaps for small intervals. The effect is that small
    // intervals have a spill weight that is mostly proportional to the number
    // of uses, while large intervals get a spill weight that is closer to a use
    // density.
    return UseDefFreq / (Size + 25*SlotIndex::InstrDist);
  }

In this case, SlotIndex::InstrDist is 16 for both amd64 and i386, but still the function returns 7.938738e-02 on amd64, as opposed to 7.938739e-02 on i386...

Then I modified the debug prints to use hex floats, and the differences in printed float values disappeared! But also, the difference between the resultin assembly disappeared. :-)

cperciva commented 3 months ago

Ooh, it might be a float-vs-double issue, not a double-vs-extended-double issue. I'm guessing fpmath=sse is using 32-bit registers to operate on floats, but the x87 is going to keep values as doubles.

DimitryAndric commented 3 months ago

Being a bit more careful shows that it's apparently in incoming difference in the UseDefFreq parameter to normalize, for example:

< normalizeSpillWeight: UseDefFreq=5.207812e+01 (0x1.a09ffep+5), ret=7.938738e-02 (0x1.452bb4p-4)
> normalizeSpillWeight: UseDefFreq=5.207812e+01 (0x1.a0ap+5), ret=7.938739e-02 (0x1.452bb6p-4)

The value normally printed by llvm's raw_ostream is limited to 6 characters of decimal precision, but you can see that the hexfloat values are pretty close, though not equal.

DimitryAndric commented 3 months ago

So as soon as I put a few debug prints of the TotalWeight variable in VirtRegAuxInfo::weightCalcHelper(), the amd64/i386 differences disappear. It is definitely something in this particular function that is sensitive to the difference between the architectures.

DimitryAndric commented 3 months ago

In VirtRegAuxInfo::weightCalcHelper it is the loop that starts with:

  for (MachineRegisterInfo::reg_instr_nodbg_iterator
           I = MRI.reg_instr_nodbg_begin(LI.reg()),
           E = MRI.reg_instr_nodbg_end();

which introduces this issue. In that loop there is also the rather dodgy-looking part:

    // Force hweight onto the stack so that x86 doesn't add hidden precision,
    // making the comparison incorrectly pass (i.e., 1 > 1 == true??).
    //
    // FIXME: we probably shouldn't use floats at all.
    volatile float HWeight = Hint[HintReg] += Weight;
    if (HintReg.isVirtual() || MRI.isAllocatable(HintReg))
      CopyHints.insert(CopyHint(HintReg, HWeight));
  }

It looks like a similar hack might be needed for the local Weight and TotalWeight variables... Or indeed, as the comment indicates: probably shouldn't have used floats to begin with :)

DimitryAndric commented 3 months ago

So this tiny diff:

diff --git a/llvm/lib/CodeGen/CalcSpillWeights.cpp b/llvm/lib/CodeGen/CalcSpillWeights.cpp
index 1d767a3484bc..849da3572efd 100644
--- a/llvm/lib/CodeGen/CalcSpillWeights.cpp
+++ b/llvm/lib/CodeGen/CalcSpillWeights.cpp
@@ -257,7 +257,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
       return -1.0f;
     }

-    float Weight = 1.0f;
+    volatile float Weight = 1.0f;
     if (IsSpillable) {
       // Get loop info for mi.
       if (MI->getParent() != MBB) {

makes the difference between amd64-hosted and i386-hosted output go away. I'm both satisfied and appalled. :)

DimitryAndric commented 3 months ago

Or alternatively, which I think is better and gets rid of the volatiles:

diff --git a/llvm/lib/CodeGen/CalcSpillWeights.cpp b/llvm/lib/CodeGen/CalcSpillWeights.cpp
index 1d767a3484bc..f40a1d6b0a62 100644
--- a/llvm/lib/CodeGen/CalcSpillWeights.cpp
+++ b/llvm/lib/CodeGen/CalcSpillWeights.cpp
@@ -163,7 +163,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
   const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
   const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
   MachineBasicBlock *MBB = nullptr;
-  float TotalWeight = 0;
+  double TotalWeight = 0;
   unsigned NumInstr = 0; // Number of instructions using LI
   SmallPtrSet<MachineInstr *, 8> Visited;

@@ -207,8 +207,8 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
   // CopyHint is a sortable hint derived from a COPY instruction.
   struct CopyHint {
     const Register Reg;
-    const float Weight;
-    CopyHint(Register R, float W) : Reg(R), Weight(W) {}
+    const double Weight;
+    CopyHint(Register R, double W) : Reg(R), Weight(W) {}
     bool operator<(const CopyHint &Rhs) const {
       // Always prefer any physreg hint.
       if (Reg.isPhysical() != Rhs.Reg.isPhysical())
@@ -221,7 +221,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,

   bool IsExiting = false;
   std::set<CopyHint> CopyHints;
-  DenseMap<unsigned, float> Hint;
+  DenseMap<unsigned, double> Hint;
   for (MachineRegisterInfo::reg_instr_nodbg_iterator
            I = MRI.reg_instr_nodbg_begin(LI.reg()),
            E = MRI.reg_instr_nodbg_end();
@@ -257,7 +257,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
       return -1.0f;
     }

-    float Weight = 1.0f;
+    double Weight = 1.0;
     if (IsSpillable) {
       // Get loop info for mi.
       if (MI->getParent() != MBB) {
@@ -284,11 +284,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
     Register HintReg = copyHint(MI, LI.reg(), TRI, MRI);
     if (!HintReg)
       continue;
-    // Force hweight onto the stack so that x86 doesn't add hidden precision,
-    // making the comparison incorrectly pass (i.e., 1 > 1 == true??).
-    //
-    // FIXME: we probably shouldn't use floats at all.
-    volatile float HWeight = Hint[HintReg] += Weight;
+    double HWeight = Hint[HintReg] += Weight;
     if (HintReg.isVirtual() || MRI.isAllocatable(HintReg))
       CopyHints.insert(CopyHint(HintReg, HWeight));
   }
@@ -309,7 +305,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
     }

     // Weakly boost the spill weight of hinted registers.
-    TotalWeight *= 1.01F;
+    TotalWeight *= 1.01;
   }

   // If the live interval was already unspillable, leave it that way.
@@ -336,7 +332,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
   // FIXME: this gets much more complicated once we support non-trivial
   // re-materialization.
   if (isRematerializable(LI, LIS, VRM, *MF.getSubtarget().getInstrInfo()))
-    TotalWeight *= 0.5F;
+    TotalWeight *= 0.5;

   if (IsLocalSplitArtifact)
     return normalize(TotalWeight, Start->distance(*End), NumInstr);

Note the volatile qualifier was introduced in 7af3432e22b04 by @dexonsmith, as being a "gross hack" which I agree with. :) Duncan, would you agree that switching this function to doubles is better? At least you can get rid of the volatile, then.

cperciva commented 3 months ago

At least you can get rid of the volatile, then.

Only if the x87 is set to double precision. It is on FreeBSD but at least 20 years ago Linux was setting it to extended double by default.

Not saying we shouldn't make this change, but it would be good to confirm that it doesn't cause problems on other platforms. If @dexonsmith has a test case from 10 years ago, that is...

DimitryAndric commented 3 months ago

Well, I'm only trying a bunch of stopgaps here. The 7af3432e22b04522b63b159f0a07b2e9c4274a4a commit looks like something that was added for Apple, since it has a rdar:// URI, so my guess is that macOS at least behaves similarly in this area to FreeBSD... But indeed, I will do a test run on Linux too.

Note that if this were to become a pull request, I would very likely have to go over changed test cases too, since changing to double might slightly impact existing weights. In that sense, slapping in an additional volatile is likely to have the least impact.

cperciva commented 3 months ago

Sounds good. I'd like to have something in place for FreeBSD -- not necessarily a patch which can be accepted into LLVM but something which fixes the situation for us -- in the next few weeks since this causes problems for FreeBSD release building and we have 13.4 coming up shortly. Do you think that's feasible?

DimitryAndric commented 3 months ago

I just did a full buildworld on both amd64 and i386 hosts, targeting i386, and now the differences between the i386.i386/lib/libarchive/archive_write_set_format_7zip.pico files, and between the i386.i386/usr.bin/clang/lld/ELF/SyntheticSections.o files (as noted in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276961#c3) have disappeared. I.e. these were both due to this register allocator issue, apparently.

That said, there is still another unexplained difference, the ipmon.o case:

--- /jails/containers/bug276961-amd64/usr/obj/usr/src/i386.i386/sbin/ipf/ipmon/ipmon.s 2024-05-07 21:09:50.385539000 +0200
+++ /jails/containers/bug276961-i386/usr/obj/usr/src/i386.i386/sbin/ipf/ipmon/ipmon.s   2024-05-07 21:10:26.083709000 +0200
@@ -19824,6 +19824,8 @@
        calll   free
 .Ltmp1533:
        addl    $4, %esp
+       .loc    1 363 13                        # src/sbin/ipf/ipmon/ipmon.c:363:13
+       movl    $0, protocols
 .Ltmp1534:
 .LBB4_6:                                # %if.end6
        .loc    1 365 23                        # src/sbin/ipf/ipmon/ipmon.c:365:23
@@ -19990,6 +19992,8 @@
        calll   free
 .Ltmp1565:
        addl    $4, %esp
+       .loc    1 387 13                        # src/sbin/ipf/ipmon/ipmon.c:387:13
+       movl    $0, udp_ports
 .Ltmp1566:
 .LBB4_22:                               # %if.end47
        .loc    1 389 23                        # src/sbin/ipf/ipmon/ipmon.c:389:23
0

but that is probably better reduced separately for another bug. It is still there, and seems to be caused by something else.

In any case, I guess slapping in the additional volatile is probably the best approach to import into FreeBSD for now. It seems very safe to me. :)

dexonsmith commented 3 months ago

Note the volatile qualifier was introduced in 7af3432e22b04 by @dexonsmith, as being a "gross hack" which I agree with. :) Duncan, would you agree that switching this function to doubles is better? At least you can get rid of the volatile, then.

Seems plausible to me. Or, could we use APFloat? I wonder if I tried that back then...

DimitryAndric commented 3 months ago

I did a check with llvmorg-19-init-18447-g102d16809b2c on Debian 13 (kernel version 6.9.9), both x86_64 and i386, and there the outcome of VirtRegAuxInfo::weightCalcHelper is the same for both architectures, even without a volatile qualifier. I'll have a look if any of the regression tests fall over due to different results, if I insert doubles instead of floats.

DimitryAndric commented 3 months ago

Ok, I tried the double replacement approach, but that leads to quite a large number of test regressions. So I don't think that is the best way forward. APFloat would be nice, but it is likely to open up yet another can of worms. :)