@llvm/issue-subscribers-backend-x86
Author: Colin Percival (cperciva)
@EugeneZelenko Not sure why you tagged this as "question" -- unless you think "cross-building should produce the same results as native builds" is a question?
Have you verified that this difference is due to the host rather than nondeterminism? (Are the results reliably reproducible?)
Yes, completely deterministic.
FWIW this still reproduces with llvmorg-19-init-18129-gaf5352fe8e66:
--- out-amd64.s 2024-07-19 09:21:52.437611000 +0200
+++ out-i386.s 2024-07-19 09:21:53.107307000 +0200
@@ -24,45 +24,46 @@
.Ltmp0:
addl $_GLOBAL_OFFSET_TABLE_+(.Ltmp0-.L0$pb), %ebx
movl fn4@GOT(%ebx), %edi
- movl (%edi), %edx
+ movl (%edi), %esi
movl c@GOT(%ebx), %eax
- movl (%eax), %esi
- testl %esi, %esi
+ movl (%eax), %eax
+ testl %eax, %eax
jne .LBB0_3
# %bb.1: # %if.then.g
cmpl $0, 1
js .LBB0_2
.LBB0_3: # %if.end3.g
- testl %edx, %edx
- je .LBB0_7
+ testl %esi, %esi
+ je .LBB0_8
+# %bb.4: # %while.body.g.preheader
+ movl %eax, -16(%ebp) # 4-byte Spill
.p2align 4, 0x90
-.LBB0_4: # %while.body.g
+.LBB0_5: # %while.body.g
# =>This Inner Loop Header: Depth=1
- movl %edx, -16(%ebp) # 4-byte Spill
- pushl %edx
- pushl %edi
pushl %esi
+ pushl %edi
+ pushl %eax
calll fn3@PLT
addl $12, %esp
testl %eax, %eax
- js .LBB0_5
-# %bb.6: # %if.end8.g
- # in Loop: Header=BB0_4 Depth=1
+ js .LBB0_6
+# %bb.7: # %if.end8.g
+ # in Loop: Header=BB0_5 Depth=1
addl $4, %edi
- movl $0, 4(%esi)
- movl $0, (%esi)
- movl -16(%ebp), %edx # 4-byte Reload
- incl %edx
- jne .LBB0_4
- jmp .LBB0_7
-.LBB0_5: # %if.then6.g
+ movl -16(%ebp), %eax # 4-byte Reload
+ movl $0, 4(%eax)
+ movl $0, (%eax)
+ incl %esi
+ jne .LBB0_5
+ jmp .LBB0_8
+.LBB0_6: # %if.then6.g
pushl $0
calll fn2@PLT
addl $4, %esp
- jmp .LBB0_7
+ jmp .LBB0_8
.LBB0_2: # %if.then2.g
movl 0, %eax
-.LBB0_7: # %f.exit
+.LBB0_8: # %f.exit
xorl %eax, %eax
addl $4, %esp
popl %esi
I tried running both the x86_64 and i386-hosted compiler binaries with -mllvm -misched-print-dags, but those printed the same output:
SU(0): %6:gr32_nosp = MOVPC32r 0, implicit $esp, implicit $ssp
# preds left : 0
# succs left : 2
# rdefs left : 0
Latency : 1
Depth : 0
Height : 12
Successors:
SU(1): Data Latency=1 Reg=%6
SU(1): Out Latency=0
Pressure Diff : GR16 -2
Single Issue : false;
SU(1): %6:gr32_nosp = ADD32ri %6:gr32_nosp(tied-def 0), target-flags(x86-got-absolute-address) &_GLOBAL_OFFSET_TABLE_, implicit-def dead $eflags
# preds left : 2
# succs left : 3
# rdefs left : 0
Latency : 1
Depth : 1
Height : 11
Predecessors:
SU(0): Data Latency=1 Reg=%6
SU(0): Out Latency=0
Successors:
SU(4): Data Latency=1 Reg=%6
SU(2): Data Latency=1 Reg=%6
SU(6): Out Latency=0
Pressure Diff :
Single Issue : false;
SU(2): %17:gr32 = MOV32rm %6:gr32_nosp, 1, $noreg, target-flags(x86-got) @fn4, $noreg :: (load (s32) from got)
# preds left : 1
# succs left : 1
# rdefs left : 0
Latency : 5
Depth : 2
Height : 9
Predecessors:
SU(1): Data Latency=1 Reg=%6
Successors:
SU(3): Data Latency=5 Reg=%17
Pressure Diff : GR16 -2
Single Issue : false;
SU(3): %18:gr32 = MOV32rm %17:gr32, 1, $noreg, 0, $noreg :: (load (s32) from @fn4)
# preds left : 1
# succs left : 2
# rdefs left : 0
Latency : 5
Depth : 7
Height : 4
Predecessors:
SU(2): Data Latency=5 Reg=%17
Successors:
ExitSU: Ord Latency=4 Artificial
SU(6): Ord Latency=0 Artificial
Pressure Diff : GR16 -2
Single Issue : false;
SU(4): %8:gr32 = MOV32rm %6:gr32_nosp, 1, $noreg, target-flags(x86-got) @c, $noreg :: (load (s32) from got)
# preds left : 1
# succs left : 1
# rdefs left : 0
Latency : 5
Depth : 2
Height : 10
Predecessors:
SU(1): Data Latency=1 Reg=%6
Successors:
SU(5): Data Latency=5 Reg=%8
Pressure Diff : GR16 -2
Single Issue : false;
SU(5): %1:gr32 = MOV32rm %8:gr32, 1, $noreg, 0, $noreg :: (dereferenceable load (s32) from @c)
# preds left : 1
# succs left : 1
# rdefs left : 0
Latency : 5
Depth : 7
Height : 5
Predecessors:
SU(4): Data Latency=5 Reg=%8
Successors:
SU(6): Data Latency=5 Reg=%1
Pressure Diff :
Single Issue : false;
SU(6): TEST32rr %1:gr32, %1:gr32, implicit-def $eflags
# preds left : 3
# succs left : 1
# weak succs left : 1
# rdefs left : 0
Latency : 1
Depth : 12
Height : 0
Predecessors:
SU(5): Data Latency=5 Reg=%1
SU(1): Out Latency=0
SU(3): Ord Latency=0 Artificial
Successors:
ExitSU: Ord Latency=0 Artificial
ExitSU: Ord Latency=0 Cluster
Pressure Diff :
Single Issue : false;
ExitSU: JCC_1 %bb.3, 5, implicit killed $eflags
# preds left : 2
# succs left : 0
# weak preds left : 1
# rdefs left : 0
Latency : 0
Depth : 12
Height : 0
Predecessors:
SU(6): Ord Latency=0 Artificial
SU(3): Ord Latency=4 Artificial
SU(6): Ord Latency=0 Cluster
SU(0): %11:gr32 = COPY killed $eax
# preds left : 0
# succs left : 1
# rdefs left : 0
Latency : 1
Depth : 0
Height : 1
Successors:
SU(1): Data Latency=1 Reg=%11
Single Issue : false;
SU(1): TEST32rr %11:gr32, %11:gr32, implicit-def $eflags
# preds left : 1
# succs left : 1
# weak succs left : 1
# rdefs left : 0
Latency : 1
Depth : 1
Height : 0
Predecessors:
SU(0): Data Latency=1 Reg=%11
Successors:
ExitSU: Ord Latency=0 Artificial
ExitSU: Ord Latency=0 Cluster
Single Issue : false;
ExitSU: JCC_1 %bb.7, 9, implicit killed $eflags
# preds left : 1
# succs left : 0
# weak preds left : 1
# rdefs left : 0
Latency : 0
Depth : 1
Height : 0
Predecessors:
SU(1): Ord Latency=0 Artificial
SU(1): Ord Latency=0 Cluster
SU(0): %17:gr32 = ADD32ri %17:gr32(tied-def 0), 4, implicit-def dead $eflags
# preds left : 0
# succs left : 1
# rdefs left : 0
Latency : 1
Depth : 0
Height : 0
Successors:
SU(1): Out Latency=0
Pressure Diff :
Single Issue : false;
SU(1): %18:gr32 = INC32r %18:gr32(tied-def 0), implicit-def $eflags
# preds left : 3
# succs left : 1
# weak succs left : 1
# rdefs left : 0
Latency : 1
Depth : 0
Height : 0
Predecessors:
SU(0): Out Latency=0
SU(2): Ord Latency=0 Artificial
SU(3): Ord Latency=0 Artificial
Successors:
ExitSU: Ord Latency=0 Artificial
ExitSU: Ord Latency=0 Cluster
Pressure Diff :
Single Issue : false;
SU(2): MOV32mi %1:gr32, 1, $noreg, 4, $noreg, 0 :: (store (s32) into %ir.1 + 4)
# preds left : 0
# succs left : 1
# rdefs left : 0
Latency : 1
Depth : 0
Height : 0
Successors:
SU(1): Ord Latency=0 Artificial
Pressure Diff :
Single Issue : false;
SU(3): MOV32mi %1:gr32, 1, $noreg, 0, $noreg, 0 :: (store (s32) into %ir.1)
# preds left : 0
# succs left : 1
# rdefs left : 0
Latency : 1
Depth : 0
Height : 0
Successors:
SU(1): Ord Latency=0 Artificial
Pressure Diff :
Single Issue : false;
ExitSU: JCC_1 %bb.5, 5, implicit killed $eflags
# preds left : 1
# succs left : 0
# weak preds left : 1
# rdefs left : 0
Latency : 0
Depth : 0
Height : 0
Predecessors:
SU(1): Ord Latency=0 Artificial
SU(1): Ord Latency=0 Cluster
A difference is visible in RegAllocBase::allocatePhysRegs, in case this helps:
# diff -u amd64.out i386.out
--- amd64.out 2024-07-21 08:07:50.522179000 +0000
+++ i386.out 2024-07-21 08:07:53.584991000 +0000
@@ -25,7 +25,7 @@
selectOrSplit GR32:%23 [64r,240B:0)[288B,400B:0)[400B,524r:1)[892r,928B:2) 0@64r 1@400B-phi 2@892r weight:5.352325e-02 w=5.352325e-02
-selectOrSplit GR32:%24 [524r,624B:0)[736B,768r:0)[768r,892r:1) 0@524r 1@768r weight:7.938738e-02 w=7.938738e-02
+selectOrSplit GR32:%24 [524r,624B:0)[736B,768r:0)[768r,892r:1) 0@524r 1@768r weight:7.938739e-02 w=7.938739e-02
selectOrSplit GR32:%18 [80r,240B:0)[288B,400B:0)[400B,624B:2)[736B,848r:2)[848r,928B:1) 0@80r 1@848r 2@400B-phi weight:5.647660e-02 w=5.647660e-02
@@ -33,12 +33,14 @@
selectOrSplit GR32:%26 [80r,240B:0)[288B,400B:0)[400B,532r:1)[900r,928B:2) 0@80r 1@400B-phi 2@900r weight:5.438390e-02 w=5.438390e-02
-selectOrSplit GR32:%27 [532r,624B:0)[736B,848r:0)[848r,900r:1) 0@532r 1@848r weight:7.938738e-02 w=7.938738e-02
+selectOrSplit GR32:%27 [532r,624B:0)[736B,848r:0)[848r,900r:1) 0@532r 1@848r weight:7.938739e-02 w=7.938739e-02
-selectOrSplit GR32:%27 [532r,624B:0)[736B,848r:0)[848r,900r:1) 0@532r 1@848r weight:7.938738e-02 w=7.938738e-02
+selectOrSplit GR32:%21 [520r,624B:0)[736B,888r:0) 0@520r weight:7.938738e-02 w=7.938738e-02
-selectOrSplit GR32:%28 [532r,624B:0)[736B,840r:0) 0@532r weight:5.871526e-02 w=5.871526e-02
+selectOrSplit GR32:%21 [520r,624B:0)[736B,888r:0) 0@520r weight:7.938738e-02 w=7.938738e-02
-selectOrSplit GR32:%29 [840r,848r:0)[848r,900r:1) 0@840r 1@848r weight:7.427752e-02 w=7.427752e-02
+selectOrSplit GR32:%28 [520r,624B:0)[736B,808r:0) 0@520r weight:6.075398e-02 w=6.075398e-02
-selectOrSplit GR32:%26 [80r,240B:0)[288B,400B:0)[400B,480r:1)[900r,928B:2) 0@80r 1@400B-phi 2@900r weight:5.438390e-02 w=5.438390e-02
+selectOrSplit GR32:%29 [808r,888r:0) 0@808r weight:7.118262e-02 w=7.118262e-02
+
+selectOrSplit GR32:%20 [112r,240B:0)[288B,400B:0)[400B,512r:1)[888r,928B:2) 0@112r 1@400B-phi 2@888r weight:5.674269e-02 w=5.674269e-02
Hmm, 7.938738e-02 vs 7.938739e-02? Looks like some very slight floating point difference, so maybe this has to do with fpmath=sse versus fpmath=387 precision... :)
@cperciva I don't know which option you used to generate that output, but if I use -fsave-optimization-record, the resulting yaml files also show a difference in floating point values:
--- out-amd64.opt.yaml 2024-07-21 12:16:29.629665000 +0200
+++ out-i386.opt.yaml 2024-07-21 12:17:40.275171000 +0200
@@ -19,10 +19,6 @@
Name: LoopSpillReloadCopies
Function: fn4
Args:
- - NumSpills: '1'
- - String: ' spills '
- - TotalSpillsCost: '8.730159e+00'
- - String: ' total spills cost '
- NumReloads: '1'
- String: ' reloads '
- TotalReloadsCost: '8.457341e+00'
@@ -36,7 +32,7 @@
Args:
- NumSpills: '1'
- String: ' spills '
- - TotalSpillsCost: '8.730159e+00'
+ - TotalSpillsCost: '5.371094e-01'
- String: ' total spills cost '
- NumReloads: '1'
- String: ' reloads '
@@ -127,6 +123,19 @@
Function: fn4
Args:
- String: 'BasicBlock: '
+ - BasicBlock: while.body.g.preheader
+ - String: "\n"
+ - String: 'movl '
+ - String: ': '
+ - INST_movl: '1'
+ - String: "\n"
+...
+--- !Analysis
+Pass: asm-printer
+Name: InstructionMix
+Function: fn4
+Args:
+ - String: 'BasicBlock: '
- BasicBlock: while.body.g
- String: "\n"
- String: 'pushl '
@@ -144,10 +153,6 @@
- String: j
- String: ': '
- INST_j: '1'
- - String: "\n"
- - String: 'movl '
- - String: ': '
- - INST_movl: '1'
- String: "\n"
- String: 'testl '
- String: ': '
@cperciva I don't know which option you used to generate that output
Sorry, I should have clarified: that's from the line
LLVM_DEBUG(dbgs() << "\nselectOrSplit "
<< TRI->getRegClassName(MRI->getRegClass(VirtReg->reg()))
<< ':' << *VirtReg << " w=" << VirtReg->weight() << '\n');
except that I edited out LLVM_DEBUG. I can't build a debug-enabled LLVM on i386 (the linker runs out of memory), so I resorted to editing the LLVM source code to enable debugging in specific locations.
Hmm, 7.938738e-02 vs 7.938739e-02? Looks like some very slight floating point difference, so maybe this has to do with fpmath=sse versus fpmath=387 precision... :)
That was my hunch even before I saw those numbers, just based on "difference between i386 and amd64 which shows up very very rarely". But the FPU is supposed to be set to double precision (not long double) so this shouldn't be an issue.
Confirmed, fpgetprec() is returning 2 (double precision), so the x87 should be computing the same results as we get with fpmath=sse.
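In case anyone else wants to check their own machine, here is a minimal sketch using FreeBSD's <ieeefp.h> (FreeBSD-specific; on i386/amd64 the fp_prec_t values are FP_PS = 0, FP_PRS = 1, FP_PD = 2, FP_PE = 3):
#include <ieeefp.h> // FreeBSD: fpgetprec() and fp_prec_t
#include <cstdio>
int main() {
  // FP_PD (2) means the x87 rounds results to 53-bit double precision
  printf("x87 precision control: %d\n", (int)fpgetprec());
}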
I've been debugging this down into VirtRegAuxInfo::weightCalcHelper(), where the returned float differs slightly between amd64 and i386. Throughout the whole function, the TotalWeight variable is exactly the same for both architectures, but at the end it returns normalize(TotalWeight, LI.getSize(), NumInstr).
Interestingly, normalize just calls normalizeSpillWeight, which looks simple enough:
/// Normalize the spill weight of a live interval
///
/// The spill weight of a live interval is computed as:
///
/// (sum(use freq) + sum(def freq)) / (K + size)
///
/// @param UseDefFreq Expected number of executed use and def instructions
/// per function call. Derived from block frequencies.
/// @param Size Size of live interval as returned by getSize()
/// @param NumInstr Number of instructions using this live interval
static inline float normalizeSpillWeight(float UseDefFreq, unsigned Size,
unsigned NumInstr) {
// The constant 25 instructions is added to avoid depending too much on
// accidental SlotIndex gaps for small intervals. The effect is that small
// intervals have a spill weight that is mostly proportional to the number
// of uses, while large intervals get a spill weight that is closer to a use
// density.
return UseDefFreq / (Size + 25*SlotIndex::InstrDist);
}
In this case, SlotIndex::InstrDist is 16 for both amd64 and i386, but the function still returns 7.938738e-02 on amd64, as opposed to 7.938739e-02 on i386...
Then I modified the debug prints to use hex floats, and the differences in printed float values disappeared! But also, the difference in the resulting assembly disappeared. :-)
Ooh, it might be a float-vs-double issue, not a double-vs-extended-double issue. I'm guessing fpmath=sse is using 32-bit registers to operate on floats, but the x87 is going to keep values as doubles.
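For illustration, here is a toy example (not from the actual build) of that effect: with the x87 precision control at double, intermediate float arithmetic is effectively carried out in double precision before the final rounding to float, whereas -mfpmath=sse rounds to float at every step, and the two can end up one ULP apart:
#include <cstdio>
int main() {
  // Compile with SSE math (e.g. on x86-64) to see the difference.
  // volatile blocks constant folding, so the additions happen at run time.
  volatile float a = 1.0f, b = 0x1p-24f, c = 0x1p-24f;
  // Rounded to float after each step (SSE-style scalar float math):
  float step = (a + b) + c;                // both additions round back to 1.0f
  // Carried in double precision until the end (x87-style, PC = double):
  float wide = (float)((double)a + b + c); // exact sum 1 + 2^-23 survives
  printf("%a vs %a\n", step, wide);        // 0x1p+0 vs 0x1.000002p+0
}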
Being a bit more careful shows that it's apparently an incoming difference in the UseDefFreq parameter to normalize, for example:
< normalizeSpillWeight: UseDefFreq=5.207812e+01 (0x1.a09ffep+5), ret=7.938738e-02 (0x1.452bb4p-4)
> normalizeSpillWeight: UseDefFreq=5.207812e+01 (0x1.a0ap+5), ret=7.938739e-02 (0x1.452bb6p-4)
The value normally printed by LLVM's raw_ostream is limited to 6 digits of decimal precision, but you can see that the hexfloat values are pretty close, though not equal.
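In fact, decoding those hexfloats shows the two UseDefFreq values are adjacent floats, exactly one ULP apart, and feeding them through the division reproduces both logged results. A quick standalone check (the divisor of 656 assumes Size = 256, a value back-calculated from the printed output rather than taken from the logs):
#include <cmath>
#include <cstdio>
int main() {
  float sseVal = 0x1.a09ffep+5f; // UseDefFreq as logged on amd64 (SSE math)
  float x87Val = 0x1.a0ap+5f;    // UseDefFreq as logged on i386 (x87 math)
  // Adjacent floats, exactly one ULP apart:
  printf("%d\n", std::nextafter(sseVal, x87Val) == x87Val); // prints 1
  // Dividing by Size + 25*InstrDist = 256 + 400 reproduces both results:
  printf("%e %e\n", sseVal / 656, x87Val / 656); // 7.938738e-02 7.938739e-02
}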
So as soon as I put a few debug prints of the TotalWeight variable in VirtRegAuxInfo::weightCalcHelper(), the amd64/i386 differences disappear. It is definitely something in this particular function that is sensitive to the difference between the architectures.
In VirtRegAuxInfo::weightCalcHelper it is the loop that starts with:
for (MachineRegisterInfo::reg_instr_nodbg_iterator
I = MRI.reg_instr_nodbg_begin(LI.reg()),
E = MRI.reg_instr_nodbg_end();
which introduces this issue. In that loop there is also the rather dodgy-looking part:
// Force hweight onto the stack so that x86 doesn't add hidden precision,
// making the comparison incorrectly pass (i.e., 1 > 1 == true??).
//
// FIXME: we probably shouldn't use floats at all.
volatile float HWeight = Hint[HintReg] += Weight;
if (HintReg.isVirtual() || MRI.isAllocatable(HintReg))
CopyHints.insert(CopyHint(HintReg, HWeight));
}
It looks like a similar hack might be needed for the local Weight and TotalWeight variables... Or indeed, as the comment indicates: we probably shouldn't have used floats to begin with :)
So this tiny diff:
diff --git a/llvm/lib/CodeGen/CalcSpillWeights.cpp b/llvm/lib/CodeGen/CalcSpillWeights.cpp
index 1d767a3484bc..849da3572efd 100644
--- a/llvm/lib/CodeGen/CalcSpillWeights.cpp
+++ b/llvm/lib/CodeGen/CalcSpillWeights.cpp
@@ -257,7 +257,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
return -1.0f;
}
- float Weight = 1.0f;
+ volatile float Weight = 1.0f;
if (IsSpillable) {
// Get loop info for mi.
if (MI->getParent() != MBB) {
makes the difference between amd64-hosted and i386-hosted output go away. I'm both satisfied and appalled. :)
Or alternatively, which I think is better and gets rid of the volatiles:
diff --git a/llvm/lib/CodeGen/CalcSpillWeights.cpp b/llvm/lib/CodeGen/CalcSpillWeights.cpp
index 1d767a3484bc..f40a1d6b0a62 100644
--- a/llvm/lib/CodeGen/CalcSpillWeights.cpp
+++ b/llvm/lib/CodeGen/CalcSpillWeights.cpp
@@ -163,7 +163,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
MachineBasicBlock *MBB = nullptr;
- float TotalWeight = 0;
+ double TotalWeight = 0;
unsigned NumInstr = 0; // Number of instructions using LI
SmallPtrSet<MachineInstr *, 8> Visited;
@@ -207,8 +207,8 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
// CopyHint is a sortable hint derived from a COPY instruction.
struct CopyHint {
const Register Reg;
- const float Weight;
- CopyHint(Register R, float W) : Reg(R), Weight(W) {}
+ const double Weight;
+ CopyHint(Register R, double W) : Reg(R), Weight(W) {}
bool operator<(const CopyHint &Rhs) const {
// Always prefer any physreg hint.
if (Reg.isPhysical() != Rhs.Reg.isPhysical())
@@ -221,7 +221,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
bool IsExiting = false;
std::set<CopyHint> CopyHints;
- DenseMap<unsigned, float> Hint;
+ DenseMap<unsigned, double> Hint;
for (MachineRegisterInfo::reg_instr_nodbg_iterator
I = MRI.reg_instr_nodbg_begin(LI.reg()),
E = MRI.reg_instr_nodbg_end();
@@ -257,7 +257,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
return -1.0f;
}
- float Weight = 1.0f;
+ double Weight = 1.0;
if (IsSpillable) {
// Get loop info for mi.
if (MI->getParent() != MBB) {
@@ -284,11 +284,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
Register HintReg = copyHint(MI, LI.reg(), TRI, MRI);
if (!HintReg)
continue;
- // Force hweight onto the stack so that x86 doesn't add hidden precision,
- // making the comparison incorrectly pass (i.e., 1 > 1 == true??).
- //
- // FIXME: we probably shouldn't use floats at all.
- volatile float HWeight = Hint[HintReg] += Weight;
+ double HWeight = Hint[HintReg] += Weight;
if (HintReg.isVirtual() || MRI.isAllocatable(HintReg))
CopyHints.insert(CopyHint(HintReg, HWeight));
}
@@ -309,7 +305,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
}
// Weakly boost the spill weight of hinted registers.
- TotalWeight *= 1.01F;
+ TotalWeight *= 1.01;
}
// If the live interval was already unspillable, leave it that way.
@@ -336,7 +332,7 @@ float VirtRegAuxInfo::weightCalcHelper(LiveInterval &LI, SlotIndex *Start,
// FIXME: this gets much more complicated once we support non-trivial
// re-materialization.
if (isRematerializable(LI, LIS, VRM, *MF.getSubtarget().getInstrInfo()))
- TotalWeight *= 0.5F;
+ TotalWeight *= 0.5;
if (IsLocalSplitArtifact)
return normalize(TotalWeight, Start->distance(*End), NumInstr);
Note the volatile qualifier was introduced in 7af3432e22b04 by @dexonsmith, as being a "gross hack", which I agree with. :) Duncan, would you agree that switching this function to doubles is better? At least you can get rid of the volatile, then.
At least you can get rid of the volatile, then.
Only if the x87 is set to double precision. It is on FreeBSD but at least 20 years ago Linux was setting it to extended double by default.
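(For whoever runs the Linux test: as far as I know glibc has no fpgetprec(), but the precision-control field of the x87 control word can be read via <fpu_control.h>. A rough, glibc/x86-specific sketch:)
#include <fpu_control.h> // glibc: _FPU_GETCW and fpu_control_t
#include <cstdio>
int main() {
  fpu_control_t cw;
  _FPU_GETCW(cw);
  // Precision control is bits 8-9 of the x87 control word:
  // 0 = single (24 bits), 2 = double (53 bits), 3 = extended (64 bits).
  printf("PC field = %u\n", (unsigned)((cw >> 8) & 3));
}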
Not saying we shouldn't make this change, but it would be good to confirm that it doesn't cause problems on other platforms. If @dexonsmith has a test case from 10 years ago, that is...
Well, I'm only trying a bunch of stopgaps here. The 7af3432e22b04522b63b159f0a07b2e9c4274a4a commit looks like something that was added for Apple, since it has a rdar:// URI, so my guess is that macOS at least behaves similarly in this area to FreeBSD... But indeed, I will do a test run on Linux too.
Note that if this were to become a pull request, I would very likely have to go over changed test cases too, since changing to double might slightly impact existing weights. In that sense, slapping in an additional volatile is likely to have the least impact.
Sounds good. I'd like to have something in place for FreeBSD -- not necessarily a patch which can be accepted into LLVM but something which fixes the situation for us -- in the next few weeks since this causes problems for FreeBSD release building and we have 13.4 coming up shortly. Do you think that's feasible?
I just did a full buildworld on both amd64 and i386 hosts, targeting i386, and now the differences between the i386.i386/lib/libarchive/archive_write_set_format_7zip.pico files, and between the i386.i386/usr.bin/clang/lld/ELF/SyntheticSections.o files (as noted in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276961#c3), have disappeared. I.e. these were both due to this register allocator issue, apparently.
That said, there is still another unexplained difference, the ipmon.o case:
--- /jails/containers/bug276961-amd64/usr/obj/usr/src/i386.i386/sbin/ipf/ipmon/ipmon.s 2024-05-07 21:09:50.385539000 +0200
+++ /jails/containers/bug276961-i386/usr/obj/usr/src/i386.i386/sbin/ipf/ipmon/ipmon.s 2024-05-07 21:10:26.083709000 +0200
@@ -19824,6 +19824,8 @@
calll free
.Ltmp1533:
addl $4, %esp
+ .loc 1 363 13 # src/sbin/ipf/ipmon/ipmon.c:363:13
+ movl $0, protocols
.Ltmp1534:
.LBB4_6: # %if.end6
.loc 1 365 23 # src/sbin/ipf/ipmon/ipmon.c:365:23
@@ -19990,6 +19992,8 @@
calll free
.Ltmp1565:
addl $4, %esp
+ .loc 1 387 13 # src/sbin/ipf/ipmon/ipmon.c:387:13
+ movl $0, udp_ports
.Ltmp1566:
.LBB4_22: # %if.end47
.loc 1 389 23 # src/sbin/ipf/ipmon/ipmon.c:389:23
but that is probably better reduced separately for another bug. It is still there, and seems to be caused by something else.
In any case, I guess slapping in the additional volatile is probably the best approach to import into FreeBSD for now. It seems very safe to me. :)
Note the volatile qualifier was introduced in 7af3432e22b04 by @dexonsmith, as being a "gross hack", which I agree with. :) Duncan, would you agree that switching this function to doubles is better? At least you can get rid of the volatile, then.
Seems plausible to me. Or, could we use APFloat? I wonder if I tried that back then...
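(For concreteness, an APFloat-based accumulation might look roughly like the sketch below; accumulateWeights is just a hypothetical stand-in, untested, using the API from llvm/ADT/APFloat.h:)
#include "llvm/ADT/APFloat.h"
// Sketch: accumulate weights in software floating point, so the result
// cannot depend on the host FPU. An untested idea, not a patch.
float accumulateWeights(const float *Weights, unsigned N) {
  llvm::APFloat Total = llvm::APFloat::getZero(llvm::APFloat::IEEEsingle());
  for (unsigned I = 0; I != N; ++I)
    Total.add(llvm::APFloat(Weights[I]), llvm::APFloat::rmNearestTiesToEven);
  return Total.convertToFloat();
}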
I did a check with llvmorg-19-init-18447-g102d16809b2c on Debian 13 (kernel version 6.9.9), both x86_64 and i386, and there the outcome of VirtRegAuxInfo::weightCalcHelper is the same for both architectures, even without a volatile qualifier. I'll check whether any of the regression tests fall over due to different results if I insert doubles instead of floats.
Ok, I tried the double replacement approach, but that leads to quite a large number of test regressions. So I don't think that is the best way forward. APFloat would be nice, but it is likely to open up yet another can of worms. :)
Original issue description:
On FreeBSD we see a couple of binaries which build differently for the i386 target depending on whether they are built on an amd64 host or on an "i386" host (in fact, i386 binaries running on an amd64 host).
I have narrowed it down to the Machine Instruction Scheduler step (running with -mllvm -enable-misched=0 prevents the divergence between i386 and amd64 hosts); looking at the produced assembly code, it looks like different choices are made (with very low probability -- a few times out of the entire OS build) in the register allocator. In my reproducer:
Test case: clang -fpic -O2 -S reduced.ll with the reduced.ll.txt file produces i386-on-i386.asm.txt on an i386 host and i386-on-amd64.asm.txt on an amd64 host.
Clang/LLVM versions:
The test case was reduced from contrib/libarchive/libarchive/archive_write_set_format_7zip.c (FreeBSD's included copy of libarchive).