llvm / llvm-project

PeepholeOptimizerPass loses debug info #105881

Open khuey opened 2 months ago

khuey commented 2 months ago

I've written a lengthy explanation in rust-lang/rust#129434 of one case where this happens, but the short version is that the peephole pass moves operands around and doesn't try to preserve their debug information. If one of those operands appears in a debug-instr-ref, the value will be lost.

Fixing this appears complicated. I think we want something like the diff at the bottom, but the part that's unclear to me is how to represent the replacement operand. It seems like perhaps we need an equivalent to DebugOperandMemNumber (or to extend DebugOperandMemNumber itself beyond stack spills) for loads.

Any thoughts @jmorse? What InstrRef is doing is pretty complicated so perhaps I'm missing a simpler solution.

```diff
diff --git a/llvm/lib/Target/X86/X86InstrInfo.cpp b/llvm/lib/Target/X86/X86InstrInfo.cpp
index 39ba7ea77790..151bc67f85b3 100644
--- a/llvm/lib/Target/X86/X86InstrInfo.cpp
+++ b/llvm/lib/Target/X86/X86InstrInfo.cpp
@@ -7392,17 +7392,18 @@ unsigned X86InstrInfo::commuteOperandsForFold(MachineInstr &MI,
 static void printFailMsgforFold(const MachineInstr &MI, unsigned Idx) {
   if (PrintFailedFusing && !MI.isCopy())
     dbgs() << "We failed to fuse operand " << Idx << " in " << MI;
 }

 MachineInstr *X86InstrInfo::foldMemoryOperandImpl(
     MachineFunction &MF, MachineInstr &MI, unsigned OpNum,
     ArrayRef<MachineOperand> MOs, MachineBasicBlock::iterator InsertPt,
-    unsigned Size, Align Alignment, bool AllowCommute) const {
+    unsigned Size, Align Alignment, bool AllowCommute,
+    std::optional<MachineFunction::DebugInstrOperandPair> InstrNum) const {
   bool isSlowTwoMemOps = Subtarget.slowTwoMemOps();
   unsigned Opc = MI.getOpcode();

   // For CPUs that favor the register form of a call or push,
   // do not fold loads into calls or pushes, unless optimizing for size
   // aggressively.
   if (isSlowTwoMemOps && !MF.getFunction().hasMinSize() &&
       (Opc == X86::CALL32r || Opc == X86::CALL64r || Opc == X86::PUSH16r ||
@@ -7494,16 +7495,19 @@ MachineInstr *X86InstrInfo::foldMemoryOperandImpl(
       // value and zero-extend the top bits. Change the destination register
       // to a 32-bit one.
       Register DstReg = NewMI->getOperand(0).getReg();
       if (DstReg.isPhysical())
         NewMI->getOperand(0).setReg(RI.getSubReg(DstReg, X86::sub_32bit));
       else
         NewMI->getOperand(0).setSubReg(X86::sub_32bit);
     }
+    if (InstrNum) {
+      MF.makeDebugValueSubstitution(*InstrNum, {NewMI->getDebugInstrNum(), /* XXX what goes here?! */});
+    }
     return NewMI;
   }

   if (AllowCommute) {
     // If the instruction and target operand are commutable, commute the
     // instruction and try again.
     unsigned CommuteOpIdx2 = commuteOperandsForFold(MI, OpNum);
     if (CommuteOpIdx2 == OpNum) {
@@ -8281,18 +8285,23 @@ MachineInstr *X86InstrInfo::foldMemoryOperandImpl(
       return nullptr;

     // Folding a normal load. Just copy the load's address operands.
     MOs.append(LoadMI.operands_begin() + NumOps - X86::AddrNumOperands,
                LoadMI.operands_begin() + NumOps);
     break;
   }
   }
+  std::optional<MachineFunction::DebugInstrOperandPair> InstrNum;
+  if (unsigned Num = LoadMI.peekDebugInstrNum()) {
+    InstrNum = {Num, 0};
+  }
   return foldMemoryOperandImpl(MF, MI, Ops[0], MOs, InsertPt,
-                               /*Size=*/0, Alignment, /*AllowCommute=*/true);
+                               /*Size=*/0, Alignment, /*AllowCommute=*/true,
+                               InstrNum);
 }

 MachineInstr *
 X86InstrInfo::foldMemoryBroadcast(MachineFunction &MF, MachineInstr &MI,
                                   unsigned OpNum, ArrayRef<MachineOperand> MOs,
                                   MachineBasicBlock::iterator InsertPt,
                                   unsigned BitsSize, bool AllowCommute) const {
```
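For reference, the existing stack-spill handling is roughly the pattern I'd want to extend: when a spill is fused into an instruction, the old (instruction number, operand) pair is redirected at the new instruction's memory operand via MachineFunction::DebugOperandMemNumber. A paraphrased sketch of that pattern (not the exact in-tree code; MI, MF, and FoldedMI here are placeholders for the pre-fold instruction, the machine function, and the fold result):

```cpp
// Paraphrased sketch of the existing spill-fold pattern. If the instruction
// being folded away carried a debug-instr-number, record a substitution so
// that any DBG_INSTR_REF pointing at its old (instr, operand) pair can later
// be resolved against the fused instruction's memory operand, i.e. a stack
// slot, by LiveDebugValues.
if (unsigned OldInstrNum = MI.peekDebugInstrNum()) {
  MF.makeDebugValueSubstitution(
      {OldInstrNum, /*OperandIdx=*/0},
      {FoldedMI->getDebugInstrNum(),
       MachineFunction::DebugOperandMemNumber});
}
```

The missing piece is an equivalent to that reserved operand number for a fused load, which is what the XXX in the diff is asking about.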
jmorse commented 2 months ago

Firstly, thanks for digging so deeply into this, you've really served the problem up in a completely-analyzed way, which is super-appreciated. For the record, we're happy to lend a hand studying any variable-location weirdnesses that crop up; it's very much our area of focus (with @SLTozer @OCHyams @CarlosAlbertoEnciso).

I think I tried something vaguely similar to the diff you propose a few years ago, but ran into some code that speculatively tries to fold things and can then back them out later, and there was a risk of it becoming a mess. Given that instruction-referencing has settled in, it should be possible to make some adjustments and do it better.

However, I'd like to know more about the exact use case -- if you can produce the LLVM MIR code for the function, before LiveDebugValues runs, that'd be fantastic. Seeing how the subtract / compare instruction doesn't define the 8 bytes you're concerned with, technically the peephole optimisation shouldn't affect it. We should instead have an operand / value definition somewhere else that gets spilled to the stack, and the only difference from the peephole is that it isn't loaded to a register any more. LiveDebugValues Should (TM) (R) be able to produce a location on the stack instead, and it could be that something is going wrong there.

khuey commented 2 months ago

I think this is the MIR you're asking for: mir.txt

"s" is the variable of interest here. The CMP64mi32 that loads from %ir.11 is the fused instruction out of PeepholeOptimizerPass.

jmorse commented 2 months ago

This is especially interesting because it's clearly a stack access happening, but the memory-operand is a Value, %ir.11. IIRC, this means there's some memory allocated on the stack where the address has been leaked, thus the loads and stores are potentially aliased by other pointers.

This becomes tricky, because LLVM tries to avoid producing variable locations that reference the heap, or that can be aliased. It's undesirable for developers to have a local variable which changes value after a store through a pointer, as that doesn't reflect the program's original semantics. IIRC, in this specific circumstance (the peephole optimiser, with a memory-operand of %ir.11) we can't easily determine that the source address is on the stack, and that would be the natural place to discard the variable location as undesirable/too hard to recover.

Instead, I think the solution should come at a higher level. @OCHyams produced an alternative way of describing stack-variables that have their addresses leaked ("assignment tracking") that collates the alloca storing the variable with the stores to it, and tries to pick whether a memory or register location should be used at different parts of the function. Would it be acceptable in this circumstance to emit a stack location of $rsp+0x30? That would contain all the variable fragments being loaded to registers, if I understand correctly.

Would you be able to upload the IR as first seen by LLVM? It might be that we just need to extend assignment-tracking a little, or perhaps there's a bug in it. (I've zero familiarity with rust's debug-info, and "rustc-1.76" on Ubuntu doesn't seem to reproduce it.)

khuey commented 2 months ago

I think this is the IR you're asking for, though I don't know how useful it will be: initial-ir.txt. Several levels of inlining happened to get to the final result.

jmorse commented 2 months ago

Hmmmm. I fed the IR into clang -O2, and the two most interesting passes are:

BISECT: running pass (1475) InlinerPass on (_ZN4core4iter6traits8iterator8Iterator3any5check28_$u7b$$u7b$closure$u7d$$u7d$17hfdc3c73c33063221E)
BISECT: running pass (1478) SROAPass on _ZN4core4iter6traits8iterator8Iterator3any5check28_$u7b$$u7b$closure$u7d$$u7d$17hfdc3c73c33063221E

For the inlining part, a #dbg_declare for the "s" variable gets inlined into the function named there. There's a 24 byte alloca named %_6 that it points at. We'd normally expect this to be the point where the #dbg_declare gets lowered to a bunch of #dbg_values, because the alloca address isn't leaked and we can see all loads and stores. SROA does exactly that.

However at the same time, SROA elides that alloca away because it's filled by a memcpy from the function argument %x, the signature being

define hidden noundef zeroext i1 @"_ZN4core4iter6traits8iterator8Iterator3any5check28_$u7b$$u7b$closure$u7d$$u7d$17hfdc3c73c33063221E"(ptr noalias nocapture noundef nonnull readnone align 1 %_1, ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %x) unnamed_addr #4 personality ptr @rust_eh_personality !dbg !1597 

Thus: I think what's happening here is that at one point SROA is behaving like it's optimising a normal local variable during debug-info handling, but then at another point it replaces the storage with the argument pointer, which explains why assignment-tracking can't deal with it and prematurely relaxes some information. Additionally, I think you only get assignment-tracking in clang nowadays, rust would need to enable that manually I suppose.

@OCHyams does that make any sense to you, and is there any additional strangeness you can see? If we got this working, I expect we'd also have to deal with allocas potentially having multiple variables names bound to them, something we've steered clear of in the past.

I'm attaching the IR from before SROA, with all other functions stripped out; it's %_6 that's the alloca in question, and the declare is #dbg_declare(ptr %_6, !1673, !DIExpression(), !1677). The variable is !1673.

jryans commented 2 months ago

Additionally, I think you only get assignment-tracking in clang nowadays, rust would need to enable that manually I suppose.

I forget where this may have been discussed before, apologies for potential repetition... Why is assignment tracking a Clang-level feature? Should we open an issue to move the pass into the default LLVM pipeline so that it's available to all downstream LLVM users without extra steps?

jmorse commented 2 months ago

I forget where this may have been discussed before, apologies for potential repetition... Why is assignment tracking a Clang-level feature? Should we open an issue to move the pass into the default LLVM pipeline so that it's available to all downstream LLVM users without extra steps?

Recalling when it was originally deployed, there were good coverage benefits. Unfortunately LTO got measurably slower, and to be conservative we only enabled AT on pipelines where we knew it was efficient. So, there's no technical reason why it couldn't be enabled in the default pipeline, aside from the efficiency matters and test coverage. Tickets / issues to complete that would make sense.

OCHyams commented 2 months ago

Thus: I think what's happening here is that at one point SROA is behaving like it's optimising a normal local variable during debug-info handling, but then at another point it replaces the storage with the argument pointer. Which explains why assignment-tracking can't deal with it and prematurely relaxes some information. Additionally, I think you only get assignment-tracking in clang nowadays, rust would need to enable that manually I suppose.

@OCHyams does that make any sense to you, and is there any additional strangeness you can see?

Mechanically, the reason the variable "s" isn't tracked using assignment tracking when you feed the IR into Clang is that "s"'s storage is not local to the function (we get a pointer passed in) - assignment tracking needs extra work to handle that.

If we got this working, I expect we'd also have to deal with allocas potentially having multiple variables names bound to them, something we've steered clear of in the past.

Good point. It's not quite the same thing, but this definitely happens for structured bindings in C++, albeit with each constituent variable set at a different offset into the alloca (though of course the debug info for those was broken for a long time!), so there's some vague precedent, but the variables there don't overlap (I think we used to emit an artificial aggregate variable for the whole alloca too, which did overlap the parts, but I don't know how well that was handled through LLVM). I can't think of a fundamental reason why this wouldn't work, but thinking about it, it possibly breaks some assumptions in the various remove-redundant-debuginfo utilities. The more I think about it the more I worry about it... I don't think there's any fundamental problem, it "just" may require some implementation changes to various passes/utilities?

khuey commented 2 months ago

Perhaps I'm missing something, but "extend a different pass that isn't even used outside of clang and that might cause performance regressions" isn't the most satisfying answer. Is there a reason we can't fix PeepholeOptimizerPass?

The two fused instructions are:

  %30:gr64 = MOV64rm %stack.0, 1, $noreg, 16, $noreg, debug-instr-number 2, debug-location !928 :: (dereferenceable load (s64) from %ir.11, !noalias !922)
...
  %32:gr64 = SUB64ri32 %30:gr64(tied-def 0), 13, implicit-def $eflags, debug-location !1019

And they're fused to

  CMP64mi32 %stack.0, 1, $noreg, 16, $noreg, 13, implicit-def $eflags, debug-location !1019 :: (dereferenceable load (s64) from %ir.11, !noalias !922)

Conceptually it seems straightforward to do an operand substitution, no?
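That is, roughly this at the fold site (a sketch only; FusedMI stands for the new CMP64mi32, and the second element of the replacement pair, i.e. how to name the fused memory operand, is exactly the part that's unclear to me):

```cpp
// Hypothetical sketch: the MOV64rm above carries debug-instr-number 2 and
// defines the tracked value in its operand 0. After fusing it into the
// CMP64mi32, the substitution should say "that value is now the new
// instruction's memory operand", assuming there is some way to name that
// operand (e.g. reusing or generalising MachineFunction::DebugOperandMemNumber).
MF.makeDebugValueSubstitution(
    {/*OldInstrNum=*/2, /*OldOperandIdx=*/0},
    {FusedMI->getDebugInstrNum(),
     MachineFunction::DebugOperandMemNumber /* or a load-specific equivalent */});
```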

jmorse commented 2 months ago

Short summary, it'll fix some functions but not the reproducer you have.

At a mechanical level, at this point in compilation we've already thrown away the most useful information. The machine-memory-operand starts with %ir, indicating it's connected to an LLVM-IR Value instead of a PseudoSourceValue object. It's thus moderately more complicated to recover the stack location, or even to know that the location is on the stack, without deeper wrestling with the target-machine APIs.

That's technically achievable; however, the bigger issue is that the load originating from an LLVM-IR Value indicates that not all stores to that memory are visible in the function, due to the pointer to the memory being leaked. The final part of instruction-referencing in LiveDebugValues will try to compute what value is in what stack slot at different parts of the function, but it'll be incorrect some of the time if aliasing stores occur or other functions store to that memory. We prioritise dropping false information over presenting possibly-true information, as any false information presented to developers damages their confidence; hence LiveDebugValues doesn't even try for memory with leaked pointers.

There are many different views on what to do with leaked pointers (many thrashed out in https://bugs.llvm.org/show_bug.cgi?id=34136 ), and the solution we came up with was assignment-tracking, which is able to make better choices about whether a variable should live on the stack or in a register at various points. It can maintain information about both through optimisation, where other LLVM facilities pick either always-stack or always-registers.

The barriers to using that are the ticket that JRyans has opened, and the matter of SROA behaving oddly, which might be just a simple bug. IIRC we focused on local variables that were leaked and then potentially un-leaked after inlining, whereas if I understand the reproducer correctly, it's the opposite: a declared variable only gets connected to an LLVM alloca after inlining pulls it into a caller function.

khuey commented 2 months ago

Thanks, that's helpful.