[DebugInfo] Malformed DWARF produced during LTO build

jmorse commented 3 years ago


Bugzilla Link	48790
Resolution	FIXED
Resolved on	Aug 23, 2021 14:00
Version	trunk
OS	Linux
Blocks	llvm/llvm-project#48661
Attachments	LLVM-IR to reproduce when fed to llc
CC	@adrian-prantl,@dwblaikie,@gregbedwell,@JDevlieghere,@kyulee-com,@modocache,@pogo59,@pogo59,@tstellar,@vedantk,@wjristow
Fixed by commit(s)	ef0dcb506300dc9644e8000c6028d14214be9d97

Extended Description

The reproducer below causes LLVM master (93592b726c7) to produce malformed DWARF when built with "-O3 -g -flto". Running llvm-dwarfdump -verify complains "invalid DIE reference", and our debugger becomes displeased. Reverting e08f205f / D70350, "Allow cross-CU references of subprogram definitions", fixes it for us.

Sources:

$ cat -n 1.cpp 1 struct HHH; 2 3 struct xxx { 4 HHH yyy(); 5 } zzz; 6

$ cat -n 2.cpp 1 int ggg; 2 3 struct HHH { 4 template int ccc(bbb) { 5 while (--ggg) 6 ; 7 } 8 }; 9 10 void eee() { 11 HHH fff; 12 fff.ccc([] {}); 13 } 14

Built thus (the target is important it seems): $ clang++ -O3 -g -flto -w -c 1.cpp -o 1.o -target x86_64-scei-ps4 $ clang++ -O3 -g -flto -w -c 2.cpp -o 2.o -target x86_64-scei-ps4 $ ld.lld 1.o 2.o -o 3.out -r $ llvm-dwarfdump -verify 3.out

Produces: --------8<-------- error: invalid DIE reference 0x00000046. Offset is in between DIEs:

0x000000a1: DW_TAG_inlined_subroutine DW_AT_abstract_origin (0x00000046) DW_AT_low_pc (0x0000000000000000) DW_AT_high_pc (0x0000000000000002) DW_AT_call_file ("1.cpp") DW_AT_call_line (12) -------->8--------

I'm attaching an IR reproducer, derived from "ld.lld -save-temps" and the combined-and-optimized IR it produces. Feeding that IR into llc also produces an invalid DIE reference.

I've done some prodding, and while to me the DWARF emission code is a twisty turny maze of passages all alike, I believe the problem is that the DIE for the "eee" function is placed in the wrong compile unit. Here's a chain of events:

The DIE for "HHH" is placed in the CU for 1.cpp, because the "zzz" global var is emitted there.
The DIE for "ccc" is created in the context of "HHH", CU is 1.cpp
The template argument for "ccc", a lambda type, is created
The scope for that template argument is needed: so a DIE "eee" is created.
That DIE for "eee" is in the "current" unit, 1.cpp

Normally this wouldn't be a problem. I believe that at the end of DwarfDebug::endFunctionImpl, normally a duplicate DIE would be created for "eee", because a cross-CU subprogram wouldn't be permitted. However, with e08f205f a cross-CU subprogram is permitted.

All the child scopes of "eee" are created before "eee" is created, they expect to be in the CU that the DISubprogram for "eee" specifies, 2.cpp, and they pick references of form "DW_FORM_ref4". They're then attached to a subprogram DIE in a different CU, breaking those forms of references. If you use llvm-dwarfdump -v, you can see that the offset of the DW_AT_abstract_origin above would be correct, if it was in the correct CU (hat-tip James Henderson).

I don't really have a proposal for how to fix this right now, DWARF emission is way out of my comfort zone.

tstellar commented 2 years ago

mentioned in issue llvm/llvm-project#48661

jmorse commented 3 years ago

Ultimately this was reverted in d4ce9e463d51 and cherry-picked into llvm-13 via bug 51426.

jmorse commented 3 years ago

Uploaded a revert-and-regression-test patch at https://reviews.llvm.org/D107076

jmorse commented 3 years ago

I suspect D94976 isn't going anywhere soon -- there seems to be a significant amount of context encoded in "the current unit" that I don't fully understand yet.

FTR, we've reverted D70350 downstream as it affects a codebase that's important to us. I'd suggest that we revert in llvm main, seeing how it's affecting more than one person/org, and broken DWARF is more of a blocker than reduced debug experience (alas).

vedantk commented 3 years ago

To respond to Jeremy's question (from comment #16): reverting D70350 might regress the debugging experience for some of our low-level projects that build with LTO, but it's not something to rule out, esp. if D94976 won't land any time soon (and, of course, given that malformed DWARF is no good).

kyulee-com commented 3 years ago

This bug still repros in the latest upstream compiler We've seen this case (the same verification error) frequently in LTO build for large applications with XCode. I don't have a cutdown but backing out e08f205f / D70350 seems to fix the issue for us. I wonder whether there is a follow-up on this bug?

tstellar commented 3 years ago

This bug was not resolved in time for the 12.0.0 release, so it will have to wait for 12.0.1.

If you feel this is a high-priority bug that should be fixed for 12.0.0, please re-add release-12.0.0 to the Blocks field and add a comment explaining why this is high-priority.

jmorse commented 3 years ago

Hi Tom,

What's the status of this bug? Have the problems with the fix been resolved?

They have not, and it seems unlikely that it'll be sufficiently addressed and tested in time for 12.0.

Technically this isn't a regression -- the behaviour behind this bug was present in llvm-11, and it appears to trigger rarely. We only ran into it on a very large C++ code base when using LTO. It can likely be left until 12.0.1, David agrees in comment 13

An alternative is to revert e08f205f, which unmasked this problem, and is what we're doing downstream. I believe that would decrease the quality of DWARF5 call site information, @aprantl @vsk would you be able to give an opinion on how you'd weight reverting versus sticking with the current behaviour?

tstellar commented 3 years ago

What's the status of this bug? Have the problems with the fix been resolved?

dwblaikie commented 3 years ago

Oh, I shuold also say it'd be complicated to emit a whole CU in a standalone way - in general CU IR Metadata doesn't contain a comprehensive list of its contents. In the interests of reducing/eliding dead DWARF, most of the entities refer to the CU as their context, but without the CU listing/knowing about that item.

eg: a simple inlining. CU1 and CU2 metadata exist but refer to no entities. SP2 refers to CU2 as its unit. SP1 refers to CP1 as its unit. There's an llvm::Function func2. func2's subprogram field refers to SP2. some instructions in func2 have a debug location that refers to SP1

So the only way SP1 is reachable is through func2

dwblaikie commented 3 years ago

I'm getting the impression that cross-CU DIE emission is being done in an ad-hoc on-demand way, and keeping the contexts straight runs into all sorts of corner cases. Would we be better off reorganizing it so we emit one CU at a time,

Practically, we can't - CUs are built as functions are code generated - and not all the functions for a given CU are grouped together. I guess we could change that, but it'd be a significant redesign of DWARF building code & I'm not sure it'd significantly improve this issue.

and keep some sort of side/work list of 'stuff some other CU needs'? This would require deferring filling in the actual value of the reference because we don't know the section offset yet;

Perhaps an aside, but FWIW, that part of the architecture already exists - we don't compute the DIE offsets immediately (because we go back and add more attributes to DIEs later on, for instance - so we can't know offsets) - so when we created a cross-DIE reference, one DIE's attribute refers/points to the other DIE then later on when they're emitted it's two-pass, one commits offsets (says "we're not changing anything, so figure out how big everything is/what the offsets are) and then emitting the bytes where the attribute emission can query the final offset of the DIE being referred to and use that.

either as something to be backpatched when we get around to emitting the other DIE, or by using relocations and letting the linker deal with it.

There are some other benefits if we were to move to using relocations for a lot more DWARF output, including for DIE offsets. (would make for more editable/legible DWARF assembly) but probably not much difference to this situation.

Sitrep: D94976 continues to run into trouble. There's an increased risk that this won't make llvm-12. Technically this isn't a regression because the buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal, of course.

Yep, I think it'd be best to not try to push this into LLVM 12 - as we've seen, there's a lot of nuance in this sort of code.

pogo59 commented 3 years ago

I'm getting the impression that cross-CU DIE emission is being done in an ad-hoc on-demand way, and keeping the contexts straight runs into all sorts of corner cases. Would we be better off reorganizing it so we emit one CU at a time, and keep some sort of side/work list of 'stuff some other CU needs'? This would require deferring filling in the actual value of the reference because we don't know the section offset yet; either as something to be backpatched when we get around to emitting the other DIE, or by using relocations and letting the linker deal with it. The cross-CU references would all be DW_FORM_ref_addr so the size is known.

jmorse commented 3 years ago

Sitrep: D94976 continues to run into trouble. There's an increased risk that this won't make llvm-12. Technically this isn't a regression because the buggy behaviour is in llvm-11, but emitting broken DWARF is far from ideal, of course.

jmorse commented 3 years ago

Hi Tom, could you pull-up / cherry-pick the fix for this ticket

Actually, hold on this, there's a report in D95622 that it's causing a crash. I'd still like this to be in llvm-12, but best to wait a little bit.

jmorse commented 3 years ago

Hi Tom, could you pull-up / cherry-pick the fix for this ticket, ef0dcb50630 [0], into the LLVM 12 release branch? In rare circumstances LTO will produce a file with illegal DWARF syntax.

I don't think it's needed for rc1.

[0] https://reviews.llvm.org/rGef0dcb506300dc9644e8000c6028d14214be9d97

dwblaikie commented 3 years ago

& yeah, nothing to do with ctor or other (vtable, etc) homing strategies - IR debug info metadata type deduplication would be done even with -fstandalone-debug (ie: without any type homing enabled).

For what it's worth: Type homing is a purely frontend/IR generation feature and LLVM has no knowledge of or explicit support for type homing.

IR Type deduplication does rely on the frontend providing a mangled name in the type description metadata - this is how the frontend signals to LLVM that the type is subject to the ODR and so LLVM can drop duplicates based on name alone.

dwblaikie commented 3 years ago

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me. Compiled separately, 1.cpp would obviously have only a forward declaration. I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in 1.cpp? Possibly a side-effect of ctor homing? (I haven't looked at the insides of this at all, I'm just doing my usual black-box guessing based on the symptoms.)

Type definitions (& declarations) are deduplicated at the LLVM IR level based on their mangled name to reduce IR (& resulting DWARF) size. This is expected/desired - perhaps slightly not-what-the-source-says (but that's always going to be the case - if the type was defined in both files, we still wouldn't want to produce definitions in both due to size concerns).

jmorse commented 3 years ago

Possible patch: https://reviews.llvm.org/D94976

jmorse commented 3 years ago

(CC'ing modocache as he reported reverting D70350 downstream after experiencing difficulties with it).

jmorse commented 3 years ago

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me. Compiled separately, 1.cpp would obviously have only a forward declaration.

It's the presence of the global that's causing it to be emitted in 1.cpp. I imagine something similar to the subprogram happens to the type: it just gets created in the "current" CU.

I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in 1.cpp?

Yup,

Possibly a side-effect of ctor homing?

I don't believe ctor homing is turned on by default yet, I see -debug-info-kind=limited and -fuse-ctor-homing is absent in my clang -### commands.

pogo59 commented 3 years ago

The definition of HHH migrating from 2.cpp to 1.cpp seems odd to me. Compiled separately, 1.cpp would obviously have only a forward declaration. I take it with LTO, 2.cpp ends up with a cross-CU reference to the HHH in 1.cpp? Possibly a side-effect of ctor homing? (I haven't looked at the insides of this at all, I'm just doing my usual black-box guessing based on the symptoms.)

dwblaikie commented 3 years ago

Yep, looks like a bug to me.

Slightly smaller repro (& also reproduces on x86, and with full (not only -r) linking):

1.cpp: struct HHH; HHH *zzz;

2.cpp: void attribute((optnone)) attribute((nodebug)) f1() { }

struct HHH { template static int attribute((always_inline)) ccc() { f1(); } };

int main() { struct local { }; HHH::ccc(); }

As for fixing it, I think the most likely culprit/place to start would be the logic that places the definition of "eee" (or "main" in my example) in the wrong CU. I don't think there's any good reason for that to happen.

(I was hoping to tickle a possibly more severe version of this bug by making eee file-scoped static (because then moving it between CUs could be really problematic in other ways - if there was another static function with the same name in that other CU, etc) but seems something different happens in that case that doesn't trigger the bug - not sure why, don't think we do anything significantly different in the DWARF for file-scoped static entities)

llvm / llvm-project

[DebugInfo] Malformed DWARF produced during LTO build #48134

Extended Description