rustc ThinLTO-generated IR yields dwarfdump error

alexcrichton commented 7 years ago


Bugzilla Link	35212
Resolution	FIXED
Resolved on	Dec 01, 2017 22:37
Version	trunk
OS	Windows NT
Attachments	minimized IR
CC	@adrian-prantl,@dwblaikie,@JDevlieghere,@jdm,@pogo59

Extended Description

This is a minimization of an upstream issue at https://github.com/rust-lang/rust/issues/45511 where the Rust compiler, when using ThinLTO, generates DWARF information that the dwarfdump program interprets as invalid.

I managed to minimize to the attached IR. The IR attached has been reduced via machines (creduce on Rust source code, bugpoint on LLVM IR) and also by hand. As a result this isn't really a direct translation of anything in Rust but rather an "artisinal" reduction which may or may not have invalidated the reduction itself along the way.

I unfortunately know very little about DWARF much less dwarfdump. I'm operating under the assumption that anything accepted by LLVM shouldn't generate an error with dwarfdump, but that may also be totally wrong! If so, please just let me know!

In any case, I can reproduce the error via:

$ llc -filetype=obj -O0 foo.ll -o foo.o $ dwarfdump -i foo.o > /dev/null

dwarfdump ERROR: reference form with no valid local ref?!, offset=<0x00000078>: DW_DLE_ATTR_FORM_OFFSET_BAD (119)

alexcrichton commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#35562

llvmbot commented 6 years ago

We've found another situation in which this occurs at https://github.com/rust-lang/rust/issues/46346. The patch at https://reviews.llvm.org/D39981 fixed the problem for abstract subprogram DIEs created in DwarfCompileUnit::constructAbstractSubprogramScopeDIE, but the problem can also occur within non-inlined subprogram DIEs within DwarfUnit::getOrCreateSubprogramDIE.

To recap, the problem is that when the subprogram DIE is created, it inherits the CU from the context DIE, even if the DISubprogram specifies a different CU. Then while creating attributes within that subprogram DIE tree, DwarfUnit::addDIEEntry assumes that unlinked DIES are within the current CU, and so it doesn't always use DW_FORM_ref_addr when it should.

The committed patch fixed the problem for DwarfCompileUnit::constructAbstractSubprogramScopeDIE by changing the "current CU" to be the same as the CU from the context DIE. This is a bit of a hack, and it's going to be messier to do the same thing for DwarfUnit::getOrCreateSubprogramDIE since we need to fix the notion of "current CU" in the callers as well.

Additionally, these non-inlined subprogram DIEs have address ranges associated with them, and DwarfDebug::endFunctionImpl is adding these ranges to the CU specified by the DISubprogram, which isn't going to match the CU the subprogram DIE is in.

So we're looking for ideas on how to fix this.

I'd like to return to the idea of emitting the subprograms in the CU specified in the IR. Adrian raised a concern with this:

Because it doesn't help in the general case: We have to support method C.A defined in unit U which depends on type U.T and method C.B in unit V that depends on type V.Q.

but I think that DwarfUnit::addDIEEntry already handles this case, assuming that its "current CU" is correct. Does that sound right? Assuming this is going to work, do you have any pointers on how to get the code to do that? I guess one way is to disable all sharing of type DIEs between CUs, but would there be any way to avoid that?

JDevlieghere commented 6 years ago

Committed in r318289

JDevlieghere commented 6 years ago

I've created a differential and added the reproducer as a test case: https://reviews.llvm.org/D39981

As a consequence of the change the type actually end up in the "correct" CU. This makes sense to me, as there's no reason anymore for it to be in the "wrong" one.

llvmbot commented 6 years ago

That'd be much appreciated Jonas. And perhaps make that comment clearer :)

JDevlieghere commented 6 years ago

Great, thanks for the clarification Philip! I guess I was a little confused by what you meant with 'the same CU` in your comment.

// The scope may be shared with a subprogram that has already been // constructed in another CU, in which case we need to construct this // subprogram in the same CU.

Based on your comment on GitHub, do you want me to create a diff for this so we can get this upstream?

llvmbot commented 6 years ago

The patch isn't changing which CU the subprogram is emitted in. It's actually doing Adrian's first suggestion.

The problem is that in constructAbstractSubprogramScopeDIE there is a mismatch between this and the CU of ContextDIE. this is the CU that was specified in the IR, but the CU of ContextDIE is that of the first subprogram that was emitted. The patch fixes this mismatch by looking up the CU of ContextDIE, and switching to use that. This doesn't affect which CU the subprogram is emitted in, because we're always using ContextDie as the parent. What it does mean is that when it gets to addDIEEntry, this will be correct, and so DW_FORM_ref_addr will be used when needed.

JDevlieghere commented 6 years ago

The patch essentially does what I proposed in my last comment: move the subprogram declaration into the other CU. Unfortunately this won't work in the general case, as Arian pointed out.

alexcrichton commented 6 years ago

Philip Craig has found out 1 that a relatively small patch 2 seems to fix the issue locally, but perhaps one of y'all who are more familiar could give it a once over to make sure it's on the right track? If it looks good we can try to get it upstream!

alexcrichton commented 6 years ago

A recent comment on the rust bug pointed out that we may have encountered this before with normal LTO builds, it just was mostly unnoticed until now! https://github.com/rust-lang/rust/issues/44412

adrian-prantl commented 6 years ago

Because it doesn't help in the general case: We have to support method C.A defined in unit U which depends on type U.T and method C.B in unit V that depends on type V.Q.

JDevlieghere commented 6 years ago

I don't think the problem is limited to ThinLTO. The same problem arises when llvm-link'ing the attached bitcode files together. There are already multiple CUs in the input bitcode file, which I'm guessing is how the rust compiler work (which I don't know anything about).

Adrian, can you remind me again why we didn't consider option (3) where we emit the declarations for the subprogram in the other CU (i.e. the one specified in the IR). It sounds like that would make things easier for split DWARF if we can avoid the cross-CU reference.

alexcrichton commented 6 years ago

Also, is there perhaps an "easy fix" that we could use in rustc? Something that we could hack around for now while waiting for an upstream fix?

For example would it be "easy" to, just after all the ThinLTO passes, we "fix" the IR to have everything point to the same codegen unit?

alexcrichton commented 6 years ago

Heh yeah I don't believe we're using split dwarf at all in Rust right now. With LTO we just use LLVM's LinkModules (I think that's the name?), afaik we don't do anything "nonstandard" for that at least.

dwblaikie commented 6 years ago

Oh interesting David! Should we be doing something different on our end to avoid problems like this perhaps?

Are you using Fission/Split DWARF?

If you aren't, then yeah - I imagine there's some multi-CU stuff that's less well tested.

Though I'm surprised these sorts of things wouldn't've come up in plain LTO... so I'm just a bit confused.

alexcrichton commented 6 years ago

Oh interesting David! Should we be doing something different on our end to avoid problems like this perhaps?

dwblaikie commented 6 years ago

For now, I thought I'd managed to thread the needle and avoid ThinLTO ever producing two CUs in a single .o (In part because Fission can't cope with that - but maybe I only implemented this for Fission... ah, yeah, -split-dwarf-cross-cu-references causes cross-CU inlined subroutines to be placed in the same unit as the source of the reference to avoid the cross-CU referencing under Split DWARF)

Yeah, this won't be a very well tested scenario at least on Google's side of ThinLTO, due to our predominant use of Fission, etc.

& yeah, I'm surprised this didn't come up with normal LTO?

adrian-prantl commented 6 years ago

0x00000000: Compile Unit: length = 0x00000132 version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x00000136) 0x00000078: DW_TAG_base_type DW_AT_name ("usize") DW_AT_encoding (DW_ATE_unsigned) DW_AT_byte_size (0x08)

<...> 0x00000136: Compile Unit: length = 0x00000066 version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x000001a0) <...> 0x00000171: DW_TAG_structure_type DW_AT_name ("RawVec<alloc::string::String, alloc::heap::Heap>") DW_AT_byte_size (0x10) DW_AT_alignment (8)

0x00000178: DW_TAG_subprogram DW_AT_linkage_name ("_ZN5alloc7raw_vec8{{impl}}52allocate_in<alloc::string::String,alloc::heap::Heap>E") DW_AT_name ("allocate_in<alloc::string::String,alloc::heap::Heap>") DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs") DW_AT_decl_line (82) DW_AT_external (true) DW_AT_APPLE_optimized (true) DW_AT_inline (DW_INL_inlined)

0x00000184: DW_TAG_subprogram DW_AT_linkage_name ("_ZN5alloc7raw_vec8{{impl}}36with_capacityE") DW_AT_name ("with_capacity") DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs") DW_AT_decl_line (140) DW_AT_external (true) DW_AT_APPLE_optimized (true) DW_AT_inline (DW_INL_inlined)

0x00000190: DW_TAG_formal_parameter DW_AT_name ("cap") DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs") DW_AT_decl_line (1) DW_AT_type (cu + 0x0078 "")

I think there are two things that we can/should do here:

It is essentially arbitrary in which CU the DW_TAG_structure_type is being emitted (and it depends on which DISubprogram we are visiting first). For that reason we should fix the code in LLVM's DWARF backend that decides to emit the DW_AT_type as a DW_FORM_ref4 and correctly determine that we need a DW_FORM_ref_addr instead. This can probably be achieved by comparing the parent DIEs of the formal parameter and the type.
There's an optimization potential for basic types: Basic types do not really belong to any specific CU and are typically very small, so we should consider emitting them into each CU that references them instead of generating a longer DW_FORM_ref_addr to reference them. While doing this would also fix the above testcase, it won't fix the the issue for anything other than basic types.

alexcrichton commented 6 years ago

Heh ok so this is where things could get a bit interesting. Unfortunately these are all massively reduced from the original IR which Rust tends to generate a lot of. I'll try to dump it all in though to see if that could help!

Unfortunately the files were a bit too large to upload here so I've made a gist [1]. There's two tarballs in that gist. First is inputs.tar.gz. These are all the input IR files to the ThinLTO backend. Basically rustc created all these codegen units for the crate in question (not a reduced crate, the original source) and then after it produced those modules and ran optimizations on them the next step was to run ThinLTO across all the modules.

I checked and a bunch of the object files ended up having invalid dwarf, but I decided to focus on one. The stages.tar.gz is the next most interesting tarball which contains the various stages of ThinLTO processing of the test.test0* module. This module's eventual object file was indeed invalid (according to dwarfdump), but the fun part was that each intermediate stage is valid!

The order of what happened here was:

thin-lto-input.ll -- input to ThinLTO
thin-lto-after-rename.ll -- after calling renameModuleForThinLTO
thin-lto-after-resolve.ll -- after calling thinLTOResolveWeakForLinkerModule
thin-lto-after-internalize.ll -- after calling thinLTOInternalizeModule
thin-lto-after-import.ll -- after using FunctionImporter
thin-lto-after-pm.ll -- after running ThinLTO optimization passes

Interestingly if you run all IR files through llc to generate an object file, they all generate object files that dwarfdump doesn't complain about except the last one, after-pm.ll. I verified this by just running llc $input -o foo.o -filetype=obj && dwarfdump -i foo.o > /dev/null, and only the after-pm.ll complained.

Even more interestingly I ran opt manually over after-import.ll and it didn't produce an invalid object file! The only difference between after-import.ll and after-pm.ll is after running ThinLTO optimization passes, but running the full set of optimization passes is apparently different enough that it correctly handles the dwarf info?

Unfortunately I don't know how to reproduce ThinLTO passes on the command line so I wasn't able to get a command line reproduction which takes after-import.ll and then modifies it to a state which is invalid (when passed through llc).

I hope that's not too much information! If I can help clarify anything here I'd be more than willing to help reduce and make sure that it can be debugged.

https://gist.github.com/alexcrichton/a7dbb0062491640d9d15d06ce10d34ef

JDevlieghere commented 6 years ago

If I understand correctly, the problem can be summarized as two DISubprogram that each have a different DICompileUnit, but share a scope end up in the same DWARF compile unit.

Assuming this is the result of (thin)LTO, I'd be interested to have a look at the IR before it's linked in. Maybe that will tell us why they are sharing a scope, which I think is the fundamental issue here.

alexcrichton commented 7 years ago

There's some discussion 1 on the upstream issue where it seems like LLVM may mostly assume that one LLVM module only has one codegen unit of debug information (dwarf-wise at least). I believe that the reason the multiple codegen units came into existence was because of ThinLTO (where ThinLTO inlined debug info presumably).

One possible fix 2 sounded like it may be to just inline debug info into the original codegen unit during ThinLTO inlining, but I'm not sure if that's (a) the best fix or (b) if it'd actually fix things.

alexcrichton commented 7 years ago

assigned to @JDevlieghere

llvm / llvm-project

rustc ThinLTO-generated IR yields dwarfdump error #34560

Extended Description