Closed alexcrichton closed 2 years ago
mentioned in issue llvm/llvm-bugzilla-archive#35562
We've found another situation in which this occurs at https://github.com/rust-lang/rust/issues/46346. The patch at https://reviews.llvm.org/D39981 fixed the problem for abstract subprogram DIEs created in DwarfCompileUnit::constructAbstractSubprogramScopeDIE, but the problem can also occur within non-inlined subprogram DIEs within DwarfUnit::getOrCreateSubprogramDIE.
To recap, the problem is that when the subprogram DIE is created, it inherits the CU from the context DIE, even if the DISubprogram specifies a different CU. Then while creating attributes within that subprogram DIE tree, DwarfUnit::addDIEEntry assumes that unlinked DIES are within the current CU, and so it doesn't always use DW_FORM_ref_addr when it should.
The committed patch fixed the problem for DwarfCompileUnit::constructAbstractSubprogramScopeDIE by changing the "current CU" to be the same as the CU from the context DIE. This is a bit of a hack, and it's going to be messier to do the same thing for DwarfUnit::getOrCreateSubprogramDIE since we need to fix the notion of "current CU" in the callers as well.
Additionally, these non-inlined subprogram DIEs have address ranges associated with them, and DwarfDebug::endFunctionImpl is adding these ranges to the CU specified by the DISubprogram, which isn't going to match the CU the subprogram DIE is in.
So we're looking for ideas on how to fix this.
I'd like to return to the idea of emitting the subprograms in the CU specified in the IR. Adrian raised a concern with this:
Because it doesn't help in the general case: We have to support method C.A defined in unit U which depends on type U.T and method C.B in unit V that depends on type V.Q.
but I think that DwarfUnit::addDIEEntry already handles this case, assuming that its "current CU" is correct. Does that sound right? Assuming this is going to work, do you have any pointers on how to get the code to do that? I guess one way is to disable all sharing of type DIEs between CUs, but would there be any way to avoid that?
Committed in r318289
I've created a differential and added the reproducer as a test case: https://reviews.llvm.org/D39981
As a consequence of the change the type actually end up in the "correct" CU. This makes sense to me, as there's no reason anymore for it to be in the "wrong" one.
That'd be much appreciated Jonas. And perhaps make that comment clearer :)
Great, thanks for the clarification Philip! I guess I was a little confused by what you meant with 'the same CU` in your comment.
// The scope may be shared with a subprogram that has already been // constructed in another CU, in which case we need to construct this // subprogram in the same CU.
Based on your comment on GitHub, do you want me to create a diff for this so we can get this upstream?
The patch isn't changing which CU the subprogram is emitted in. It's actually doing Adrian's first suggestion.
The problem is that in constructAbstractSubprogramScopeDIE there is a mismatch between this
and the CU of ContextDIE. this
is the CU that was specified in the IR, but the CU of ContextDIE is that of the first subprogram that was emitted. The patch fixes this mismatch by looking up the CU of ContextDIE, and switching to use that. This doesn't affect which CU the subprogram is emitted in, because we're always using ContextDie as the parent. What it does mean is that when it gets to addDIEEntry, this
will be correct, and so DW_FORM_ref_addr will be used when needed.
The patch essentially does what I proposed in my last comment: move the subprogram declaration into the other CU. Unfortunately this won't work in the general case, as Arian pointed out.
A recent comment on the rust bug pointed out that we may have encountered this before with normal LTO builds, it just was mostly unnoticed until now! https://github.com/rust-lang/rust/issues/44412
Because it doesn't help in the general case: We have to support method C.A defined in unit U which depends on type U.T and method C.B in unit V that depends on type V.Q.
I don't think the problem is limited to ThinLTO. The same problem arises when llvm-link'ing the attached bitcode files together. There are already multiple CUs in the input bitcode file, which I'm guessing is how the rust compiler work (which I don't know anything about).
Adrian, can you remind me again why we didn't consider option (3) where we emit the declarations for the subprogram in the other CU (i.e. the one specified in the IR). It sounds like that would make things easier for split DWARF if we can avoid the cross-CU reference.
Also, is there perhaps an "easy fix" that we could use in rustc? Something that we could hack around for now while waiting for an upstream fix?
For example would it be "easy" to, just after all the ThinLTO passes, we "fix" the IR to have everything point to the same codegen unit?
Heh yeah I don't believe we're using split dwarf at all in Rust right now. With LTO we just use LLVM's LinkModules
(I think that's the name?), afaik we don't do anything "nonstandard" for that at least.
Oh interesting David! Should we be doing something different on our end to avoid problems like this perhaps?
Are you using Fission/Split DWARF?
If you aren't, then yeah - I imagine there's some multi-CU stuff that's less well tested.
Though I'm surprised these sorts of things wouldn't've come up in plain LTO... so I'm just a bit confused.
Oh interesting David! Should we be doing something different on our end to avoid problems like this perhaps?
For now, I thought I'd managed to thread the needle and avoid ThinLTO ever producing two CUs in a single .o (In part because Fission can't cope with that - but maybe I only implemented this for Fission... ah, yeah, -split-dwarf-cross-cu-references causes cross-CU inlined subroutines to be placed in the same unit as the source of the reference to avoid the cross-CU referencing under Split DWARF)
Yeah, this won't be a very well tested scenario at least on Google's side of ThinLTO, due to our predominant use of Fission, etc.
& yeah, I'm surprised this didn't come up with normal LTO?
0x00000000: Compile Unit: length = 0x00000132 version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x00000136) 0x00000078: DW_TAG_base_type DW_AT_name ("usize") DW_AT_encoding (DW_ATE_unsigned) DW_AT_byte_size (0x08)
<...> 0x00000136: Compile Unit: length = 0x00000066 version = 0x0004 abbr_offset = 0x0000 addr_size = 0x08 (next unit at 0x000001a0) <...> 0x00000171: DW_TAG_structure_type DW_AT_name ("RawVec<alloc::string::String, alloc::heap::Heap>") DW_AT_byte_size (0x10) DW_AT_alignment (8)
0x00000178: DW_TAG_subprogram DW_AT_linkage_name ("_ZN5alloc7raw_vec8{{impl}}52allocate_in<alloc::string::String,alloc::heap::Heap>E") DW_AT_name ("allocate_in<alloc::string::String,alloc::heap::Heap>") DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs") DW_AT_decl_line (82) DW_AT_external (true) DW_AT_APPLE_optimized (true) DW_AT_inline (DW_INL_inlined)
0x00000184: DW_TAG_subprogram
DW_AT_linkage_name ("_ZN5alloc7raw_vec8{{impl}}36with_capacity
0x00000190: DW_TAG_formal_parameter DW_AT_name ("cap") DW_AT_decl_file ("/home/alex/code/rust4/lol/../lib.rs") DW_AT_decl_line (1) DW_AT_type (cu + 0x0078 "")
I think there are two things that we can/should do here:
Heh ok so this is where things could get a bit interesting. Unfortunately these are all massively reduced from the original IR which Rust tends to generate a lot of. I'll try to dump it all in though to see if that could help!
Unfortunately the files were a bit too large to upload here so I've made a gist [1]. There's two tarballs in that gist. First is inputs.tar.gz
. These are all the input IR files to the ThinLTO backend. Basically rustc created all these codegen units for the crate in question (not a reduced crate, the original source) and then after it produced those modules and ran optimizations on them the next step was to run ThinLTO across all the modules.
I checked and a bunch of the object files ended up having invalid dwarf, but I decided to focus on one. The stages.tar.gz
is the next most interesting tarball which contains the various stages of ThinLTO processing of the test.test0* module. This module's eventual object file was indeed invalid (according to dwarfdump), but the fun part was that each intermediate stage is valid!
The order of what happened here was:
renameModuleForThinLTO
thinLTOResolveWeakForLinkerModule
thinLTOInternalizeModule
FunctionImporter
Interestingly if you run all IR files through llc
to generate an object file, they all generate object files that dwarfdump doesn't complain about except the last one, after-pm.ll. I verified this by just running llc $input -o foo.o -filetype=obj && dwarfdump -i foo.o > /dev/null
, and only the after-pm.ll complained.
Even more interestingly I ran opt
manually over after-import.ll
and it didn't produce an invalid object file! The only difference between after-import.ll and after-pm.ll is after running ThinLTO optimization passes, but running the full set of optimization passes is apparently different enough that it correctly handles the dwarf info?
Unfortunately I don't know how to reproduce ThinLTO passes on the command line so I wasn't able to get a command line reproduction which takes after-import.ll
and then modifies it to a state which is invalid (when passed through llc).
I hope that's not too much information! If I can help clarify anything here I'd be more than willing to help reduce and make sure that it can be debugged.
https://gist.github.com/alexcrichton/a7dbb0062491640d9d15d06ce10d34ef
If I understand correctly, the problem can be summarized as two DISubprogram that each have a different DICompileUnit, but share a scope end up in the same DWARF compile unit.
Assuming this is the result of (thin)LTO, I'd be interested to have a look at the IR before it's linked in. Maybe that will tell us why they are sharing a scope, which I think is the fundamental issue here.
There's some discussion 1 on the upstream issue where it seems like LLVM may mostly assume that one LLVM module only has one codegen unit of debug information (dwarf-wise at least). I believe that the reason the multiple codegen units came into existence was because of ThinLTO (where ThinLTO inlined debug info presumably).
One possible fix 2 sounded like it may be to just inline debug info into the original codegen unit during ThinLTO inlining, but I'm not sure if that's (a) the best fix or (b) if it'd actually fix things.
assigned to @JDevlieghere
Extended Description
This is a minimization of an upstream issue at https://github.com/rust-lang/rust/issues/45511 where the Rust compiler, when using ThinLTO, generates DWARF information that the
dwarfdump
program interprets as invalid.I managed to minimize to the attached IR. The IR attached has been reduced via machines (creduce on Rust source code, bugpoint on LLVM IR) and also by hand. As a result this isn't really a direct translation of anything in Rust but rather an "artisinal" reduction which may or may not have invalidated the reduction itself along the way.
I unfortunately know very little about DWARF much less dwarfdump. I'm operating under the assumption that anything accepted by LLVM shouldn't generate an error with dwarfdump, but that may also be totally wrong! If so, please just let me know!
In any case, I can reproduce the error via:
$ llc -filetype=obj -O0 foo.ll -o foo.o $ dwarfdump -i foo.o > /dev/null
dwarfdump ERROR: reference form with no valid local ref?!, offset=<0x00000078>: DW_DLE_ATTR_FORM_OFFSET_BAD (119)