llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.8k stars 11.45k forks source link

llvm-gsymutil is failing with "child range not contained in parent" error #46501

Open petrhosek opened 4 years ago

petrhosek commented 4 years ago
Bugzilla Link 47157
Version trunk
OS All
CC @dwblaikie,@JDevlieghere,@walkerkd,@pogo59,@frobtech,@thanm

Extended Description

I haven't yet managed to create a minimal reproducer, so here is a full binary for now until I manage to reduce it: https://storage.googleapis.com/fuchsia-build/basic_fuzzer_test.lzma

To reproduce the issue, simply run:

$ llvm-gsymutil --convert=basic_fuzzer_test --out-file=basic_fuzzer_test.gsym
...
llvm-gsymutil: error: basic_fuzzer_test: child range not contained in parent

This binary is somewhat unusual in that it's a Go fuzzer (compiled with the Go compiler) that calls into C++ test harness (compiled with Clang), so the binary includes an entire Go runtime and the generated DWARF looks somewhat different from DWARF generated by Clang.

frobtech commented 3 years ago

DWARF is explicitly a loose specification and it's always wise for DWARF consumers to be flexible in their expectations of the input. Rather than demanding the input have a certain shape, always do the best you can discern the most useful interpretation of the input you get. When the input has an unexpected shape that makes it hard to be confident what interpretation is really most useful, then warn about the data but rarely should such a case be a hard error as invalid input data.

petrhosek commented 3 years ago

I wonder if we could adopt the same solution delve did in https://github.com/go-delve/delve/pull/1807 for llvm-gsymutil to make it more tolerant? While it'd be ideal to address this in Go, it might take a while until that happens, and even longer until all binaries are rebuilt using the new compiler.

dwblaikie commented 4 years ago

The other thing to think about is what clients that use this DWARF will expect. For example if you want to symbolicate an address to get the function name and scope correctly, clients will typically check the DW_TAG_compile_unit's ranges and see if the address falls within those ranges and if the address does, we search all of the DIEs. Most code I have seen will, when it encounters a DW_TAG_subprogram DIEs, just check that address range of that DIE and skip to the sibling of that DW_TAG_subprogram and continuing to search.

Hmmm that's making an assumption about nested subprograms that isn't valid for all languages. The skip-to-sibling works for C because it doesn't have nested functions, nor did C++ until relatively recently. But for something like Pascal nested procedures, I'd expect the DIE tree to reflect the lexical parentage, while the generated-code range of the inner procedure would not be contained within the generated-code range of its parent. The local class and lambda cases in C++ should follow a similar pattern IMO.

Yeah, but for now they don't. GCC and LLVM always put their subprogram definitions outside any nesting like that (I think GCC goes further and puts subprogram definitions always as direct children of the compile_unit - won't even define a subprogram in a namespace - instead it'll declare it in the namespace and define it at the root/CU scope), so at least some consumers (including llvm-symbolizer, for instance) rely on this - it's faster, means you can skip whole subtrees when searching for "which DWARF describes this instruction address". It doesn't handle all possible DWARF - but given how permissive DWARF is, I'm not sure that's a practical goal, generally.

(I wonder what GCC's DWARF for the Pascal example you gave is - I'm guessing they do something like they do with C++ Examples and declare the nested function inside the outer function, but define the nested function at the CU/root scope of the DIE tree)

pogo59 commented 4 years ago

The other thing to think about is what clients that use this DWARF will expect. For example if you want to symbolicate an address to get the function name and scope correctly, clients will typically check the DW_TAG_compile_unit's ranges and see if the address falls within those ranges and if the address does, we search all of the DIEs. Most code I have seen will, when it encounters a DW_TAG_subprogram DIEs, just check that address range of that DIE and skip to the sibling of that DW_TAG_subprogram and continuing to search.

Hmmm that's making an assumption about nested subprograms that isn't valid for all languages. The skip-to-sibling works for C because it doesn't have nested functions, nor did C++ until relatively recently. But for something like Pascal nested procedures, I'd expect the DIE tree to reflect the lexical parentage, while the generated-code range of the inner procedure would not be contained within the generated-code range of its parent. The local class and lambda cases in C++ should follow a similar pattern IMO.

The situation is different for inlined subprograms, which obviously should be contained within the caller's code range. But that's different from the lexical-parent case.

llvmbot commented 4 years ago

The other thing to think about is what clients that use this DWARF will expect. For example if you want to symbolicate an address to get the function name and scope correctly, clients will typically check the DW_TAG_compile_unit's ranges and see if the address falls within those ranges and if the address does, we search all of the DIEs. Most code I have seen will, when it encounters a DW_TAG_subprogram DIEs, just check that address range of that DIE and skip to the sibling of that DW_TAG_subprogram and continuing to search. Accelerator tables in DWARF also just point you to a compile unit that you must linearly search. LLVM code will fail to do the lookup on this DWARF because the code does exactly this:

DWARFContext::DIEsForAddress DWARFContext::getDIEsForAddress(uint64_t Address) {
  DIEsForAddress Result;

  DWARFCompileUnit *CU = getCompileUnitForAddress(Address);
  if (!CU)
    return Result;

  Result.CompileUnit = CU;
  Result.FunctionDIE = CU->getSubroutineForAddress(Address);

  std::vector<DWARFDie> Worklist;
  Worklist.push_back(Result.FunctionDIE);
  while (!Worklist.empty()) {
    DWARFDie DIE = Worklist.back();
    Worklist.pop_back();

    if (!DIE.isValid())
      continue;

    if (DIE.getTag() == DW_TAG_lexical_block &&
        DIE.addressRangeContainsAddress(Address)) {
      Result.BlockDIE = DIE;
      break;
    }

    for (auto Child : DIE)
      Worklist.push_back(Child);
  }

  return Result;
}

So all lookups will fail if your DWARF is as it currently emitted for any addresses in "runtime.dolockOSThread".

dwblaikie commented 4 years ago

(that case might not be warned on by llvm-dwarfdump --verify, though... maybe - I mean, I don't think GCC or LLVM produce that sort of DWARF anyway so any implementation there might be accidental)

dwblaikie commented 4 years ago

A colleague of mine is asking for clarification on whether this particular oddity is definitely illegal according to the DWARF spec. Is there a spot in the specification one could point to that disallows this particular insanity? Thanks.

Don't think so, no. Believe it falls under the whole "DWARF is a permissive standard" and there are certainly some non-trivial cases where it could make sense. A C++ lambda or other local type with member functions defined inline (couldn't be defined anywhere else) could be represented as a truly nested type (so you'd have subprogram { structure_type { subprogram { } }) but it might not make sense to describe the local type's member functions as being within the range outer subprogram - they are distinct functions.

thanm commented 4 years ago

A colleague of mine is asking for clarification on whether this particular oddity is definitely illegal according to the DWARF spec. Is there a spot in the specification one could point to that disallows this particular insanity? Thanks.

petrhosek commented 4 years ago

For reference, this is a workaround implemented by delve: https://github.com/go-delve/delve/pull/1807

petrhosek commented 4 years ago

This was already filed as an issue against Go over a year ago but so far there hasn't been any response: https://github.com/golang/go/issues/33188

llvmbot commented 4 years ago

This brings up an issue that we have had for a while: compilers and linkers generating DWARF with no checks on the validity of the produced DWARF. llvm-gsymutil is catching these kinds of issues when it parses the DWARF and bringing the issues to our attention. The functionality behind the "llvm-dwarfdump --verify" is available in:

bool DWARFContext::verify(raw_ostream &OS, DIDumpOptions DumpOpts);

So it is easy for tools that are based on llvm/clang sources to integrate this into debug builds of the drivers for the compiler and linker. We will probably need to make some modifications to the verification code to ensure it handles all files (unlinked .o files, linked executables and dynamic libraries) correctly, but it is a great way to verify the DWARF is correct before we pass it on to other tools.

So sounds like this bug should probably be switched over to the Go compiler to fix this DWARF issue?

petrhosek commented 4 years ago

This looks a like an issue in the Go compiler, so it's something we need to file a bug for with Go maintainers.

Whether llvm-gsymutil should just print a warning or fail, that's a great question. I'm not sure if there's any precedent among LLVM tools, but it'd be nice to make that behavior configurable, for example by having a -strict mode (other suggestions are welcome).

The way I discovered this issue (as well as llvm/llvm-project#46494) is by trying to enable llvm-gsymutil in our build as a post-processing step. Currently, that fails the build because of these issues, and while I hope that we can address those, it may take some time, especially if we need to fix other compilers, and so I'd like to make these errors non-fatal in the meantime even if it means that we aren't going to produce GSYM files for some binaries. We could achieve the same by wrapping llvm-gsymutil in a script that discards the exit code, but having a flag would be nicer.

dwblaikie commented 4 years ago

Either - it's baked into the IR representation to the best of understanding of it.

Any given IR instruction may have a debug location, that debug location has a chain of scopes - we only put them inside each other in the DIE tree when they're in that chain, and their address ranges are derived from that scope chain - I don't know of any way a child scope could include an address/range that is not included in its parent scope, due to the nature of this implementation strategy.

(there was a bug recently about instruction bundles (used on Hexagon and Sparc at least) - that results in a correct line table, but missing scopes in the DIE tree - but that wouldn't result in children with addresses not included in their parent so far as I can imagine)

Essentially this applies to any IR (single compilation, Thin or full LTO, JIT, weird hand-crafted IR, etc) - the nature of the way scoping is represented makes it, to the best of my knowledge, fairly impossible (always the possibility of bugs, to be sure - but I have a hard time imagining the shape of the bugs necessary to allow it in this situation) to have a child scope include addresses that are not in the parent scope.

llvmbot commented 4 years ago

Which LTO? Monolithic LTO of Thin LTO? Or Both?

dwblaikie commented 4 years ago

I don't know of any aspect of LLVM's LTO implementation that could lead to this DWARF being emitted, FWIW. The way scopes are represented makes it pretty impossible to have a nested scope that doesn't have a nested address range, to the best of my knowledge.

llvmbot commented 4 years ago

The main questions is should llvm-gsymutil be returning an error, or just a warning, when it encounters invalid DWARF? I am fine with either. There are many LTO binaries that have this issue, and it would be a shame to not generate a gsym if the DWARF has any errors.

dwblaikie commented 4 years ago

running llvm-dwarfdump --verify on this file gives a more informative error related to the "child range not contained in parent" including dumping the relevant DIEs. The first example the verifier prints is:

0x00145adb: DW_TAG_inlined_subroutine DW_AT_abstract_origin (0x000000000014199a "runtime.lockOSThread") DW_AT_ranges (0x0001b790 [0x000000000007dc1f, 0x000000000007dc2e) [0x000000000007dc2e, 0x000000000007dc34) [0x000000000007dc43, 0x000000000007dc44)) DW_AT_call_file ("./runtime/cgocall.go") DW_AT_call_line (190)

0x00145aea: DW_TAG_inlined_subroutine DW_AT_abstract_origin (0x00000000001419b3 "runtime.dolockOSThread") DW_AT_ranges (0x0001b7e0 [0x000000000007dc34, 0x000000000007dc43) [0x000000000007dc47, 0x000000000007dc4e) [0x000000000007dc4e, 0x000000000007dc52)) DW_AT_call_file ("./runtime/proc.go") DW_AT_call_line (3675)

And that certainly looks like bogus DWARF to me. The first range in the runtime.doLockOSThread is exactly not contained in the outer, runtime.lockOSThread range list, I think.