libbpf / blazesym

blazesym is a library for address symbolization and related tasks

Gnu debug link CRC read failures #769

Closed r1viollet closed 1 month ago

r1viollet commented 1 month ago

Description

I received a report of a failure when symbolizing some libraries that use split debug info. Investigating further, everything looked like it should be working as expected.

So I dug deeper to try and understand. I removed the debug link and re-added it, and that fixed it! Looking at the CRC read, the broken version has only three bytes of data left where a u32 is expected, which yields a failure:

blazecli symbolize elf --path Datadog.Profiler.Native.so 0x8e842
[61, 40, 17] // printing the content of the data before crc read
Error: failed to symbolize addresses

Caused by:
    failed to read debug link checksum

Compare with the version that succeeds (after rewriting the debug link):

blazecli symbolize elf --path ./dupe.so 0x8e842
[65, 61, 40, 17]
0x0000000008e842: anyhow::error::<impl anyhow::Error>::construct @ 0x8e7bd+0x85 /go/src/github.com/DataDog/apm-reliability/libddprof-build/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.81/src/error.rs:245:40
                  alloc::boxed::Box<T>::new @ /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs [inlined]

The difference is probably related to the version of objcopy being used and alignment constraints imposed. I will add the library that highlights this issue (it is open source code) and continue digging when I have some time.

r1viollet commented 1 month ago

I could not attach the libraries to the issue, so this will take a few extra download steps to reproduce.

If I modify the logic to just take the last 4 bytes, it fixes the issue. However, I am not sure whether this complies with the ELF specification.

    let crc_bytes = &data[data_len - 4..];
    let crc = u32::from_le_bytes([crc_bytes[0], crc_bytes[1], crc_bytes[2], crc_bytes[3]]);
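
As a self-contained sketch of that workaround: the helper name `crc_from_tail` is hypothetical and not a blazesym function. It assumes a little-endian ELF file and that the CRC really is the final four bytes of the section, which is exactly the part that has not been checked against the specification.

```rust
/// Hypothetical helper (not part of blazesym): read the debug link CRC
/// from the last four bytes of the `.gnu_debuglink` section contents.
/// Assumes a little-endian ELF file and no trailing bytes after the CRC.
fn crc_from_tail(data: &[u8]) -> Option<u32> {
    // Return `None` instead of panicking if the section is shorter than a CRC.
    let start = data.len().checked_sub(4)?;
    let b = &data[start..];
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}
```

Called on the bytes `[0, 0, 65, 61, 40, 17]`, this returns `Some(0x11283d41)`; on a slice shorter than four bytes it returns `None`.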

We can see that the address of the CRC is not aligned:

Data before extracting checksum: [0, 0, 65, 61, 40, 17] - addr=0x743df30ec0b1
Checksum bytes: [65, 61, 40, 17], CRC: 0x11283d41
0x0000000008e842: anyhow::error::<impl anyhow::Error>::construct @ 0x8e7bd+0x85 /go/src/github.com/DataDog/apm-reliability/libddprof-build/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.81/src/error.rs:245:40
                  alloc::boxed::Box<T>::new @ /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs [inlined]

danielocfb commented 1 month ago

Thanks for the report! Will take a look today.

danielocfb commented 1 month ago

Our logic is based on https://sourceware.org/gdb/current/onlinedocs/gdb.html/Separate-Debug-Files.html, which states:

A debug link is a special section of the executable file named .gnu_debuglink. The section must contain:

A filename, with any leading directory components removed, followed by a zero byte,
zero to three bytes of padding, as needed to reach the next four-byte boundary within the section, and
a four-byte CRC checksum, stored in the same endianness used for the executable file itself.

To the best of my reading, that's exactly what we expect. It's interesting that basically nothing is aligned in this binary:

$ readelf --sections /tmp/repro/linux-x64/Datadog.Profiler.Native.so --wide
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [28] .gnu_debuglink    PROGBITS        0000000000000000 70b153 000024 00      0   0  1
...

So there is no alignment requirement on the section itself (Al=1). That's different to what I see on binaries I've looked at, but it's not necessarily wrong. The offset is 0x70b153 from the start of the binary (which is page aligned as per my understanding), so that is unaligned. Also not wrong (I believe), just uncommon.
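
To make these numbers concrete, here is a small sketch of the layout arithmetic implied by the quoted description, with every offset counted from the start of the section. `debug_link_layout` is a made-up helper for illustration, not something from blazesym or binutils.

```rust
/// Illustrative helper (hypothetical, not blazesym API): given the length of
/// the debug link file name, return `(padding, crc_offset, section_size)`
/// for a `.gnu_debuglink` section. All offsets are relative to the start of
/// the section.
fn debug_link_layout(name_len: usize) -> (usize, usize, usize) {
    // File name plus its terminating zero byte.
    let after_name = name_len + 1;
    // Round up to the next multiple of four within the section.
    let crc_offset = (after_name + 3) & !3;
    let padding = crc_offset - after_name;
    // The four-byte CRC follows the padding.
    (padding, crc_offset, crc_offset + 4)
}
```

For the 29-byte name `Datadog.Profiler.Native.debug` this gives 2 bytes of padding, a CRC at section offset 32, and a total size of 36 (0x24), which agrees with the Size column in the readelf output. The file offset 0x70b153 itself is 3 modulo 4.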

Our alignment logic itself seems kosher:

--- src/dwarf/debug_link.rs
+++ src/dwarf/debug_link.rs
@@ -145,13 +145,16 @@ pub(crate) fn read_debug_link(parser: &ElfParser) -> Result<Option<(&OsStr, u32)
     // SANITY: We just found the index so the section should always be
     //         found.
     let mut data = parser.section_data(idx).unwrap();
+    println!("RAW DATA: {data:p}: {data:#x?}");
     let file = data
         .read_cstr()
         .ok_or_invalid_data(|| "failed to read debug link file name")?;
     let file = bytes_to_os_str(file.to_bytes())?;
+    println!("BEFORE ALIGN: {data:p}: {data:#x?}");
     let () = data.align(4).ok_or_invalid_data(|| {
         "debug link section contains insufficient data: checksum not found"
     })?;
+    println!("AFTER ALIGN: {data:p}: {data:#x?}");
     // TODO: The CRC value is in the same endianess as the ELF file itself. Once
     //       we support non-host endianesses we need to take that into account.
     let crc = data

RAW DATA: 0x7f4df760e153: [
    0x44,
    0x61,
    0x74,
    0x61,
    0x64,
    0x6f,
    0x67,
    0x2e,
    0x50,
    0x72,
    0x6f,
    0x66,
    0x69,
    0x6c,
    0x65,
    0x72,
    0x2e,
    0x4e,
    0x61,
    0x74,
    0x69,
    0x76,
    0x65,
    0x2e,
    0x64,
    0x65,
    0x62,
    0x75,
    0x67,
    0x0,
    0x0,
    0x0,
    0x69,
    0xc4,
    0xd4,
    0xa6,
]
BEFORE ALIGN: 0x7f4df760e171: [
    0x0,
    0x0,
    0x69,
    0xc4,
    0xd4,
    0xa6,
]
AFTER ALIGN: 0x7f4df760e174: [
    0xc4,
    0xd4,
    0xa6,
]

This all seems to be by the book.

Have you checked how other tools behave by any chance? If I open the file in gdb I see:

Missing separate debuginfo for /tmp/repro/linux-x64/Datadog.Profiler.Native.so.
The debuginfo package for this file is probably broken.

So it may be choking on the same issue.

llvm-symbolizer also does basically nothing, but it's not exactly vocal as to why that is (that is, it could conceivably have other reasons):

$ llvm-symbolizer --obj=/tmp/repro/linux-x64/Datadog.Profiler.Native.so --functions 0x8e842 --verbose --debug-file-directory=/tmp/repro/symbols/linux-x64/linux-x64
??
  Filename: ??
  Line: 0
  Column: 0

Same with llvm-addr2line:

$ llvm-addr2line --obj=/tmp/repro/linux-x64/Datadog.Profiler.Native.so --functions 0x8e842 --debug-file-directory=/tmp/repro/symbols/linux-x64/linux-x64 --verbose
??
  Filename: ??
  Line: 0
  Column: 0

And eu-addr2line:

$ eu-addr2line -e /tmp/repro/linux-x64/Datadog.Profiler.Native.so --functions 0x8e842
??
??:0

From a brief look, these all seem to be debug link aware. E.g., they work on a stripped binary with only a debug link when I use valid blazesym test data:

eu-addr2line -e .../blazesym/data/test-stable-addrs-stripped-with-link.bin --functions 0x2000100
factorial
.../blazesym/data/test-stable-addrs.c:10:27

So to me, everything seems to point to this being a faulty binary. What toolchain and version generated it?

danielocfb commented 1 month ago

On the other hand...

$ readelf -wk Datadog.Profiler.Native.so

Datadog.Profiler.Native.so: Found separate debug info file: Datadog.Profiler.Native.debug
Contents of the .gnu_debuglink section (loaded from Datadog.Profiler.Native.so):

  Separate debug info file: Datadog.Profiler.Native.debug
  CRC value: 0xa6d4c469

and

$ readelf Datadog.Profiler.Native.so --debug-dump=follow-links --process-links --symbols --wide
In linked file '/tmp/repro/linux-x64/Datadog.Profiler.Native.debug' symbol section '.symtab' contains 21459 entries:
<more symbols than without debug links following>

So readelf at least seems to be able to make sense of the data.

danielocfb commented 1 month ago

https://github.com/libbpf/blazesym/issues/769#issuecomment-2263475250

Actually, it seems that if I link the .debug file into the directory of the main binary, everything works with the other tools. So they don't seem to be choking on the CRC.

danielocfb commented 1 month ago

Looking at binutils' readelf, it seems to work in their case because they heap-allocate a buffer, copy the section contents into it, and only then parse them. Because the memory buffer returned there is aligned, everything works out.

danielocfb commented 1 month ago

So basically, the question is: relative to what are things aligned? With that question in mind, if we read the specification again:

[...] zero to three bytes of padding, as needed to reach the next four-byte boundary within the section, [...]

Our alignment is not with respect to the beginning of the section, but rather the overall file. I suspect that's the crux of the matter.
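
A minimal sketch of the difference, under stated assumptions: `parse_debug_link` and `skip_by_address` are hypothetical names and not blazesym's actual code, the ELF file is assumed to be little-endian, and the address-based variant only models the behavior suspected here.

```rust
/// Hypothetical sketch (not blazesym's actual code): parse the contents of a
/// `.gnu_debuglink` section, padding relative to the start of the section.
/// Assumes a little-endian ELF file.
fn parse_debug_link(section: &[u8]) -> Option<(&[u8], u32)> {
    // The file name runs up to the first zero byte.
    let nul = section.iter().position(|&b| b == 0)?;
    let name = &section[..nul];
    // Offset of the CRC, counted from the start of the section.
    let crc_offset = (nul + 1 + 3) & !3;
    // `get` returns `None` if fewer than four bytes remain.
    let crc = section.get(crc_offset..crc_offset + 4)?;
    Some((name, u32::from_le_bytes([crc[0], crc[1], crc[2], crc[3]])))
}

/// Number of bytes skipped when aligning by memory address instead, for a
/// section mapped at `base` whose file name ends at section offset `nul`.
fn skip_by_address(base: u64, nul: u64) -> u64 {
    // Address of the first byte after the terminating zero byte.
    let addr = base + nul + 1;
    ((addr + 3) & !3) - addr
}
```

With the 36 section bytes from the dump, the section-relative version skips two padding bytes and yields 0xa6d4c469, the value readelf reported. Aligning the address 0x7f4df760e171 instead skips three bytes and leaves only three for the CRC.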

danielocfb commented 1 month ago

@r1viollet this hopefully is fixed now. Thanks again for the report. If you see any issues, let us know. Also, feel free to reach out if you need this fix in a release.

r1viollet commented 1 month ago

Oh wow, that explains it. That is some lawyer-level interpretation. Thanks for the thorough analysis! :bow:

r1viollet commented 1 month ago

I can use the commit, don't force a release just for me. Thanks again!