Closed r1viollet closed 1 month ago
I could not attach the libraries to the issue, so this will take a few extra download steps to reproduce.
tar -xvf linux-native-symbols.tar.gz ./symbols/linux-x64/linux-x64/Datadog.Profiler.Native.debug
linux-x64/Datadog.Profiler.Native.so
If I modify the logic to just take the last 4 bytes, it fixes the issue. However, I am not sure whether this complies with the ELF specification.
let crc_bytes = &data[data_len - 4..];
let crc = u32::from_le_bytes([crc_bytes[0], crc_bytes[1], crc_bytes[2], crc_bytes[3]]);
We can see that the address of the CRC is not aligned:
Data before extracting checksum: [0, 0, 65, 61, 40, 17] - addr=0x743df30ec0b1
Checksum bytes: [65, 61, 40, 17], CRC: 0x11283d41
0x0000000008e842: anyhow::error::<impl anyhow::Error>::construct @ 0x8e7bd+0x85 /go/src/github.com/DataDog/apm-reliability/libddprof-build/.cargo/registry/src/github.com-1ecc6299db9ec823/anyhow-1.0.81/src/error.rs:245:40
alloc::boxed::Box<T>::new @ /rustc/84c898d65adf2f39a5a98507f1fe0ce10a2b8dbc/library/alloc/src/boxed.rs [inlined]
Thanks for the report! Will take a look today.
Our logic is based on https://sourceware.org/gdb/current/onlinedocs/gdb.html/Separate-Debug-Files.html, which states:
A debug link is a special section of the executable file named .gnu_debuglink. The section must contain:
A filename, with any leading directory components removed, followed by a zero byte,
zero to three bytes of padding, as needed to reach the next four-byte boundary within the section, and
a four-byte CRC checksum, stored in the same endianness used for the executable file itself.
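To make that layout concrete, here is a minimal sketch that builds such a section from a file name and a CRC. The helper name `build_debug_link` is made up for illustration (it is not part of blazesym or binutils), and little-endian byte order is assumed, as for an x86-64 ELF file.

```rust
/// Build the contents of a `.gnu_debuglink` section.
/// Hypothetical helper for illustration only.
fn build_debug_link(file_name: &str, crc: u32) -> Vec<u8> {
    let mut section = Vec::new();
    // File name without directory components, followed by a zero byte.
    section.extend_from_slice(file_name.as_bytes());
    section.push(0);
    // Zero to three bytes of padding, counted from the start of the section.
    while section.len() % 4 != 0 {
        section.push(0);
    }
    // Four-byte CRC. Little-endian is an assumption (x86-64 target).
    section.extend_from_slice(&crc.to_le_bytes());
    section
}

fn main() {
    let section = build_debug_link("Datadog.Profiler.Native.debug", 0xa6d4c469);
    println!("{} bytes: {:02x?}", section.len(), section);
}
```

For the 29-byte name in this report, that gives one terminator byte, two padding bytes, and the CRC at offset 32, for 36 (0x24) bytes in total.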
To the best of my reading, that's exactly what we expect. It's interesting that basically nothing is aligned in this binary:
$ readelf --sections /tmp/repro/linux-x64/Datadog.Profiler.Native.so --wide
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
...
[28] .gnu_debuglink PROGBITS 0000000000000000 70b153 000024 00 0 0 1
...
So there is no alignment requirement on the section itself (Al=1). That's different to what I see on binaries I've looked at, but it's not necessarily wrong. The offset is 0x70b153 from the start of the binary (which is page aligned as per my understanding), so that is unaligned. Also not wrong (I believe), just uncommon.
Our alignment logic itself seems kosher:
--- src/dwarf/debug_link.rs
+++ src/dwarf/debug_link.rs
@@ -145,13 +145,16 @@ pub(crate) fn read_debug_link(parser: &ElfParser) -> Result<Option<(&OsStr, u32)
// SANITY: We just found the index so the section should always be
// found.
let mut data = parser.section_data(idx).unwrap();
+ println!("RAW DATA: {data:p}: {data:#x?}");
let file = data
.read_cstr()
.ok_or_invalid_data(|| "failed to read debug link file name")?;
let file = bytes_to_os_str(file.to_bytes())?;
+ println!("BEFORE ALIGN: {data:p}: {data:#x?}");
let () = data.align(4).ok_or_invalid_data(|| {
"debug link section contains insufficient data: checksum not found"
})?;
+ println!("AFTER ALIGN: {data:p}: {data:#x?}");
// TODO: The CRC value is in the same endianess as the ELF file itself. Once
// we support non-host endianesses we need to take that into account.
let crc = data
RAW DATA: 0x7f4df760e153: [
0x44,
0x61,
0x74,
0x61,
0x64,
0x6f,
0x67,
0x2e,
0x50,
0x72,
0x6f,
0x66,
0x69,
0x6c,
0x65,
0x72,
0x2e,
0x4e,
0x61,
0x74,
0x69,
0x76,
0x65,
0x2e,
0x64,
0x65,
0x62,
0x75,
0x67,
0x0,
0x0,
0x0,
0x69,
0xc4,
0xd4,
0xa6,
]
BEFORE ALIGN: 0x7f4df760e171: [
0x0,
0x0,
0x69,
0xc4,
0xd4,
0xa6,
]
AFTER ALIGN: 0x7f4df760e174: [
0xc4,
0xd4,
0xa6,
]
This all seems to be by the book.
Have you checked how other tools behave by any chance? If I open the file in gdb I see:
Missing separate debuginfo for /tmp/repro/linux-x64/Datadog.Profiler.Native.so.
The debuginfo package for this file is probably broken.
So it may be choking on the same issue.
llvm-symbolizer also does basically nothing, but it's not exactly vocal as to why that is (that is, it could conceivably have other reasons):
$ llvm-symbolizer --obj=/tmp/repro/linux-x64/Datadog.Profiler.Native.so --functions 0x8e842 --verbose --debug-file-directory=/tmp/repro/symbols/linux-x64/linux-x64
??
Filename: ??
Line: 0
Column: 0
Same with llvm-addr2line:
$ llvm-addr2line --obj=/tmp/repro/linux-x64/Datadog.Profiler.Native.so --functions 0x8e842 --debug-file-directory=/tmp/repro/symbols/linux-x64/linux-x64 --verbose
??
Filename: ??
Line: 0
Column: 0
And eu-addr2line:
$ eu-addr2line -e /tmp/repro/linux-x64/Datadog.Profiler.Native.so --functions 0x8e842
??
??:0
From a brief look, these all seem to be debug link aware. E.g., they work on a stripped binary with only a debug link when I use valid blazesym test data:
eu-addr2line -e .../blazesym/data/test-stable-addrs-stripped-with-link.bin --functions 0x2000100
factorial
.../blazesym/data/test-stable-addrs.c:10:27
So to me, everything seems to point to this being a faulty binary. What toolchain and version generated it?
On the other hand...
$ readelf -wk Datadog.Profiler.Native.so
Datadog.Profiler.Native.so: Found separate debug info file: Datadog.Profiler.Native.debug
Contents of the .gnu_debuglink section (loaded from Datadog.Profiler.Native.so):
Separate debug info file: Datadog.Profiler.Native.debug
CRC value: 0xa6d4c469
and
$ readelf Datadog.Profiler.Native.so --debug-dump=follow-links --process-links --symbols --wide
In linked file '/tmp/repro/linux-x64/Datadog.Profiler.Native.debug' symbol section '.symtab' contains 21459 entries:
<more symbols than without debug links following>
So readelf at least seems to be able to make sense of the data.
https://github.com/libbpf/blazesym/issues/769#issuecomment-2263475250
Actually, it seems that if I link the .debug file into the directory of the main binary, everything works with the other tools. So they don't seem to be choking on the CRC.
Looking at binutils' readelf, the reason it seems to work in their case is that they heap-allocate memory for the section contents and copy them over before parsing. Because the memory buffer returned there is aligned, everything works out.
So basically, the question is what things are aligned relative to. Now with that question in mind, if we read the specification again:
[...] zero to three bytes of padding, as needed to reach the next four-byte boundary within the section, [...]
Our alignment is not with respect to the beginning of the section, but rather the overall file. I suspect that's the crux of the matter.
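A rough sketch of the difference, using the bytes and addresses from the debug output above. The names `padding_to_four` and `parse_debug_link` are hypothetical and this is not the actual blazesym fix; it only shows padding computed from the offset inside the section, with little-endian byte order assumed.

```rust
/// Number of bytes needed to round `pos` up to a multiple of four.
fn padding_to_four(pos: u64) -> u64 {
    (4 - pos % 4) % 4
}

/// Parse `.gnu_debuglink` contents, aligning relative to the section start.
/// Hypothetical helper, not blazesym's actual code.
fn parse_debug_link(section: &[u8]) -> Option<(&[u8], u32)> {
    // The file name runs up to the first zero byte.
    let nul = section.iter().position(|&b| b == 0)?;
    let name = &section[..nul];
    // Offset just past the terminator, rounded up to a multiple of four.
    // This is an offset within the section, not a memory address.
    let crc_off = nul.checked_add(4)? & !3;
    let b = section.get(crc_off..crc_off.checked_add(4)?)?;
    // Little-endian is an assumption (x86-64 target).
    let crc = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
    Some((name, crc))
}

fn main() {
    // Section contents as printed in the RAW DATA dump.
    let mut section = b"Datadog.Profiler.Native.debug".to_vec();
    section.extend_from_slice(&[0x0, 0x0, 0x0, 0x69, 0xc4, 0xd4, 0xa6]);

    // After the name and its terminator we are 30 bytes into the section,
    // which the debug output shows at address 0x7f4df760e171.
    println!("padding by section offset: {}", padding_to_four(30));
    println!("padding by address: {}", padding_to_four(0x7f4df760e171));

    let (name, crc) = parse_debug_link(&section).expect("malformed section");
    println!("{}: {:#x}", String::from_utf8_lossy(name), crc);
}
```

Counting from the section start skips two padding bytes and leaves the four CRC bytes, which decode to 0xa6d4c469, the same value readelf reports. Counting from the address skips three bytes and leaves only three, which is the failure seen here.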
@r1viollet this hopefully is fixed now. Thanks again for the report. If you see any issues, let us know. Also, feel free to reach out if you need this fix in a release.
Oh wow, that explains it. That is some lawyer level interpretation. Thanks for the thorough analysis! :bow:
I can use the commit, don't force a release just for me. Thanks again!
Description
I received a report of a failure symbolizing some libraries that use split debug info. Investigating further, I could see that everything should be working as expected.
So I went further to try and understand. I removed the debug link and replaced it, and that fixed it! Looking at the CRC read with the broken version, the data is not a u32, which yields a failure:
Versus the version that succeeds (after re-writing the debug link):
The difference is probably related to the version of objcopy being used and alignment constraints imposed. I will add the library that highlights this issue (it is open source code) and continue digging when I have some time.