Open cuviper opened 2 months ago
I think this was already reported? https://github.com/rust-lang/rust/issues/123686
@saethlin I mentioned that one at the end:
This is similar to #123686, but that proposed fix only applies to Windows.
and AFAICS that report doesn't depend on -Cpanic=abort
either.
It might be a duplicate of #121817, cc @wesleywiser
The current behaviour doesn't surprise me since we neither force frame pointers, nor force unwind tables. To generate the stack trace usually what happens is that eh_frame to unwind the frames is used if it exists, otherwise frame pointers are used to continue unwind the frames. We build the standard library with -Cpanic=unwind
so the eh_frame is available, but past begin_panic
nothing is available anymore.
Adding -Cforce-frame-pointers=yes
allows termination (still no traces beyond begin_panic
, and adding -Cforce-unwind-tables=yes
produces a property stack trace.
97235 adds a lint (which isn't even executed with panic=abort), and should have no codegen changes.
OK, I don't know. The bisected nightly works fine when I build it myself, which implies that it's sensitive to the particular configuration, but that makes it hard to try and narrow this further. Or if the bug only arose there because of happenstance code changes, the root cause might not be in that range at all.
I can reproduce the infinite backtrace with 2020-03-01 and the target is gone in 2020-01-01 so I suspect that this bug has been in the target since we started distributing artifacts for it.
I cannot reproduce the infinite backtrace printing with -Zbuild-std
. I suspect the problem has something to do with the fact that the standard library is precompiled with -Cpanic=unwind
then we're linking to it with -Cpanic=abort
, and I don't know how to imitate that with -Zbuild-std
.
The behavior of the MIR inliner is sensitive to the stable crate hash, so if the behavior of the MIR inliner can impact whether the fatal optimization happens, it may not be possible to bisect this.
The current behaviour doesn't surprise me since we neither force frame pointers, nor force unwind tables. To generate the stack trace usually what happens is that eh_frame to unwind the frames is used if it exists, otherwise frame pointers are used to continue unwind the frames. We build the standard library with
-Cpanic=unwind
so the eh_frame is available, but pastbegin_panic
nothing is available anymore.
To be clear, the behavior seems wrong to me even using -Zbuild-std
:
ubuntu@ip-172-31-13-98:~/demo$ RUSTFLAGS=-Cpanic=abort RUST_BACKTRACE=full cargo +nightly r -Zbuild-std=std,panic_abort --target=aarch64-unknown-linux-gnu
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
Running `target/aarch64-unknown-linux-gnu/debug/demo`
thread 'main' panicked at src/main.rs:1:13:
explicit panic
stack backtrace:
0: 0xaaaabefe12a8 - std::backtrace_rs::backtrace::libunwind::trace::h3eb3677c648e214e
at /home/ubuntu/.rustup/toolchains/nightly-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
1: 0xaaaabefe12a8 - std::backtrace_rs::backtrace::trace_unsynchronized::h6b663bf10ab98a13
at /home/ubuntu/.rustup/toolchains/nightly-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
Aborted (core dumped)
Buggy in a different way, but still buggy, so maybe it's not worth splitting hairs.
The current behaviour doesn't surprise me since we neither force frame pointers, nor force unwind tables. To generate the stack trace usually what happens is that eh_frame to unwind the frames is used if it exists, otherwise frame pointers are used to continue unwind the frames. We build the standard library with
-Cpanic=unwind
so the eh_frame is available, but pastbegin_panic
nothing is available anymore.
Shouldn't the unwinder just give up if it reaches a frame with no unwinding information? This seems to be the issue here since the loop start when the unwinder reaches the first frame compiled with panic=abort
.
This target uses libgcc as the unwinder, this might be worth looking more deeply into.
A truncated backtrace is normal if everything is compiled with panic=abort
since we don't emit eh_frame by default so the unwinder doesn't know how to unwind the stack. What shouldn't be happening is that the unwinder ends up in an infinite loop since that indicates invalid unwind information.
Also note that the call into panic is a non-returning tail call. This has caused issues in the past on the ARM target (#69231) since LLVM would assume the LR value is dead and leave it dangling. If this happens to hold a function pointer to std::panicking::begin_panic
then this could lead to the infinite loop we are seeing.
Huh? I thought AArch64 code was being compiled with frame pointers?
WG-prioritization assigning priority (Zulip discussion).
@rustbot label -I-prioritize +P-high
Shouldn't the unwinder just give up if it reaches a frame with no unwinding information?
For backtracing, libgcc does have a mechanism to use a fallback mechanism if unwind info is not available. Although I just checked that it is implemented for PPC only, so this wasn't the case for aarch64.
This unfortunately makes me unable to explain why the looping behaviour disappears when we force frame pointer though.
Huh? I thought AArch64 code was being compiled with frame pointers?
Doesn't look like it. https://github.com/rust-lang/rust/blob/master/compiler/rustc_target/src/spec/targets/aarch64_unknown_linux_gnu.rs
Apparently we have frame pointers enabled for non-leaf functions on Apple platforms, but not Linux.
Another interesting data point. When building for aarch64-unknown-linux-musl
, I get this output:
stack backtrace:
0: 0x419218 - std::backtrace_rs::backtrace::libunwind::trace::h41b2803078d25074
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/../../backtrace/src/backtrace/libunwind.rs:105:5
1: 0x419218 - std::backtrace_rs::backtrace::trace_unsynchronized::hb38469496ae402aa
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x419218 - std::sys_common::backtrace::_print_fmt::h0cf53d01d01865f9
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/sys_common/backtrace.rs:68:5
3: 0x419218 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h2016a34b1ca538fd
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/sys_common/backtrace.rs:44:22
4: 0x44476c - core::fmt::rt::Argument::fmt::h96d6bfc4fa1065a9
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/core/src/fmt/rt.rs:142:9
5: 0x44476c - core::fmt::write::h3a1e5be1e1cfe579
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/core/src/fmt/mod.rs:1153:17
6: 0x4173e8 - std::io::Write::write_fmt::hcc53d7d63a73e855
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/io/mod.rs:1843:15
7: 0x419024 - std::sys_common::backtrace::_print::hae3e089c5d3e39b2
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/sys_common/backtrace.rs:47:5
8: 0x419024 - std::sys_common::backtrace::print::h1182a252cb3edf9c
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/sys_common/backtrace.rs:34:9
9: 0x41a474 - std::panicking::default_hook::{{closure}}::he5c426f80ff73911
10: 0x41a114 - std::panicking::default_hook::h5274ad032677b37f
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/panicking.rs:292:9
11: 0x41a990 - std::panicking::rust_panic_with_hook::h45e7e3752affffd6
at /rustc/c67326b063bd27ed04f306ba2e372cd92e0a8751/library/std/src/panicking.rs:779:13
12: 0x4022c8 - std::panicking::begin_panic::{{closure}}::h275b05efe23611e5
13: 0x430254 - _Unwind_Backtrace
Note that the last line is clearly incorrect: _Unwind_Backtrace
does not call begin_panic
.
More clues:
std::panicking::rust_panic_with_hook
is built as part of the standard library with -Cpanic=unwind
.std::panicking::begin_panic::{{closure}}
is generic and built as part of the main program with -Cpanic=abort
. I checked the binary with objdump and it does not have any eh_frame data.What should have happened is that we stop the backtrace at std::panicking::begin_panic::{{closure}}
because it has no unwind info to continue further. But for some reason we attempt to unwind despite the lack of unwind info, which causes this problem. Is the backtrace unwinder automatically falling back to frame pointers?
/* Procedure Call Standard for the ARM 64-bit Architecture (AArch64) * specifies that the end of the frame record chain is indicated by * the address zero in the address for the previous frame. */
OK I figured out the root cause of the problem. It's because LLVM is clobbering the link register on the tail call:
0000000000006924 <_ZN3std9panicking11begin_panic28_$u7b$$u7b$closure$u7d$$u7d$17h3fd05c3e3e7919b0E>:
6924: d100c3ff sub sp, sp, #0x30
6928: aa0003e8 mov x8, x0
692c: f940010a ldr x10, [x8]
6930: f9400509 ldr x9, [x8, #8]
6934: f9000bea str x10, [sp, #16]
6938: f9000fe9 str x9, [sp, #24]
693c: f9400bea ldr x10, [sp, #16]
6940: f9400fe9 ldr x9, [sp, #24]
6944: 910003e0 mov x0, sp
6948: f90003ea str x10, [sp]
694c: f90007e9 str x9, [sp, #8]
6950: f9400903 ldr x3, [x8, #16]
6954: f00002a1 adrp x1, 5d000 <GCC_except_table144+0x159b4>
6958: 91248021 add x1, x1, #0x920
695c: aa1f03e2 mov x2, xzr
6960: 52800028 mov w8, #0x1 // #1
6964: 12000104 and w4, w8, #0x1
6968: 2a1f03e8 mov w8, wzr
696c: 12000105 and w5, w8, #0x1
6970: 940060d9 bl 1ecd4 <_ZN3std9panicking20rust_panic_with_hook17h06b7ef7eb6d8944bE>
Note how the incoming LR value is not saved anywhere and is later clobbered by the BL instruction.
Also, my earlier statement was incorrect, _ZN3std9panicking11begin_panic28_$u7b$$u7b$closure$u7d$$u7d$17h3fd05c3e3e7919b0E
does have unwinding information:
00000088 0000000000000010 00000000 CIE
Version: 1
Augmentation: "zR"
Code alignment factor: 1
Data alignment factor: -4
Return address column: 30
Augmentation data: 1b
DW_CFA_def_cfa: r31 (sp) ofs 0
0000009c 0000000000000010 00000018 FDE cie=00000088 pc=0000000000006924..0000000000006974
DW_CFA_advance_loc: 4 to 0000000000006928
DW_CFA_def_cfa_offset: 48
This unwind info is claiming that the return address is present in the link register (this is the default if not overridden), but this is clearly incorrect in this case since the link register has been clobbered.
So the unwinder is behaving correctly, it's just that LLVM is emitting incorrect unwind info (or that it should be preserving the caller's link register).
Here is the LLVM IR for this function. I think the root of the problem is that we are assigning a personality function to this function, which forces LLVM to emit DWARF frame metadata for this function. However this is a nounwind function that only calls other nounwind functions, so LLVM clobbering the link register is perfectly valid.
I see 2 ways of fixing this:
uwtable
attribute.; ModuleID = '<stdin>'
source_filename = "rust_out.7d368e479ffa2ca5-cgu.0"
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-musl"
%"std::panicking::begin_panic::Payload<&str>" = type { %"core::option::Option<&str>" }
%"core::option::Option<&str>" = type { ptr, [1 x i64] }
@vtable.1 = external hidden unnamed_addr constant <{ ptr, [16 x i8], ptr, ptr }>, align 8
; Function Attrs: inlinehint noreturn nounwind
define hidden fastcc void @"_ZN3std9panicking11begin_panic28_$u7b$$u7b$closure$u7d$$u7d$17h275b05efe23611e5E"(ptr noalias nocapture noundef readonly align 8 dereferenceable(24) %_1) unnamed_addr #0 personality ptr @rust_eh_personality {
start:
%_4 = alloca %"std::panicking::begin_panic::Payload<&str>", align 8
call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %_4)
%inner.0 = load ptr, ptr %_1, align 8, !nonnull !3, !align !4, !noundef !3
%0 = getelementptr inbounds i8, ptr %_1, i64 8
%inner.1 = load i64, ptr %0, align 8, !noundef !3
store ptr %inner.0, ptr %_4, align 8
%1 = getelementptr inbounds i8, ptr %_4, i64 8
store i64 %inner.1, ptr %1, align 8
%2 = getelementptr inbounds i8, ptr %_1, i64 16
%_6 = load ptr, ptr %2, align 8, !nonnull !3, !align !5, !noundef !3
call void @_ZN3std9panicking20rust_panic_with_hook17h45e7e3752affffd6E(ptr noundef nonnull align 1 %_4, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) @vtable.1, ptr noalias noundef readonly align 8 dereferenceable_or_null(48) null, ptr noalias noundef nonnull readonly align 8 dereferenceable(24) %_6, i1 noundef zeroext true, i1 noundef zeroext false) #4
unreachable
}
; Function Attrs: nounwind
declare noundef i32 @rust_eh_personality(i32 noundef, i32 noundef, i64 noundef, ptr noundef, ptr noundef) unnamed_addr #1
; Function Attrs: noreturn nounwind
declare void @_ZN3std9panicking20rust_panic_with_hook17h45e7e3752affffd6E(ptr noundef nonnull align 1, ptr noalias noundef readonly align 8 dereferenceable(24), ptr noalias noundef readonly align 8 dereferenceable_or_null(48), ptr noalias noundef readonly align 8 dereferenceable(24), i1 noundef zeroext, i1 noundef zeroext) unnamed_addr #2
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(argmem: readwrite)
declare void @llvm.lifetime.start.p0(i64 immarg, ptr nocapture) #3
attributes #0 = { inlinehint noreturn nounwind "probe-stack"="inline-asm" "target-cpu"="generic" "target-features"="+v8a" }
attributes #1 = { nounwind "probe-stack"="inline-asm" "target-cpu"="generic" "target-features"="+v8a" }
attributes #2 = { noreturn nounwind "probe-stack"="inline-asm" "target-cpu"="generic" "target-features"="+v8a" }
attributes #3 = { nocallback nofree nosync nounwind willreturn memory(argmem: readwrite) }
attributes #4 = { noreturn nounwind }
!llvm.module.flags = !{!0, !1}
!llvm.ident = !{!2}
!0 = !{i32 8, !"PIC Level", i32 2}
!1 = !{i32 7, !"PIE Level", i32 2}
!2 = !{!"rustc version 1.78.0-nightly (c67326b06 2024-03-15)"}
!3 = !{}
!4 = !{i64 1}
!5 = !{i64 8}
I suppose we can also just emit the uwtable
attribute whenever we attach a personality function?
Huh? I thought AArch64 code was being compiled with frame pointers?
Doesn't look like it. https://github.com/rust-lang/rust/blob/master/compiler/rustc_target/src/spec/targets/aarch64_unknown_linux_gnu.rs
Apparently we have frame pointers enabled for non-leaf functions on Apple platforms, but not Linux.
Mmm. We probably ought to be adopting frame pointers across the board for AAPCS64 reasons. It's technically acceptable for a platform to deviate, but it should be explicit.
I think this also affects riscv64 in Fedora/RISCV. Looking at the builds logs it seems to be panic_abort_doc_tests
. It just runs forever slowly consuming the memory until it gets killed.
Huh? I thought AArch64 code was being compiled with frame pointers?
To clarify, Fedora does force-enable frame pointers for x86_64 and aarch64, so this might be affecting Rust builds on Fedora.
I reproduced the problem using nightlies for this report, so I don't think Fedora's configuration matters.
@Amanieu I reproduced it. In my case, I saw that the LLVM IR output was:
; std::panicking::begin_panic::{{closure}}
; Function Attrs: inlinehint noreturn nounwind
define internal void @"_ZN3std9panicking11begin_panic28_$u7b$$u7b$closure$u7d$$u7d$17h4691076880adeaeeE"(ptr align 8 %_1) unnamed_addr #5 personality ptr @rust_eh_personality {
start:
%0 = alloca [16 x i8], align 8
%_7 = alloca [16 x i8], align 8
%_4 = alloca [16 x i8], align 8
%inner.0 = load ptr, ptr %_1, align 8
%1 = getelementptr inbounds i8, ptr %_1, i64 8
%inner.1 = load i64, ptr %1, align 8
store ptr %inner.0, ptr %_7, align 8
%2 = getelementptr inbounds i8, ptr %_7, i64 8
store i64 %inner.1, ptr %2, align 8
%3 = load ptr, ptr %_7, align 8
%4 = getelementptr inbounds i8, ptr %_7, i64 8
%5 = load i64, ptr %4, align 8
store ptr %3, ptr %_4, align 8
%6 = getelementptr inbounds i8, ptr %_4, i64 8
store i64 %5, ptr %6, align 8
%7 = getelementptr inbounds i8, ptr %_1, i64 16
%_6 = load ptr, ptr %7, align 8
; call std::panicking::rust_panic_with_hook
call void @_ZN3std9panicking20rust_panic_with_hook17h9e3397264ef8c828E(ptr align 1 %_4, ptr align 8 @vtable.1, ptr align 8 null, ptr align 8 %_6, i1 zeroext true, i1 zeroext false) #13
unreachable
bb1: ; No predecessors!
%8 = load ptr, ptr %0, align 8
%9 = getelementptr inbounds i8, ptr %0, i64 8
%10 = load i32, ptr %9, align 8
%11 = insertvalue { ptr, i32 } poison, ptr %8, 0
%12 = insertvalue { ptr, i32 } %11, i32 %10, 1
resume { ptr, i32 } %12
}
Due to the resume
(which for some reason does not exist in your code), we need to add a personality to this function. Otherwise, we would get an error from LLVM.
In any case, if I understand correctlly, the condition that sets the personality function in this case is: https://github.com/rust-lang/rust/blob/master/compiler/rustc_codegen_ssa/src/mir/mod.rs#L182 which I don't fully understand TBH.
Also for everyone without access to a aarch64 machine and want to reproduce (paths are maybe a bit different):
echo 'fn main() { panic!() }' | rustc - -Cpanic=abort --target aarch64-unknown-linux-gnu -C "linker=aarch64-linux-gnu-gcc" && qemu-aarch64 -E LD_LIBRARY_PATH="/usr/aarch64-linux-gnu/lib64" -E RUST_BACKTRACE=full -L /usr/aarch64-linux-gnu/ ./rust_out
@Reflexe Essentially, whenever we add a personality function we must ensure that we also add the uwtable
attribute to the function, otherwise LLVM generates invalid unwinding information.
@Amanieu Thanks! this was pretty easy to implement, I created https://github.com/rust-lang/rust/pull/125844.
There are two issues still
uwtable
has any effect here: LLVM already generates an unwinding table entry for this function, is it correct to say that it generates an incorrect unwinding table entry?> echo 'fn main() { panic!() }' | rustc - -Cpanic=abort --target aarch64-unknown-linux-gnu -C "linker=aarch64-linux-gnu-gcc" && qemu-aarch64 -E LD_LIBRARY_PATH="/usr/aarch64-linux-gnu/lib64" -E RUST_BACKTRACE=full -L /usr/aarch64-linux-gnu/ ./rust_out
thread 'main' panicked at <anon>:1:13:
explicit panic
stack backtrace:
0: 0x7cd4114333c8 - trace
at /home/reflexe/rust/rust/library/std/src/../../backtrace/src/backtrace/libunwind.rs:116:5
1: 0x7cd4114333c8 - trace_unsynchronized<std::sys_common::backtrace::_print_fmt::{closure_env#1}>
at /home/reflexe/rust/rust/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x7cd4114333c8 - _print_fmt
at /home/reflexe/rust/rust/library/std/src/sys_common/backtrace.rs:68:5
3: 0x7cd4114333c8 - fmt
at /home/reflexe/rust/rust/library/std/src/sys_common/backtrace.rs:44:22
4: 0x7cd411448b80 - fmt
at /home/reflexe/rust/rust/library/core/src/fmt/rt.rs:165:63
5: 0x7cd411448b80 - write
at /home/reflexe/rust/rust/library/core/src/fmt/mod.rs:1168:21
6: 0x7cd41141a65c - write_fmt<std::sys::pal::unix::stdio::Stderr>
at /home/reflexe/rust/rust/library/std/src/io/mod.rs:1835:15
7: 0x7cd41143324c - _print
at /home/reflexe/rust/rust/library/std/src/sys_common/backtrace.rs:47:5
8: 0x7cd41143324c - print
at /home/reflexe/rust/rust/library/std/src/sys_common/backtrace.rs:34:9
9: 0x7cd41142e548 - {closure#1}
10: 0x7cd41142e22c - default_hook
at /home/reflexe/rust/rust/library/std/src/panicking.rs:298:9
11: 0x7cd41142e964 - rust_panic_with_hook
at /home/reflexe/rust/rust/library/std/src/panicking.rs:795:13
12: 0x7cd4114123f0 - std::panicking::begin_panic::{{closure}}::h4691076880adeaee
13: 0x7cd4114122f0 - std::sys_common::backtrace::__rust_end_short_backtrace::h8c00d8b5eac77590
The last entry is incorrect here. Not sure if this is related or we should open another issue for this one. Thanks for the help.
- I don't fully understand why adding
uwtable
has any effect here: LLVM already generates an unwinding table entry for this function, is it correct to say that it generates an incorrect unwinding table entry?
Yes, it generates an incorrect entry if the uwtable
attribute is absent. Specifically, it fails to preserve some registers while the unwinding tables indicate that they are actually preserved.
- The issue you saw at the beginning still exists:
No, the last entry here is correct: the backtrace properly ends at __rust_end_short_backtrace which comes from user code and therefore has no unwinding metadata.
When compiled with
-Cpanic=abort
, a program as simple asfn main() { panic!(); }
gets into a loop withRUST_BACKTRACE=full
onaarch64-unknown-linux-gnu
:With
RUST_BACKTRACE=1
, it has no backtrace at all:Current toolchain:
Using
cargo bisect-rustc
, this problem seems to be quite old:Comparison: https://github.com/rust-lang/rust/compare/46b8c23f3eb5e4d0e0aa27eb3f20d5b8fc3ed51f...f2d93935ffba3ab9d7ccb5300771a2d29b4c8bf3
97235 looks like the most obvious candidate, cc @nbdd0121 @Amanieu
The reason I'm noticing this now is because there's a new cargo test
panic_abort_doc_tests
from rust-lang/cargo#13388, which I found hanging on Fedora aarch64 in a scratch build of 1.78-beta. After about 2 hours, rustdoc got aSIGKILL
, which I assume is an OOM from capturing all that backtrace output. When reproducing this hang, I found that running itsRUST_BACKTRACE=full /tmp/rustdoctestGHjZyP/rust_out
myself showed the backtrace loop.This is similar to #123686, but that proposed fix only applies to Windows.