Open Rot127 opened 6 months ago
Rust built by CI links against an old glibc version for backwards compatibility. Maybe symbol versioning makes a difference? Having strace
print stacktraces for each syscall might shed some light if different paths are taken.
Local libc version:
> /usr/lib64/libc.so.6
GNU C Library (GNU libc) stable release version 2.39.
Copyright (C) 2024 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 14.0.1 20240411 (Red Hat 14.0.1-0).
libc ABIs: UNIQUE IFUNC ABSOLUTE
Minimum supported kernel: 3.2.0
For bug reporting instructions, please see:
<https://www.gnu.org/software/libc/bugs.html>.
> Having strace print stacktraces for each syscall might shed some light if different paths are taken.
They are indeed very different for the scope() function. But it doesn't seem to be related to the libc version:
CI toolchain with abort:
futex(0x7569c4474a08, FUTEX_LOCK_PI, NULL) = -1 EPERM (Operation not permitted)
/usr/lib64/libc.so.6(__futex_lock_pi64+0x25) [0x929c5]
/usr/lib64/libc.so.6(__pthread_mutex_lock_full+0x267) [0x99207]
/usr/lib64/libc.so.6(__cxa_thread_atexit_impl+0x69) [0x42779]
mylib.so(std::sys_common::thread_info::current_thread+0x3b) [0x11d3eb]
mylib.so(std::thread::current+0x5) [0x118085]
mylib.so(std::thread::scoped::scope+0x83) [0x7af23]
mylib.so(do_something+0x205) [0x43425]
With the locally built toolchain it never reaches __futex_lock_pi64. The next syscall executed is from within the scope() closure.
/usr/lib64/libc.so.6(__write+0x4d) [0x10b86d]
mylib.so(std::sys::pal::unix::fd::FileDesc::write+0x26) [0x14a126]
mylib.so(<std::sys::pal::unix::stdio::Stdout as std::io::Write>::write+0x34) [0x136154]
mylib.so(<std::io::stdio::StdoutRaw as std::io::Write>::write+0x1a) [0x147dea]
mylib.so(std::io::buffered::bufwriter::BufWriter<W>::flush_buf+0x85) [0x137fd5]
mylib.so(<std::io::buffered::bufwriter::BufWriter<W> as std::io::Write>::flush+0x8) [0x1381a8]
mylib.so(<&std::io::stdio::Stdout as std::io::Write>::flush+0x3d) [0x14804d]
mylib.so(<std::io::stdio::Stdout as std::io::Write>::flush+0xd) [0x147fad]
mylib.so(helper::progress::ProgressBar::print+0xee6) [0x122106]
mylib.so(helper::progress::ProgressBar::update_print+0x69) [0x1222f9]
mylib.so(do_something::{{closure}}+0x61) [0x8b811]
mylib.so(std::thread::scoped::scope::{{closure}}+0x35) [0x89db5]
mylib.so(<core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once+0x20) [0xa1d70]
mylib.so(std::panicking::try::do_call+0x2b) [0x846cb]
mylib.so(__rust_try+0x1a) [0x84b1a]
mylib.so(std::panicking::try+0x51) [0x84551]
mylib.so(std::thread::scoped::scope+0x2e5) [0x89a95]
mylib.so(do_something+0x205) [0x522b5]
Your from-source toolchain is also 1.78? There were some recent changes around thread locals and thread parking on master.
Yes:
> git show HEAD
commit 9b00956e56009bab2aa15d7bff10916599e3d6d6 (HEAD, tag: 1.78.0, origin/stable, stable)
...
But let me try with the latest master and see if the stack trace changes again.
Same result as above. __futex_lock_pi64 is never called.
Building with the self-built toolchain as described above now also gives me the error, but only if certain println!() calls are present. It is really, really weird.
Building with the stable toolchain from rustup works in these cases.
After removing some println! calls and fields of some structs, I now get this with the rustup toolchain:
Fatal glibc error: tpp.c:83 (__pthread_tpp_change_priority): assertion failed: new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)
Aborted (core dumped)
The program runs with two threads (one main, one scoped spawned). Self built works fine again.
With 1.78.0 I now even have the problem that the thread::scope() closure is not entered at all.
println!("Before scoped");
thread::scope(|s| {
    println!("Into scoped");
    ...
});
println!("After scoped");
This prints only "Before scoped" and freezes.
Everything works fine with toolchains <= 1.76.0. 1.77.0 and 1.78.0 freeze.
There are only three lines of actual code difference.
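For anyone who wants to compare toolchains, the shape of the snippet above can be turned into a standalone program. This is only a sketch of the pattern, not a reproducer: the function name `run_scoped` and the spawned closure are assumptions made for illustration, and on a healthy setup it should simply run through all three stages.

```rust
use std::thread;

/// Hypothetical helper: runs the Before/Into/After pattern from the
/// snippet above and records which stages were reached.
fn run_scoped() -> Vec<&'static str> {
    let mut stages = vec!["Before scoped"];
    println!("Before scoped");

    // thread::scope returns whatever the closure returns.
    let worker_ran = thread::scope(|s| {
        println!("Into scoped");
        // One scoped thread, matching the "one main, one scoped spawned" setup.
        let handle = s.spawn(|| {
            println!("Inside spawned thread");
            true
        });
        handle.join().expect("scoped thread panicked")
    });
    if worker_ran {
        stages.push("Into scoped");
    }

    println!("After scoped");
    stages.push("After scoped");
    stages
}

fn main() {
    let stages = run_scoped();
    println!("{stages:?}");
}
```

If the freeze described above occurred, the program would stop after the first line of output instead of returning.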
Ok, seems to work fine in 1.79.0. Will keep this open for a while and test. But it can probably be closed.
Still a problem in 1.79.0. It seems to be related to formatted strings?
The assert in libc:
Fatal glibc error: tpp.c:83 (__pthread_tpp_change_priority): assertion failed: new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)
Triggered by having a formatted debug_assert!:
fn some_fcn(&self) {
    for x in self
        .get_locked_values()
        .write()
        .unwrap() {
        // ...
        debug_assert!(
            <condition>,
            "Some formatted string: {} {} {}",
            arg0,
            arg1,
            arg2
        );
    }
}
some_fcn is called by another function, within the local thread.
Removing any of the arguments will make the code work. Two of the arguments are clones or references of members of the write-locked struct. The last one comes from outside the loop.
I can also split up the debug_assert! into two print! calls and a debug_assert!(false), and it doesn't crash, unless an argument in println! has a format specifier like {:#x}. So it seems to be connected to the number of arguments passed and their formatting.
Crashes:
print!("bla {}", arg0);
println!(
    "at {:#x} balbla {} blabla", // Without :#x it works fine.
    arg1, arg2
);
Simply using 0, 1, 2 for the arguments makes the code work again.
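To make the pattern concrete, here is a self-contained sketch of a loop over a write-locked collection with a three-argument formatted debug_assert!, plus a print using {:#x}. The names `Registry`, `check_all` and `describe` are hypothetical and not taken from the affected code, and this sketch is not expected to trigger the glibc assertion by itself.

```rust
use std::sync::RwLock;

// Hypothetical stand-in for the struct whose members are write-locked.
struct Registry {
    values: RwLock<Vec<u64>>,
}

impl Registry {
    /// Iterates over the write-locked values and checks each one with a
    /// formatted debug_assert! that takes three arguments.
    fn check_all(&self, limit: u64) -> usize {
        let guard = self.values.write().unwrap();
        let mut checked = 0;
        for value in guard.iter() {
            let copy = *value; // copy of a member of the locked struct
            debug_assert!(
                *value <= limit,
                "value {:#x} exceeds limit {} (copy {})",
                *value,
                limit, // comes from outside the loop
                copy
            );
            checked += 1;
        }
        checked
    }
}

/// Formats an address with the alternate hex specifier, as in the
/// println! that made the difference above.
fn describe(addr: u64, name: &str, count: usize) -> String {
    format!("at {:#x} in {} ({})", addr, name, count)
}

fn main() {
    let registry = Registry {
        values: RwLock::new(vec![1, 2, 3]),
    };
    println!("checked {} values", registry.check_all(10));
    println!("{}", describe(0x1000, "entry", 2));
}
```

The {:#x} specifier only adds the 0x prefix to the lowercase hex form, so `describe(0x1000, "entry", 2)` yields "at 0x1000 in entry (2)".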
I would debug it, but really don't have time, unfortunately.
Stack-trace:
Fatal glibc error: tpp.c:83 (__pthread_tpp_change_priority): assertion failed: new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)
==272775==
==272775== Process terminating with default action of signal 6 (SIGABRT): dumping core
==272775== at 0x4B5B4A4: __pthread_kill_implementation (pthread_kill.c:44)
==272775== by 0x4B02C4D: raise (raise.c:26)
==272775== by 0x4AEA901: abort (abort.c:79)
==272775== by 0x4AEB766: __libc_message_impl.cold (libc_fatal.c:132)
==272775== by 0x4AFABB6: __libc_assert_fail (__libc_assert_fail.c:31)
==272775== by 0x4B61C95: __pthread_tpp_change_priority (tpp.c:83)
==272775== by 0x4B5C473: __pthread_mutex_lock_full (pthread_mutex_lock.c:567)
==272775== by 0x4B04D99: __cxa_thread_atexit_impl (cxa_thread_atexit_impl.c:117)
==272775== by 0x80DD408: register_dtor<core::cell::once::OnceCell<std::thread::Thread>> (fast_local.rs:161)
==272775== by 0x80DD408: __getit (fast_local.rs:56)
==272775== by 0x80DD408: try_with<core::cell::once::OnceCell<std::thread::Thread>, std::thread::try_current::{closure_env#0}, std::thread::Thread> (local.rs:285)
==272775== by 0x80DD408: try_current (mod.rs:716)
==272775== by 0x80DD408: std::thread::current (mod.rs:741)
==272775== by 0x7FBDDF4: std::thread::scoped::scope (scoped.rs:144)
So the assert can also be reached by printing std::thread::current().id().
Fatal glibc error: tpp.c:83 (__pthread_tpp_change_priority): assertion failed: new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)
Fatal glibc error: tpp.c:83 (__pthread_tpp_change_priority): assertion failed: new_prio == -1 || (new_prio >= fifo_min_prio && new_prio <= fifo_max_prio)
Still can't come up with a minimal working example. But will continue to try.
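For reference, the call mentioned above looks like this in isolation. This is a sketch under the assumption that one scoped thread is spawned, as in the affected program; the function name `ids_differ` is made up for illustration, and the snippet on its own is not known to hit the assertion.

```rust
use std::thread;

/// Hypothetical helper: prints the thread id from the main thread and
/// from one scoped thread, and reports whether the two ids differ.
fn ids_differ() -> bool {
    // std::thread::current() is the frame directly below
    // __cxa_thread_atexit_impl in the Valgrind trace.
    let main_id = thread::current().id();
    println!("main: {:?}", main_id);

    let worker_id = thread::scope(|s| {
        s.spawn(|| {
            let id = thread::current().id();
            println!("worker: {:?}", id);
            id
        })
        .join()
        .expect("scoped thread panicked")
    });

    main_id != worker_id
}

fn main() {
    println!("ids differ: {}", ids_differ());
}
```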
NOTE
The problem described below is fixed in 1.79.0. Another version of it happens now. Skip to https://github.com/rust-lang/rust/issues/124920#issuecomment-2236197798
System:
OS: Fedora 40
Arch: x86_64
Toolchain: https://sh.rustup.rs v1.78.0
Disclaimer
The following bug is really, really weird, and I struggle to make a minimal working example. Unfortunately, I don't want to share the code it appears in publicly yet. But I will invite anyone who wants to fix it to the repo.
Description
When compiling something like the following code with the toolchain installed via curl sh.rustup.rs, thread::scope aborts with "The futex facility returned an unexpected error code."
The strace shows that the thread is trying to attach itself to a futex that it is not allowed to attach to (according to the man pages):
Now to the funny part. This error only happens with the toolchain obtained from sh.rustup.rs. I built the same version locally with and without debug symbols and the error goes away. I assume this happens due to different optimizations done by my locally built toolchains and the rustup one?
Also, with my locally built toolchains futex is not called at all (in the function where the abort happens).
Additional clues
Logs
Full strace:
Valgrind stacktrace