Open BartMassey opened 3 years ago
On my system it appears the threshold to trigger this is around 33 000 threads. Take this program for example:
use std::thread;
use std::time::{Duration, Instant};
fn main() {
let count = 33_000;
// "min" defaults to 100 so we get the real lowest timing, instead of 0.
let mut min = Duration::from_secs(100);
let mut max = Duration::from_secs(0);
let mut total = Duration::from_secs(0);
let mut handles = Vec::with_capacity(count);
for i in 0..count {
let start = Instant::now();
let handle = thread::Builder::new()
.stack_size(1 * 1024 * 1024)
.spawn(move || assert!(i <= count)) // Just some dummy work
.unwrap();
let duration = start.elapsed();
handles.push(handle);
if duration < min {
min = duration;
} else if duration > max {
max = duration;
}
total += duration;
}
let avg = total / (count as u32);
for handle in handles {
handle.join().unwrap();
}
println!(
"min: {:2} µsec, avg: {:2} µsec, max: {:2} µsec",
min.as_micros(),
avg.as_micros(),
max.as_micros()
);
}
If I run this, the program will panic like so:
thread '<unnamed>' panicked at 'failed to set up alternative stack guard pagethread '', <unnamed>library/std/src/sys/unix/stack_overflow.rs' panicked at ':failed to allocate an alternative stack156', :library/std/src/sys/unix/stack_overflow.rs13:
thread '152note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
main:' panicked at '13called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
', src/main.rs:18:14
fatal runtime error: failed to initiate panic, error 5
fatal runtime error: failed to initiate panic, error 5
If I lower the number of threads to <33 000, e.g. 32 500, it runs just fine.
Should somebody run into a similar case: this is because of hitting the limit of vm.max_map_count
, which on Linux defaults to 65530 (roughly the equivalent of 32GB of VRAM if I'm not mistaken). Bumping up that limit lets you work around the panic.
Setting that aside, I agree with OP: this should produce an Err
instead of a panic.
The panic happens at: https://github.com/rust-lang/rust/blob/5e93f6e318e687c05c8c44517de4586ed75ce3f4/library/std/src/sys/unix/stack_overflow.rs#L145-L163
How can we produce an Err
here? Maybe we should unwind
https://doc.rust-lang.org/src/std/sys/unix/thread.rs.html#100-108 ?
Yes, it's a mystery how to fix this one without massive rewriting. I rewrote library/std/src/sys/unix/stack_overflow.rs
to return io::Result
instead of panicking, but then it's hard to see how to get stack_overflow::Handler::new()
to be invoked in parent context where that error is useful: it may require breaking things up into pieces so that the mmap()
s can be done outside the child thread and passed in. Ugh.
Here is a patch I wrote that purports to close this: https://gist.github.com/BartMassey/abb862f42e1144224dc9a77897eb34a9
I haven't put in a pull request yet because I think this needs careful review and testing even before that. In particular, I have tried to be careful about memory allocation and freeing both on the heap and through mmap, but I'm not convinced I have all the semantics right. All I know is that the test at the top of this issue now passes, and valgrind reports no leaks in that case. What's the easiest way to check for mmapped memory leaks?
This patch does quite a bit of violence to unix/stack_overflow.rs
and a bit to unix/thread.rs
. It works by doing the mmap()
and mprotect()
for the new signal stack in the parent thread before trying to spawn the child: that way an error can be returned if either of these operations fails.The child either uses this memory for a new signal stack or releases it if it doesn't need one (in which case there's some useless overhead that I can't see how to avoid).
I think unix/stack_overflow.rs
could be cleaned up still further, and some potential bugs fixed. For now, though, I'll leave it.
Thanks for looking at this!
Sorry about the title changes: just couldn't live with :thread: any longer.
When trying to create a large number of threads,
std::thread::Builder::spawn()
can panic rather than returning an error.I wrote this:
When run as
cargo run --release 10000
on my box this works as I expected: prints "spawned" in a half-second, "joined" three seconds after that. When run ascargo run --release 100000
I getFrom the documentation for
Builder::spawn()
I expected to get anErr
back instead of the panic.Meta
rustc --version --verbose
:My environment: current Debian Linux, 32GB memory.
Fails the same way on nightly 2020-10-25. No stack backtrace is available.