rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
97.81k stars 12.66k forks source link

`std::thread::Builder::spawn()` panics #78497

Open BartMassey opened 3 years ago

BartMassey commented 3 years ago

When trying to create a large number of threads, std::thread::Builder::spawn() can panic rather than returning an error.

I wrote this:

fn main() {
    let n: usize = std::env::args().nth(1).unwrap().parse().unwrap();

    let mut threads = Vec::new();
    for i in 0..n {
        let b = std::thread::Builder::new();
        let t = b.spawn(|| {
            std::thread::sleep(std::time::Duration::from_secs(3));
        }).unwrap_or_else(|e| {
            println!("{}: {}", i, e);
            std::process::exit(1);
        });
        threads.push(t);
    }
    println!("spawned");
    for t in threads {
        t.join().unwrap();
    }
    println!("joined");
}

When run as cargo run --release 10000 on my box this works as I expected: prints "spawned" in a half-second, "joined" three seconds after that. When run as cargo run --release 100000 I get

thread '<unnamed>' panicked at 'failed to set up alternative stack guard page', 16341: Resource temporarily unavailable (os error 11)
library/std/src/sys/unix/stack_overflow.rs:156:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5

From the documentation for Builder::spawn() I expected to get an Err back instead of the panic.

Meta

rustc --version --verbose:

rustc 1.47.0 (18bf6b4f0 2020-10-07)
binary: rustc
commit-hash: 18bf6b4f01a6feaf7259ba7cdae58031af1b7b39
commit-date: 2020-10-07
host: x86_64-unknown-linux-gnu
release: 1.47.0
LLVM version: 11.0

My environment: current Debian Linux, 32GB memory.

Fails the same way on nightly 2020-10-25. No stack backtrace is available.

yorickpeterse commented 3 years ago

On my system it appears the threshold to trigger this is around 33 000 threads. Take this program for example:

use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let count = 33_000;

    // "min" defaults to 100 so we get the real lowest timing, instead of 0.
    let mut min = Duration::from_secs(100);
    let mut max = Duration::from_secs(0);
    let mut total = Duration::from_secs(0);
    let mut handles = Vec::with_capacity(count);

    for i in 0..count {
        let start = Instant::now();
        let handle = thread::Builder::new()
            .stack_size(1 * 1024 * 1024)
            .spawn(move || assert!(i <= count)) // Just some dummy work
            .unwrap();

        let duration = start.elapsed();

        handles.push(handle);

        if duration < min {
            min = duration;
        } else if duration > max {
            max = duration;
        }

        total += duration;
    }

    let avg = total / (count as u32);

    for handle in handles {
        handle.join().unwrap();
    }

    println!(
        "min: {:2} µsec, avg: {:2} µsec, max: {:2} µsec",
        min.as_micros(),
        avg.as_micros(),
        max.as_micros()
    );
}

If I run this, the program will panic like so:

thread '<unnamed>' panicked at 'failed to set up alternative stack guard pagethread '', <unnamed>library/std/src/sys/unix/stack_overflow.rs' panicked at ':failed to allocate an alternative stack156', :library/std/src/sys/unix/stack_overflow.rs13:
thread '152note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
main:' panicked at '13called `Result::unwrap()` on an `Err` value: Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }
', src/main.rs:18:14
fatal runtime error: failed to initiate panic, error 5
fatal runtime error: failed to initiate panic, error 5

If I lower the number of threads to <33 000, e.g. 32 500, it runs just fine.

yorickpeterse commented 3 years ago

Should somebody run into a similar case: this is because of hitting the limit of vm.max_map_count, which on Linux defaults to 65530 (roughly the equivalent of 32GB of VRAM if I'm not mistaken). Bumping up that limit lets you work around the panic.

Setting that aside, I agree with OP: this should produce an Err instead of a panic.

Xuanwo commented 2 years ago

The panic happens at: https://github.com/rust-lang/rust/blob/5e93f6e318e687c05c8c44517de4586ed75ce3f4/library/std/src/sys/unix/stack_overflow.rs#L145-L163

How can we produce an Err here? Maybe we should unwind https://doc.rust-lang.org/src/std/sys/unix/thread.rs.html#100-108 ?

BartMassey commented 2 years ago

Yes, it's a mystery how to fix this one without massive rewriting. I rewrote library/std/src/sys/unix/stack_overflow.rs to return io::Result instead of panicking, but then it's hard to see how to get stack_overflow::Handler::new() to be invoked in parent context where that error is useful: it may require breaking things up into pieces so that the mmap()s can be done outside the child thread and passed in. Ugh.

BartMassey commented 2 years ago

Here is a patch I wrote that purports to close this: https://gist.github.com/BartMassey/abb862f42e1144224dc9a77897eb34a9

I haven't put in a pull request yet because I think this needs careful review and testing even before that. In particular, I have tried to be careful about memory allocation and freeing both on the heap and through mmap, but I'm not convinced I have all the semantics right. All I know is that the test at the top of this issue now passes, and valgrind reports no leaks in that case. What's the easiest way to check for mmapped memory leaks?

This patch does quite a bit of violence to unix/stack_overflow.rs and a bit to unix/thread.rs. It works by doing the mmap() and mprotect() for the new signal stack in the parent thread before trying to spawn the child: that way an error can be returned if either of these operations fails.The child either uses this memory for a new signal stack or releases it if it doesn't need one (in which case there's some useless overhead that I can't see how to avoid).

I think unix/stack_overflow.rs could be cleaned up still further, and some potential bugs fixed. For now, though, I'll leave it.

Thanks for looking at this!

BartMassey commented 2 years ago

Sorry about the title changes: just couldn't live with :thread: any longer.