rayon-rs / rayon

Rayon: A data parallelism library for Rust

Stack overflow in release mode using `IntoParallelIterator` on `Range<i32>` - User error or Rayon issue? #1194

Closed: pubfnbar closed this issue 1 month ago

pubfnbar commented 1 month ago
use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

fn main() {
    static SUM: AtomicU64 = AtomicU64::new(0);

    (0..1000).into_par_iter().for_each(|i| {
        let x = [i as u8; 200_000];
        let sum: u64 = x.iter().map(|&a| u64::from(a)).sum();
        SUM.fetch_add(sum, Ordering::SeqCst);
    });
    println!("sum: {SUM:?}");
}

Given the above code in "main.rs" and rayon = "=1.10.0" in "Cargo.toml", running on Ubuntu 22.04: "cargo run" prints "sum: 24943200000", but "cargo run --release" crashes with a stack overflow, printing at least one of the following messages:

thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow

Adding with_min_len(5) after into_par_iter() allows "cargo run --release" to run without stack overflow.
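
For reference, the workaround is a single extra call in the chain; the full repro with it applied looks like this (a sketch, and 5 is just the value that happened to work here, not a tuned number):

use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

fn main() {
    static SUM: AtomicU64 = AtomicU64::new(0);

    // with_min_len(5) keeps each task at 5 or more indices, so the
    // parallel iterator splits less aggressively.
    (0..1000).into_par_iter().with_min_len(5).for_each(|i| {
        let x = [i as u8; 200_000];
        let sum: u64 = x.iter().map(|&a| u64::from(a)).sum();
        SUM.fetch_add(sum, Ordering::SeqCst);
    });
    println!("sum: {SUM:?}");
}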

If the above code is using Rayon incorrectly, then what is the correct and idiomatic way to loop over a bunch of indices in parallel without blowing up the stack?

cuviper commented 1 month ago

The default thread stack size is 2 MiB. With your 200,000-byte local, only about 10 such stack frames can be nested before overflow.

Your code doesn't have any direct recursion, but rayon can cause that anyway due to work-stealing whenever a function stalls waiting for other rayon work. It would seem like your code wouldn't even have that, since your for_each closure doesn't make any rayon calls, but I think in release mode your closure is getting inlined into the rayon "plumbing" code that does a bunch of join recursion for the parallel iterator -- and each of those may also work-steal while waiting for the other half of the join to complete.

As a quick test, your code runs fine for me with .for_each(#[inline(never)] |i| { ... }), so this seems to confirm that theory. When it's not inlined, the large stack usage is short-lived, so it doesn't affect the work-stealing recursion. This is probably the best workaround for your code.
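
If putting the attribute directly on the closure is awkward on a given toolchain, an equivalent spelling (sketched here; the helper's name is arbitrary) hoists the closure body into a free function marked #[inline(never)]:

use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

static SUM: AtomicU64 = AtomicU64::new(0);

// Not inlined, so the 200,000-byte array lives only in this function's own
// short-lived frame, not in the recursive plumbing frames that call it.
#[inline(never)]
fn add_partial_sum(i: i32) {
    let x = [i as u8; 200_000];
    let sum: u64 = x.iter().map(|&a| u64::from(a)).sum();
    SUM.fetch_add(sum, Ordering::SeqCst);
}

fn main() {
    (0..1000).into_par_iter().for_each(add_partial_sum);
    println!("sum: {SUM:?}");
}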

Adding with_min_len will limit how far the parallel iterator may split, which indirectly limits how much work-stealing recursion may occur, but the necessary limit may depend on your input length and number of threads. You could also build the pool with a larger stack_size, but again it's not easy to predict how much is sufficient.
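
A sketch of the bigger-stack route, with 8 MiB chosen purely for illustration (as noted, there's no reliable way to know in advance how much is enough):

use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

fn main() {
    static SUM: AtomicU64 = AtomicU64::new(0);

    // A dedicated pool whose worker threads get 8 MiB stacks instead of
    // the 2 MiB default.
    let pool = rayon::ThreadPoolBuilder::new()
        .stack_size(8 * 1024 * 1024)
        .build()
        .expect("failed to build thread pool");

    pool.install(|| {
        (0..1000).into_par_iter().for_each(|i| {
            let x = [i as u8; 200_000];
            let sum: u64 = x.iter().map(|&a| u64::from(a)).sum();
            SUM.fetch_add(sum, Ordering::SeqCst);
        });
    });
    println!("sum: {SUM:?}");
}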


Aside from your problem, this example could also use a parallel .map(...).sum() instead of using atomics, probably still doing the same serial sum within the map. I don't know if this is applicable to your real-world code, but it's something to keep in mind.
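
That version would look roughly like this (a sketch; it still does the same serial sum, with the same big local, inside the map, so the inlining caveat above still applies):

use rayon::prelude::*;

fn main() {
    // Each parallel task returns its partial sum and rayon reduces them,
    // so no shared atomic counter is needed.
    let sum: u64 = (0..1000)
        .into_par_iter()
        .map(|i| {
            let x = [i as u8; 200_000];
            x.iter().map(|&a| u64::from(a)).sum::<u64>()
        })
        .sum();
    println!("sum: {sum}");
}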

pubfnbar commented 1 month ago

The #[inline(never)] attribute seems to be a 100% effective fix/workaround for this issue. In my real-world code that blew up the stack, the closure did a lot of different things (not a sum) and had ~50 KB of stack-allocated arrays in total, but the Range of indices iterated in parallel was larger, which seems to reduce the maximum amount of stack allocation the closure can do if it's inlined. For example, the following code blows up the stack with (only) a 40,000-byte array:

use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

fn main() {
    static SUM: AtomicU64 = AtomicU64::new(0);

    (0..100_000).into_par_iter().for_each(|i| {
        let x = [i as u8; 40_000];
        let sum: u64 = x.iter().map(|&a| u64::from(a)).sum();
        SUM.fetch_add(sum, Ordering::SeqCst);
    });
    println!("sum: {SUM:?}");
}

Anyway, I get a sense that this is just inherent to the way Rayon works rather than something that could be "fixed" in the Rayon library code. And that's fine now that I know how to fix the issue (although I actually fixed my code simply by reducing stack allocation in the closure).
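
One way to "reduce stack allocation in the closure" for the second repro above is to move the big buffer to the heap (a sketch, assuming a Vec is acceptable in place of the array):

use std::sync::atomic::{AtomicU64, Ordering};

use rayon::prelude::*;

fn main() {
    static SUM: AtomicU64 = AtomicU64::new(0);

    (0..100_000).into_par_iter().for_each(|i| {
        // vec! puts the 40,000 bytes on the heap, so the closure's stack
        // frame stays small no matter how the plumbing inlines or recurses.
        let x = vec![i as u8; 40_000];
        let sum: u64 = x.iter().map(|&a| u64::from(a)).sum();
        SUM.fetch_add(sum, Ordering::SeqCst);
    });
    println!("sum: {SUM:?}");
}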

cuviper commented 1 month ago

> Anyway, I get a sense that this is just inherent to the way Rayon works rather than something that could be "fixed" in the Rayon library code.

We could probably force some #[inline(never)] in the plumbing where it calls "user" code, but that would pessimize everyone, since we have no way to see how much stack a call will use.

> (although I actually fixed my code simply by reducing stack allocation in the closure)

That seems best!