
stack overflow in nested ThreadPool.scope #751

Open jiacai2050 opened 4 years ago

jiacai2050 commented 4 years ago

The code below reproduces this error.

use rayon;
use rayon::prelude::*;

fn main() {
    let pool_outer = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .thread_name(|i| format!("outer-thread-{}", i))
        .build()
        .unwrap();
    let pool_inner = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .thread_name(|i| format!("inner-thread-{}", i))
        .build()
        .unwrap();

    pool_outer.scope(|_| {
        (0..90000000).into_par_iter().for_each(|out_i| {
        // Using rayon::scope(|_| { ... }) here instead of pool_inner.scope
        // does not cause a stack overflow.
        pool_inner.scope(|_| {
                let sum: i32 = (0..2000).into_par_iter().map(|i| i + 1).sum();
                if out_i == 1 {
                    println!("sum = {:?}, out_i = {}", sum, out_i);
                }
            })
        });
    })
}

Output:

thread 'outer-thread-0' has overflowed its stack
fatal runtime error: stack overflow
thread 'outer-thread-1' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

After debugging the core dump, it seems the error is caused by rayon::iter::plumbing::bridge_producer_consumer::helper being called recursively. However, this error only happens with nested ThreadPool::scope calls, and I can't reproduce it in rayon's global thread pool.

stable-x86_64-unknown-linux-gnu (default) rustc 1.42.0 (b8cedc004 2020-03-09)

cuviper commented 4 years ago

It would be nice if you included the code directly here in the issue for posterity, but I don't want to copy it here without permission. In summary:

I'm not surprised that this overflows the stack. The important point is that scope is a blocking call into the other pool, which is what lets you use non-'static lifetimes within. It could be borrowing locals, so we don't unwind the stack or anything like that. However, being a work stealing thread pool, "blocking" means we'll start looking for other work in the meantime. So we'll find another piece of that for_each to execute, leading to blocking on another cross-pool scope, then it goes back to work stealing again. Repeat until you run out of jobs or stack space.
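
To visualize the mechanism, here is a schematic of how one outer-pool thread's stack can grow under this pattern (an illustrative sketch, not a real backtrace):

outer-thread-0:
  pool_outer.scope              // blocks until all outer work completes
    for_each job A
      pool_inner.scope          // blocks; the thread work-steals while waiting
        for_each job B          // another piece of the outer iterator, stolen
          pool_inner.scope      // blocks and steals again
            for_each job C
              ...               // repeats until jobs or stack space run out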

A quick way to limit this would be to limit how many jobs we split your outer iterator into. By default, our adaptive splitting aims for about twice the number of threads, in case the workload is not balanced, and then increases the number of splits each time work is stolen. If you called with_min_len(x) with some x appropriate for your workload, you'd limit how many for_each jobs are available in total, limiting how deep that blocked-scope stealing can get.
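
For illustration, here is one way to apply with_min_len to the reproduction above (a sketch, not a tested fix; the divisor of 8 is an arbitrary choice for a 2-thread pool):

use rayon::prelude::*;

fn main() {
    let pool_outer = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();
    let pool_inner = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();

    pool_outer.scope(|_| {
        (0..90000000)
            .into_par_iter()
            // Each job now covers at least this many items, so the outer
            // iterator splits into at most ~8 jobs, bounding how deep the
            // blocked-scope stealing can nest.
            .with_min_len(90000000 / 8)
            .for_each(|out_i| {
                pool_inner.scope(|_| {
                    let sum: i32 = (0..2000).into_par_iter().map(|i| i + 1).sum();
                    if out_i == 1 {
                        println!("sum = {:?}, out_i = {}", sum, out_i);
                    }
                })
            });
    })
}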

You also have a comment that a direct rayon::scope doesn't have problems. That's because this will keep running on the same pool "context", executing directly, so those threads don't encounter the same kind of block that leads to work stealing.
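
In other words, the commented-out variant from the reproduction stays entirely within the current pool. A minimal sketch of that shape:

use rayon::prelude::*;

fn main() {
    let pool_outer = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .unwrap();

    pool_outer.scope(|_| {
        (0..90000000).into_par_iter().for_each(|_out_i| {
            // rayon::scope, called from a pool_outer worker thread, runs on
            // pool_outer itself, so the worker never blocks waiting on a
            // different pool.
            rayon::scope(|_| {
                let _sum: i32 = (0..2000).into_par_iter().map(|i| i + 1).sum();
            })
        });
    })
}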

The example feels very artificial, but I'll trust that it's reasonably representative of something you're actually doing. I hope with a better understanding of rayon's work stealing, you might find a better way to coordinate your cross-threadpool work.

rayon version: 1.2.0

I don't think it will make a difference to your problem, but please try with the most current version when reporting a bug. Rayon 1.3.0 was released in December.

jiacai2050 commented 4 years ago

@cuviper Thanks for your explanation. I have copied the example code into the issue. Although the reason for this error is clear, I think it's hard to predict in advance. The error arose when a colleague tried to parallelize some jobs in our codebase without knowing about the outer thread pool. I believe this can happen in any large project sooner or later.

A work-stealing thread pool is beneficial for CPU utilization, but it should not keep stealing until it reaches the stack limit; or at least, this issue should be documented in scope.

cuviper commented 4 years ago

The error arose when a colleague tried to parallelize some jobs in our codebase without knowing about the outer thread pool. I believe this can happen in any large project sooner or later.

I'm curious -- can you discuss why you're mixing multiple thread pools at all?

A work-stealing thread pool is beneficial for CPU utilization, but it should not keep stealing until it reaches the stack limit,

We don't really know the stack limit in general, nor can we predict whether the next work we call might use too much stack.

or at least, this issue should be documented in scope.

I'm open to better documentation, if we can figure out some good guidance here. Or maybe just a warning about this hazard is better than nothing. The problem can arise for any of the blocking calls -- install, join, scope, scope_fifo -- when calling from one pool to another.
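
For example, the same shape using install across two pools hits the identical hazard (a hypothetical sketch, not taken from the thread):

use rayon::prelude::*;

fn main() {
    let pool_a = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    let pool_b = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();

    pool_a.install(|| {
        (0..10000000).into_par_iter().for_each(|_| {
            // Each cross-pool install blocks this pool_a worker, which then
            // work-steals another for_each job and re-enters install(),
            // growing the stack just like the scope-based reproduction.
            pool_b.install(|| {
                let _sum: i32 = (0..100).into_par_iter().sum();
            });
        });
    })
}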

nazar-pc commented 11 months ago

I believe this is possible to hit even with a single thread pool.

I have an Arc<ThreadPool> shared between several components in the app, and at any given time there can be a few places doing something under thread_pool.install(). Most of the time it works correctly, but occasionally it results in a stack overflow.

bushuyev commented 6 months ago

I'm curious -- can you discuss why you're mixing multiple thread pools at all?

Not 100% sure, but it looks like I've hit a variant of this problem. In my case it was polars, which has its own thread pool, being invoked from the app's par_iter loop.