jiacai2050 opened 4 years ago
It would be nice if you included the code directly here in the issue for posterity, but I don't want to copy it here without permission. In summary: a `for_each` on millions of items, with a `scope` in the other pool, and a `sum` inside.
I'm not surprised that this overflows the stack. The important point is that `scope` is a blocking call into the other pool, which is what lets you use non-`'static` lifetimes within. It could be borrowing locals, so we don't unwind the stack or anything like that. However, being a work-stealing thread pool, "blocking" means we'll start looking for other work in the meantime. So we'll find another piece of that `for_each` to execute, leading to blocking on another cross-pool `scope`, then it goes back to work stealing again. Repeat until you run out of jobs or stack space.
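To make that shape concrete, here is a minimal sketch of the pattern being described, assuming two explicitly built pools; the pool names, sizes, and item count are illustrative, not the reporter's actual code:

```rust
use rayon::prelude::*;

fn main() {
    // Two independent pools; sizes are arbitrary for illustration.
    let outer = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();
    let inner = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();

    outer.install(|| {
        (0..1_000_000).into_par_iter().for_each(|i| {
            // Blocking call into the *other* pool: while this outer worker
            // waits, it may steal another `for_each` item and block on yet
            // another cross-pool `scope`, stacking frames until it runs out
            // of jobs or stack space.
            inner.scope(|_| {
                let _ = i * 2; // stand-in for real work
            });
        });
    });
}
```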
A quick way to limit this would be to limit how many jobs we split your outer iterator into. By default, our adaptive splitting will aim for about twice the number of threads, in case the workload is not balanced, and then increase the number of splits each time work is stolen. If you called `with_min_len(x)` with some `x` appropriate for your workload, you'd limit how many `for_each` jobs are available in total, limiting how deep that blocked-scope stealing can get.
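A sketch of that mitigation, continuing the hypothetical example above; the `10_000` threshold is an assumption to tune per workload:

```rust
use rayon::prelude::*;

fn main() {
    let inner = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();

    // with_min_len(10_000) caps the outer iterator at roughly 100 splits,
    // so a blocked worker can only pick up a bounded number of extra
    // `for_each` pieces before it simply has to wait.
    (0..1_000_000)
        .into_par_iter()
        .with_min_len(10_000)
        .for_each(|i| {
            inner.scope(|_| {
                let _ = i * 2; // stand-in for real work
            });
        });
}
```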
You also have a comment that a direct `rayon::scope` doesn't have problems. That's because this will keep running on the same pool "context", executing directly, so those threads don't encounter the same kind of block that leads to work stealing.
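By contrast, a sketch of the single-pool version, where the nested `scope` stays in the same pool and executes directly (again illustrative, not the original code):

```rust
use rayon::prelude::*;

fn main() {
    // Everything runs in rayon's global pool, so the nested scope runs in
    // the same "context" instead of blocking on a different pool.
    (0..1_000_000).into_par_iter().for_each(|i| {
        rayon::scope(|_| {
            let _ = i * 2; // stand-in for real work
        });
    });
}
```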
The example feels very artificial, but I'll trust that it's reasonably representative of something you're actually doing. I hope with a better understanding of rayon's work stealing, you might find a better way to coordinate your cross-threadpool work.
> rayon version: 1.2.0
I don't think it will make a difference to your problem, but please try with the most current version when reporting a bug. Rayon 1.3.0 was released in December.
@cuviper Thanks for your explanation. I have copied the example code here. Although the reason for this error is clear, I think it's hard to predict in advance. This error arose when my colleague tried to parallelize some jobs in our codebase without being aware of the outer thread pool. I believe this can happen in any large project sooner or later.

A work-stealing thread pool is beneficial for CPU utilization, but it shouldn't keep stealing until it reaches the stack limit, or at least this issue should be documented for `scope`.
> This error arose when my colleague tried to parallelize some jobs in our codebase without being aware of the outer thread pool. I believe this can happen in any large project sooner or later.
I'm curious -- can you discuss why you're mixing multiple thread pools at all?
> A work-stealing thread pool is beneficial for CPU utilization, but it shouldn't keep stealing until it reaches the stack limit,
We don't really know the stack limit in general, nor can we predict whether the next work we call might use too much stack.
> or at least this issue should be documented for `scope`.
I'm open to better documentation, if we can figure out some good guidance here. Or maybe just a warning of this hazard is better than nothing. The problem can arise for any of the blocking calls -- `install`, `join`, `scope`, `scope_fifo` -- when calling from one pool to another.
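For example, a cross-pool `install` has the same blocking behavior as the cross-pool `scope` above (a sketch, with hypothetical pool names):

```rust
fn main() {
    let pool_a = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();
    let pool_b = rayon::ThreadPoolBuilder::new().num_threads(4).build().unwrap();

    pool_a.install(|| {
        // This blocks a pool_a worker until pool_b finishes the closure;
        // while blocked, that worker may steal more pool_a jobs, which can
        // nest further cross-pool calls on the same stack.
        pool_b.install(|| {
            // work that runs in pool_b
        });
    });
}
```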
I believe this is possible to hit even with a single thread pool. I have an `Arc<ThreadPool>` shared in the app between several components, and there can be a few places at a time doing something under `thread_pool.install()`. Most of the time it works correctly, but occasionally it results in a stack overflow.
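A sketch of that sharing shape, with the components modeled as plain threads here (names and structure are hypothetical):

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let pool = Arc::new(
        rayon::ThreadPoolBuilder::new()
            .num_threads(4)
            .build()
            .unwrap(),
    );

    // Several "components" share the same pool and install work into it
    // concurrently; the overflow described above would come from blocking
    // calls nested inside this shared pool's work.
    let handles: Vec<_> = (0..3)
        .map(|_| {
            let pool = Arc::clone(&pool);
            thread::spawn(move || {
                pool.install(|| {
                    // component-specific parallel work goes here
                });
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
}
```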
> I'm curious -- can you discuss why you're mixing multiple thread pools at all?
Not 100% sure, but it looks like I've got a variant of this problem. In my case it was polars, which has its own thread pool, being invoked from the app's `par_iter` loop.
The code below can reproduce this error.

output

After debugging the core dump, it seems the error is caused by calling `rayon::iter::plumbing::bridge_producer_consumer::helper` recursively, but this error only happens with a nested `ThreadPool.scope`, and can't be reproduced in rayon's global thread pool.

stable-x86_64-unknown-linux-gnu (default)
rustc 1.42.0 (b8cedc004 2020-03-09)