image-rs / jpeg-decoder

JPEG decoder written in Rust
Apache License 2.0

rayon feature causes deadlocks when used inside of an existing rayon threadpool #227

Closed. awused closed this issue 2 years ago.

awused commented 2 years ago

When loading a jpeg with the rayon feature enabled, the calling thread does no work of its own: it only blocks, waiting on mpsc messages, until the work is completed on other threads. If the decode is already running from inside a rayon threadpool, this ties up one of that pool's threads until the other threads can process the spawned tasks. If an application uses a rayon threadpool - even just the global default pool - to parallelize the loading of images, it's possible for enough jpegs to start loading at once to exhaust the pool and deadlock.

I believe refactoring it to use rayon::in_place_scope would work, but a more expedient option is probably to just have a dedicated rayon threadpool for decoding. If nothing else this should be documented and maybe made non-default.

This might apply to other image decoders as well, if any of them use rayon (I've not inspected their code), but jpegs are especially common.

HeroicKatora commented 2 years ago

If I understand correctly, the situation is as follows:

rayon-Pool
|-----|
|  …  |
| Jpg | --> { decode_internal } ----------\
|  …  |                   blocked on multithreaded.rs#L54
|  …  |                                    v
|-----|<--blocked-on free worker--| Component0 | Component1 | Component2 |

Is it possible to write a reliable test/reproduction for this? The best fix that I can think of would be to allow the current thread running decode_internal to complete each work item itself, as an alternative to blocking on recv. This would let rayon silently degrade to single-threaded use by having the Jpg worker complete the components itself. However, I'm not sure how to achieve this with the API. The current parallelization strategy is also somewhat crude in the first place.

HeroicKatora commented 2 years ago

Minimized reproduction:

    use std::fs::File;
    use jpeg_decoder::Decoder;

    // `path` is any JPEG file on disk.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();
    pool.install(|| {
        // decode() blocks on its rayon worker tasks, but the only thread in
        // this pool is the one blocking here, so nothing ever runs them.
        let mut decoder = Decoder::new(File::open(&path).unwrap());
        let _ = decoder.decode().unwrap();
    });

HeroicKatora commented 2 years ago

I don't see how to fix this without a massive rework of the code's structure. Basically, rayon only lets us group tasks (and deterministically block on them) via a lexical scope, and we do not have a lexical scope for components. Indeed, different components may finish at different times and via completely different methods. We might be able to turn the complete decode_internal into a single scope and communicate via some kind of out-parameter, but this may involve mutably sharing that out-parameter between different tasks, which is again difficult. And I don't see how this would be compatible with the std multithreading, which can't spawn a non-'static task to a separate thread.
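
For illustration only, a very rough sketch of what the scope-plus-out-parameter variant could look like, using rayon::in_place_scope, a Mutex around a slot-per-component vector as the shared out-parameter, and placeholder types standing in for the real component data:

    use std::sync::Mutex;

    // Placeholder types; the real per-component data lives in the decoder.
    struct ComponentData;
    struct ComponentResult;

    fn process(_c: &ComponentData) -> ComponentResult {
        ComponentResult
    }

    fn decode_components(components: &[ComponentData]) -> Vec<Option<ComponentResult>> {
        // Shared out-parameter, one slot per component.
        let results = Mutex::new((0..components.len()).map(|_| None).collect::<Vec<_>>());
        rayon::in_place_scope(|s| {
            for (i, component) in components.iter().enumerate() {
                let results = &results;
                // Tasks may borrow `components` and `results` because the
                // scope guarantees they finish before it returns.
                s.spawn(move |_| {
                    let out = process(component);
                    results.lock().unwrap()[i] = Some(out);
                });
            }
        });
        // in_place_scope does not return until every spawned task has run,
        // so this single join replaces the per-component mpsc receives.
        results.into_inner().unwrap()
    }

This glosses over how decode_internal interleaves work creation with decoding, which is the real difficulty described above.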

awused commented 2 years ago

I think keeping a reasonably small dedicated threadpool around would be acceptable. The overhead should be minimal, but since it does add permanent overhead it probably makes sense to make it non-default. There are also scoped threadpool crates, stable and battle-tested, that could be used for the non-rayon path.

If adding any dependencies or keeping idle threadpools around indefinitely isn't viable then https://github.com/rust-lang/rust/issues/93203 seems to be making progress towards stabilization. As an outsider I think it'd be reasonable to disable the rayon feature entirely - to avoid the case where a transitive dependency on image re-enables jpeg_rayon - until that lands and the decoding can be refactored.

HeroicKatora commented 2 years ago

As an outsider I think it'd be reasonable to disable the rayon feature entirely - to avoid the case where a transitive dependency on image re-enables jpeg_rayon - until that lands and the decoding can be refactored.

Well, no. The linked issue and design is not a replacement for rayon. What rayon provides is the ability to create a thread pool and spawn jobs onto work-stealing threads, removing some overhead from potentially costly sys-calls (clone) and providing semi-automatic workload balancing. This is not the goal of rust-lang/rust#93203. That issue is only concerned with allowing non-'static closures to be used with the std::thread implementation (which is neither shared between consumers nor work-stealing), and that is not a problem we need solved here.

This is also a reason why 'creating our own dedicated threadpool' is technically questionable: it throws half of rayon's features and use-case out of the window. But, alas, better a correct program than an incorrect but fast one.

awused commented 2 years ago

Well, no. The linked issue and design is not a replacement for rayon.

I did not imply that it was, and frankly I'm not sure how you misread it that way. That was in direct response to you saying "And I don't see how this would be compatible with the std multithreading, which can't spawn a non-'static task to a separate thread", because it allows non-'static tasks to be used with std multithreading.

This is also a reason why 'creating our own dedicated threadpool' is technically questionable.

I said nothing about writing a threadpool implementation from scratch. Rayon provides the ability to create and manage threadpools, which you even did in your own minimized reproduction of the issue in https://github.com/image-rs/jpeg-decoder/issues/227#issuecomment-1047106586.

awused commented 2 years ago

To spell it out in no uncertain terms, I see a few options, all of which are reasonable to me as an outsider.

  1. Keep a dedicated rayon threadpool for use with the rayon feature (a sketch follows below).
    • When called from the context of an existing rayon threadpool, all the real work will be done in that dedicated threadpool, leaving the calling thread free to block.
    • This requires the least work and is compatible with the existing std implementation.
    • Alternatively you could construct a new rayon threadpool per call, but that rubs me the wrong way.
  2. Disable the rayon feature entirely, then when https://github.com/rust-lang/rust/issues/93203 lands, migrate the std multi-threading code to use scope.
    • This keeps the rayon version as efficient as it can be (no second threadpool).
    • This requires refactoring both code paths and disabling the rayon feature for an uncertain length of time.
  3. Use one of the existing scoped threadpool implementations in place of the std multi-threading code, then refactor the rayon version to not block the calling thread.
    • This adds an additional dependency and makes the "std" code not really fully "std", while still requiring all the work to migrate both versions.

Of course those were only the simplest, easiest options I could think of and the ones I directly alluded to in my previous comments.
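
For what it's worth, a minimal sketch of option 1, assuming a lazily initialized pool behind once_cell (the pool size, the names, and the helper are illustrative, not existing jpeg-decoder API, and this sets aside the pool-sizing concerns discussed later in the thread):

    use once_cell::sync::Lazy;
    use rayon::{ThreadPool, ThreadPoolBuilder};

    // A small pool owned by the decoder crate. Decoding work is always spawned
    // here, so a decode call arriving from some other rayon pool only blocks
    // the thread it was called on while these dedicated workers make progress.
    static DECODER_POOL: Lazy<ThreadPool> = Lazy::new(|| {
        ThreadPoolBuilder::new()
            .num_threads(4) // illustrative; would need tuning
            .thread_name(|i| format!("jpeg-decode-{i}"))
            .build()
            .expect("failed to build dedicated jpeg-decoder pool")
    });

    // Hypothetical helper: run a closure on the dedicated pool.
    fn on_decoder_pool<R: Send>(work: impl FnOnce() -> R + Send) -> R {
        DECODER_POOL.install(work)
    }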

HeroicKatora commented 2 years ago

Please take a look at PR #230 regarding a dedicated rayon threadpool. This is the (somewhat wrong, as we agree) threadpool-per-call variant, because the existing structure allowed for this more easily than other models. I should also say that 'threadpool-per-library' rubs me the wrong way as well, since it can not be initialized effectively when there are different consumers with different needs. The std multithreading spawns (up to) 4 threads per image, while a lazy_static! threadpool (or similar) would be some static number of threads per library. Maybe we should make this an explicit parameter during construction of the decoder? (E.g. a setter that accepts some struct that internally holds an Arc<ThreadPool>.)
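
Purely as a sketch of that last idea (every name here is hypothetical; nothing like this exists in the crate today):

    use std::sync::Arc;
    use rayon::ThreadPool;

    /// Hypothetical wrapper handed to the decoder by the consumer, so several
    /// decoders can share one pool whose size the consumer controls.
    #[derive(Clone)]
    pub struct SharedWorkerPool {
        pool: Arc<ThreadPool>,
    }

    impl SharedWorkerPool {
        pub fn new(pool: Arc<ThreadPool>) -> Self {
            Self { pool }
        }
    }

    /// Hypothetical decoder-side state; the real Decoder would hold this field.
    pub struct Decoder {
        worker_pool: Option<SharedWorkerPool>,
        // ... the rest of the decoder state ...
    }

    impl Decoder {
        /// Hypothetical setter: use the given pool for multithreaded decoding
        /// instead of spawning threads (or tasks) of our own.
        pub fn with_worker_pool(&mut self, pool: SharedWorkerPool) {
            self.worker_pool = Some(pool);
        }
    }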

We have no need to migrate the std multithreading to anything else. It's not the inability to use scoped threads that kept us from choosing a different work structure than the current one; it's the coupling between decoding and actual work creation that did. rayon does not have join handles, which makes it very hard to create tasks that depend on the results of previous tasks. To achieve the flexibility in component treatment that the specification requires, we must either be able to do that, or we introduce multiple 'global' join points, which again loses some of the benefits of multithreading. No such thing as a free lunch.

awused commented 2 years ago

I have no strong opinions on how you architect your solution; I only offered my suggestions when it seemed like you were considering a massive rework and specifically called out not being able to spawn non-'static tasks in std. I've not looked deeply into the code after determining the root cause of my deadlocks.

fintelia commented 2 years ago

Rayon lets you detect whether you're running within a threadpool via rayon::current_thread_index().is_some(). As a quick workaround, would it be better for now to just fall back to our single-threaded decode implementation if we detect that?
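
A sketch of the dispatch I have in mind; decode_singlethreaded and decode_with_rayon are stand-ins for the crate's internal code paths:

    fn decode_dispatch() {
        if rayon::current_thread_index().is_some() {
            // Already on a rayon worker thread: blocking on tasks scheduled in
            // the same pool could deadlock, so do everything on this thread.
            decode_singlethreaded();
        } else {
            // Not inside a pool: it is fine to hand work to rayon and block
            // this (non-worker) thread until the results come back.
            decode_with_rayon();
        }
    }

    fn decode_singlethreaded() { /* stand-in for the single-threaded path */ }
    fn decode_with_rayon() { /* stand-in for the rayon path */ }

The question, of course, is whether the else branch is safe without further guarantees.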

HeroicKatora commented 2 years ago

specifically called out not being able to spawn non-'static tasks in std.

I can see that could be a source of confusion but the emphasis there should have been on compatible. Indeed, the std version works, so it should be the one that stays in place; don't touch the running system. The open question was rather how the scope-based rayon version can be re-architected so that this is feasible and compatible with the existing implementation that we don't change.

Rayon lets you detect whether you're running within a threadpool via rayon::current_thread_index().is_some(). As a quick workaround, would it be better for now to just fall back to our single-threaded decode implementation if we detect that?

That's insufficient. We need to detect if the number of threads available is larger than the number of blocking tasks (so, larger than the number of concurrent decode calls). This is completely separate from any concern about running in the global or a local thread pool. Afaik, we can not get a reference to the enclosing ThreadPool in either case even if we are in a worker thread.

Given the above, my preferred solution would be to have the code that creates the thread pool live inside jpeg-decoder, where we can control (or monitor) the number of threads available and in use. Then we only need to make it possible to share that instance among a set of decoders small enough to fulfill the above condition.

fintelia commented 2 years ago

That's insufficient. We need to detect if the number of threads available is larger than the number of blocking tasks (so, larger than the number of concurrent decode calls).

When called from a rayon threadpool, we need to ensure that if we do any blocking operations from within decode there are more threads available than other blocking tasks. My proposal is that decode should never do any blocking operations when called from a threadpool. This sidesteps the question of figuring out how many threads there are available.

awused commented 2 years ago

I can see that could be a source of confusion but the emphasis there should have been on compatible.

There was no confusion, not on my part anyway. I said the decoding could be refactored once that landed, if that was the blocker on a large refactoring, so that the std multi-threading code and the rayon code could both remain compatible with each other. I think you just misread my comment entirely. Maybe you didn't like the suggestion, okay, but do not say I was confused for even making it.

That's insufficient. We need to detect if the number of threads available is larger than the number of blocking tasks (so, larger than the number of concurrent decode calls). This is completely separate from any concern about running in the global or a local thread pool. Afaik, we can not get a reference to the enclosing ThreadPool in either case even if we are in a worker thread.

You have misunderstood the suggestion. It's to change the implementation used, at run time, to the std multi-threading version if you detect you're already in a rayon threadpool. The number of other available threads in an existing rayon threadpool doesn't matter if you're running the std multi-threading code, which spawns its own threads, and it is fine to block one thread in a rayon threadpool (the one you've been "given" by the caller) as long as you're not awaiting responses from other tasks on the same pool. You can also fall back to spawning an additional rayon threadpool only when called from an existing rayon threadpool.

fintelia commented 2 years ago

Moderation note: Let's keep this focused on figuring out a technical solution. There is no need to argue over who is confused, or who misunderstood whom, or whatever.

HeroicKatora commented 2 years ago

if you detect you're already in a rayon threadpool

One will always be spawning tasks into a threadpool, albeit this may be the global one. Thus one might always deadlock in the same way if too many threads/decoding tasks are running (as those block waiting on results). The exact ThreadPool context has no further bearing on correctness; we have even less control over the global one than over a locally installed one. With the current task structure we can never spawn tasks into a rayon threadpool we didn't create ourselves (and I'm somewhat surprised anyone is able to do complex tasks at all with the library, given the lack of an explicit, by-value JoinHandle).

awused commented 2 years ago

One will always be spawning tasks into a threadpool, albeit this may be the global one. Thus one might always deadlock in the same way if too many threads/decoding tasks are running.

The std multi-threading code does not spawn rayon tasks. So this is irrelevant.

At this point I think I'll just not comment on this thread any more if my comments are not going to be read and if I'm not able to clear up misunderstandings.

(and I'm somewhat surprised anyone is able to do complex tasks at all with the library given the lack of an explicit, by-value JoinHandle)

Most work in rayon is done with parallel iterators, which this crate didn't use so that it could stay compatible with std threads. Rayon is optimized for non-I/O-bound work, so dependencies between tasks aren't really something it caters to. Taking a very brief look at the code where workers are started and get_result() is called, it seems like it wouldn't be terribly hard to convert it to a parallel iterator and fall back to std multi-threading with scope(), but I will not look in more depth.

HeroicKatora commented 2 years ago

My proposal is that decode should never do any blocking operations when called from a threadpool.

Without a decently large refactor, our multithreaded task system always blocks at some point; the single-threaded one does not. This involves workers/tasks internally. As those tasks would always run on a threadpool no matter what we do, implementing this suggestion effectively means 'disable the use of rayon'. At least we can keep the std multithreading unchanged this way. The above code with an explicit thread pool scope was simply the easiest reproduction that offered enough control to reproduce deterministically. The problem also occurs when we are using the global thread pool; it's just more hidden, not deterministic, and requires multiple decode calls in parallel.

Given the current assumed situation of our internal architecture, the potential approaches are:

  1. Disable rayon. As assumed above, this puts the library into a good state.
  2. Use a new rayon ThreadPool for the decoder: simply spawn more than one thread. As decode(&mut self) can only be called once at a time per decoder, this guarantees the requirement needed to avoid the deadlock.
  3. Share our internally created rayon ThreadPool with a limited number of decoders. This also meets the requirement but is more complex. It avoids creating new workers for each Decoder, though, which is one of the main technical advantages over the std multithreading.
  4. Rewrite the task system to remove the blocking code (a rough sketch of this shape follows after the non-solutions list below). Since rayon requires lexical scoping to get results, this would mean identifying serialization points at which all component results can be collected at the same time and then fanned out again into new tasks to run the remaining compute_image computation. This is potentially less concurrent than the existing version and requires some rearchitecting. In any case, the existing std multithreading can likely also cope with this solution, using the result retrieval it already implements but with the new order of computation.

Non-Solutions:

  1. Detect the rayon context. As we can not introspect, nor control, the surrounding ThreadPool that will be in use (no matter if global or local) this can not provide the guarantee we need for forward progress.
  2. Use a dedicated, but global jpeg-decoder ThreadPool. Again, this offers no control over the number of concurrent decodings that may happen. Since rayon can not rescale ThreadPools at runtime, we can't implement the required guarantee identified above.
  3. Changing, replacing or removing std multithreading.
  4. std 'scoped' tasks (rust-lang/rust#93203), as std works fine in either the current situation or with lexical scoping. They might prove to have some small runtime advantage in the future though, due to having less communication latency (no explicit mpsc channel for the result).
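
As a rough illustration of the fan-out/join/fan-out shape of approach 4 (placeholder types and functions, not our internals):

    use rayon::prelude::*;

    struct Component;
    struct ComponentOutput;
    struct Row;

    fn decode_component(_c: &Component) -> ComponentOutput {
        ComponentOutput
    }

    fn compute_row(_outputs: &[ComponentOutput], _row: usize) -> Row {
        Row
    }

    fn decode(components: &[Component], height: usize) -> Vec<Row> {
        // Fan out: one task per component. collect() is the serialization
        // point; the calling thread helps execute these tasks while waiting.
        let outputs: Vec<ComponentOutput> =
            components.par_iter().map(decode_component).collect();

        // Fan out again over the rows for the remaining compute_image-style work.
        (0..height)
            .into_par_iter()
            .map(|row| compute_row(&outputs, row))
            .collect()
    }
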
awused commented 2 years ago

Detect the rayon context. As we can not introspect, nor control, the surrounding ThreadPool that will be in use (no matter if global or local) this can not provide the guarantee we need for forward progress.

If called from an existing rayon threadpool, choosing to fall back to the single-threaded implementation or the std multi-threading code is enough to avoid deadlocks and guarantee forward progress. You can always block the thread you have been called on, that is not the problem. The problem only exists when blocking on rayon tasks from within the same threadpool, so if no rayon tasks are spawned then there is no deadlock. This does mean choosing the code path at run time, not at compile time based on feature flags, but the overhead would be negligible since rayon::spawn, itself, checks if it's being called from an existing threadpool using the same mechanism.

HeroicKatora commented 2 years ago

This does mean choosing the code path at run time

Under which conditions, do you suppose, will we ever dynamically choose the rayon-utilizing code path? The whole point is that we can never safely do so. We will not be interacting with rayon at all in this case, no matter what kinds of feature flags are provided (which we will keep only for compatibility with downstream includes), nor would we actually have to declare it as a dependency.

fintelia commented 2 years ago

Under which conditions, do you suppose, will we ever dynamically choose the rayon-utilizing code path?

If the rayon feature is enabled and rayon::current_thread_index().is_none().

HeroicKatora commented 2 years ago

No, that's not enough. Any rayon-created tasks will then initialize and spawn on the implicit global thread pool instead, which has the exact same problem as any local one: it has a constant number of threads (based on num-cpus or an environment variable, afaik). Again, the local pool in the reproduction is sufficient but not necessary; it just turns out to give a deterministic reproduction (which we can actually turn into a regression test).

awused commented 2 years ago

Under which conditions, do you suppose, will we ever dynamically choose the rayon-utilizing code path? The whole point is that we can never safely do so.

Whenever rayon::current_thread_index().is_some() returns false, it is safe to do so, or at least as safe as using rayon ever will be if you're not managing your own pools. Local or global threadpools don't matter; you're always free to block the thread you're called from.

Poking into the code a bit more, it seems it assumes rayon::spawn is analogous to thread::spawn, and spawns long-running tasks that await more work on channels, so there may be an additional requirement on the minimum number of threads in the pool dedicated to each decoder. If I'm reading this code right, using the global rayon threadpool was never safe, and using a shared rayon threadpool between multiple decode calls will also be unsafe. This pattern would be better served by tokio or another async executor rather than rayon.

Again, the local pool in the reproduction is sufficient but not necessary; it just turns out to give a deterministic reproduction (which we can actually turn into a regression test).

I can see two other simple tests: use build_global to change the global threadpool, and then call decode from both inside and outside that pool.
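
A sketch of how that could look, folding both cases into one regression test, with decode_any_jpeg standing in for opening and decoding some fixture file:

    #[test]
    fn decode_does_not_deadlock_in_rayon_pools() {
        // Shrink the global pool so exhaustion is easy to hit deterministically.
        rayon::ThreadPoolBuilder::new()
            .num_threads(1)
            .build_global()
            .unwrap();

        // Inside the global pool: the closure runs on one of its worker threads.
        rayon::scope(|_| decode_any_jpeg());

        // Outside the pool: plain call from the test thread.
        decode_any_jpeg();
    }

    fn decode_any_jpeg() {
        // Stand-in: open a fixture JPEG and call Decoder::decode() on it.
    }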

fintelia commented 2 years ago

Ah, the problem is specifically that the spawned tasks also block: https://github.com/image-rs/jpeg-decoder/blob/222c2646151231ceea0332c50aab850c885f6033/src/worker/multithreaded.rs#L64