rayon-rs / rayon

Rayon: A data parallelism library for Rust

Allow using parallel iterators on a specific thread pool without install #598

Open · nical opened 6 years ago

nical commented 6 years ago

As far as I understand, the current way to use parallel iterators with a dedicated thread pool is to use ThreadPool::install and run the iterator inside the provided callback, or to use ThreadPool::spawn, which doesn't block the calling thread. Either way, no work starts before a worker is available.
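For reference, here's a minimal sketch of those two existing entry points using rayon's public API (the pool size and workloads are just placeholders):

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    // A dedicated pool, separate from the global one.
    let pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();

    // install(): the calling thread blocks until a worker runs the
    // closure and the parallel iterator inside it completes.
    let sum: i32 = pool.install(|| (0..1_000).into_par_iter().sum());

    // spawn(): does not block the caller, but the closure must be
    // 'static and there is no direct return value.
    pool.spawn(|| {
        let _squares: Vec<i32> = (0..1_000).into_par_iter().map(|i| i * i).collect();
    });

    println!("{}", sum);
}
```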

I haven't found a way to get the parallel iterator to start work on the calling thread and let workers of a specific thread pool help out through work-stealing. The problem we run into with install is that sometimes the workers are already busy with very heavy workloads, and in our situation, letting the caller do all of the work is much better than waiting for one of the workers to pick up the installed callback.

An API similar to install that would override the global thread pool within a callback, while still running the callback on the calling thread, would be useful in this scenario (if feasible).

Edit: Actually, I might have misunderstood how rayon works, and it might be that there is no way for a non-worker thread to start working and have work stolen from it, regardless of which thread pool it interacts with.

bholley commented 6 years ago

@nikomatsakis I think we touched on this topic at some point last year. Would it be feasible for rayon to effectively treat the calling thread as part of the pool? That would avoid the mandatory context switch in a lot of cases.

nikomatsakis commented 6 years ago

Hmm. It's not really obvious how to implement this. @cuviper and I sketched out a few possible strategies but didn't find anything too satisfying. Here are some notes:

If you were to enqueue work in the global work queue and then went on and did it yourself, we would need to figure out how to remove that work again. We can't directly remove things from the queue because of the limitations of non-blocking data structures. If you allocate an Arc or something, you could do some atomic operations to remove the work while leaving a "thunk" behind for others to encounter. But that pushes an allocation onto the hot path.
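A minimal sketch of that Arc-plus-thunk idea (the type and claiming protocol here are illustrative assumptions, not rayon internals):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// One job shared (via Arc) between the global queue and the caller.
/// Whoever wins the compare-and-swap runs it; the losing side finds an
/// inert thunk. The Arc is exactly the extra allocation mentioned above.
struct ClaimableJob {
    claimed: AtomicBool,
    work: Mutex<Option<Box<dyn FnOnce() + Send>>>,
}

impl ClaimableJob {
    fn new(f: impl FnOnce() + Send + 'static) -> Arc<Self> {
        Arc::new(ClaimableJob {
            claimed: AtomicBool::new(false),
            work: Mutex::new(Some(Box::new(f))),
        })
    }

    /// Called both by the worker that dequeues the job and by the caller
    /// that decided to run it directly; only one of them executes `work`.
    fn run_if_unclaimed(&self) {
        if self
            .claimed
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            if let Some(f) = self.work.lock().unwrap().take() {
                f();
            }
        }
    }
}

fn main() {
    let job = ClaimableJob::new(|| println!("ran once"));
    let queued = Arc::clone(&job); // the copy left in the global queue
    job.run_if_unclaimed();        // the caller runs it directly
    queued.run_if_unclaimed();     // a worker later finds an inert thunk
}
```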

On the other hand, you might imagine that we have a notion of how busy the threads are and, if they seem "too busy", that we refuse to enqueue the work item in the first place. Then the caller could go off, do some of the work itself, and come back and try to enqueue again later. It's a decent idea, but it's not obvious how to implement the heuristic, and it probably has some nasty performance cliffs too.

A variant on the above idea is to try and consider the overall size of the problem as well, so that we execute sequentially more frequently.
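The shape of that heuristic might look like the following (a sketch; the idle-worker counter and the cutoff are assumptions, not existing rayon state):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical pool-wide counter of idle workers.
struct PoolStats {
    idle_workers: AtomicUsize,
}

impl PoolStats {
    /// Refuse to enqueue when nobody is idle; the caller then executes a
    /// chunk of the problem itself and retries later. Tuning this cutoff
    /// (and factoring in problem size) is where the nasty cliffs hide.
    fn should_enqueue(&self) -> bool {
        self.idle_workers.load(Ordering::Relaxed) > 0
    }
}

fn main() {
    let stats = PoolStats { idle_workers: AtomicUsize::new(0) };
    assert!(!stats.should_enqueue()); // all busy: the caller keeps the work
}
```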

Or we might frame the problem differently. For example, you could imagine having more pools (for the various cases) and viewing this as a balancing problem between pools, or perhaps using priorities.

bholley commented 6 years ago

@nikomatsakis I'm not totally sure I follow. IIUC, the basic principle of work stealing is that you put the work in your own queue and start processing it, and workers come and steal it if they're idle. Why can't the calling thread opt into having its own queue that the workers could steal from, just like workers steal from each other?
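That principle is easy to express with crossbeam-deque, which rayon's internals build on; what's missing is the registration machinery discussed in the next comment. A sketch of the idea, not an existing rayon API:

```rust
use crossbeam_deque::{Steal, Worker};

fn main() {
    // The caller owns a deque and pushes its own jobs into it.
    let local: Worker<u32> = Worker::new_fifo();
    let stealer = local.stealer();

    local.push(1);
    local.push(2);

    // An idle pool worker holding the Stealer handle could take jobs
    // from the far end, just as workers steal from each other...
    if let Steal::Success(job) = stealer.steal() {
        println!("worker stole job {}", job);
    }

    // ...while the caller keeps popping from its own end.
    while let Some(job) = local.pop() {
        println!("caller ran job {}", job);
    }
}
```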

cuviper commented 6 years ago

@bholley For one, that requires a way to tell the threads about an additional queue they should steal from, as well as a way to revoke access to that queue before we can return from the caller's scope. That's not impossible, but it might require a fair amount of synchronization.

bholley commented 6 years ago

A naive proposal would be: allocate one additional slot (so N+1 slots for N workers) for the caller's queue, which is generally None. When invoking via rayon, the caller does an atomic compare-and-swap to insert its queue into the pool. If the swap fails (because a caller on another thread is also using the pool), we fall back to the current behavior. If it succeeds, the caller pushes the work into its own queue instead and starts executing.
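A sketch of what that extra slot might look like (types and names are assumptions; crossbeam-deque stands in for rayon's internal deques):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

use crossbeam_deque::{Stealer, Worker};

type Job = Box<dyn FnOnce() + Send>;

/// The hypothetical N+1-th slot: empty unless some caller has claimed it.
struct CallerSlot {
    occupied: AtomicBool,
    stealer: Mutex<Option<Stealer<Job>>>,
}

impl CallerSlot {
    /// Compare-and-swap to claim the slot. On success, workers can steal
    /// from the caller's deque; on failure (another caller got there
    /// first) we fall back to today's install() behavior.
    fn try_claim(&self, local: &Worker<Job>) -> bool {
        if self
            .occupied
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
        {
            *self.stealer.lock().unwrap() = Some(local.stealer());
            true
        } else {
            false
        }
    }

    /// Must run before returning from the caller's scope, per the
    /// synchronization concern raised above.
    fn release(&self) {
        *self.stealer.lock().unwrap() = None;
        self.occupied.store(false, Ordering::Release);
    }
}

fn main() {
    let slot = CallerSlot {
        occupied: AtomicBool::new(false),
        stealer: Mutex::new(None),
    };
    let local: Worker<Job> = Worker::new_fifo();
    assert!(slot.try_claim(&local));
    assert!(!slot.try_claim(&local)); // a second caller falls back
    slot.release();
}
```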

nical commented 6 years ago

If dynamically inserting queues imposes a cost on the overall system, an alternative scheme would be a fixed number of caller queues, decided at initialization time, that would be shared between any number of callers. Jobs would be inserted with a blocking implementation, in the hope that contention on the insertion lock will be rare and, at worst, roughly as expensive as blocking the caller, as currently happens when installing jobs.
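A sketch of that fixed-slot variant (hypothetical types; the slot count and the hashing of callers onto slots are assumptions):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

type Job = Box<dyn FnOnce() + Send>;

/// K caller queues, fixed at pool initialization and shared by any
/// number of callers behind per-queue locks.
struct CallerQueues {
    slots: Vec<Mutex<VecDeque<Job>>>,
}

impl CallerQueues {
    fn new(k: usize) -> Self {
        CallerQueues {
            slots: (0..k).map(|_| Mutex::new(VecDeque::new())).collect(),
        }
    }

    /// Callers hash onto a slot and take its lock to push. The hope is
    /// that contention stays rare and, at worst, costs about as much as
    /// blocking on install() does today.
    fn push(&self, caller_id: usize, job: Job) {
        self.slots[caller_id % self.slots.len()]
            .lock()
            .unwrap()
            .push_back(job);
    }
}

fn main() {
    let queues = CallerQueues::new(2);
    queues.push(7, Box::new(|| println!("queued from caller 7")));
}
```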