Standardised API for sharing thread pools

Wodann commented 4 years ago

In working group meeting #67, @kabergstrom mentioned that several crates that use thread pools (e.g. Rayon, Tokio) rely on the OS to handle time slicing, and as such are at risk of falling outside of the Rust game ecosystem. More concretely, a solution would let the user control how executor work is multiplexed onto OS threads.

To resolve this issue, they proposed designing a standardised API for sharing thread pools in the spirit of raw-window-handle.

There is a Reddit discussion in which we are gauging interest.

Wodann commented 4 years ago

Would SpawnExt cover our needs?

Wodann commented 4 years ago

@repi If the information was relayed correctly, having crates implement a trait like this would solve your issues with crates that implement their own executor. Do you have any concerns that are not covered by the proposed trait?

msiglreith commented 4 years ago

As the gamedev WG, it would be great to also discuss the needs specific to engines and games. In particular, how to handle priorities of certain tasks and potential pinning to specific threads, which would not be covered by the current SpawnExt trait IMO.

kabergstrom commented 4 years ago

I have a rough proposal for an API. The idea is to provide an API that lets the user be in control of when each executor crate runs work, to provide some control around time budgets and to let the user be in control of how executors are multiplexed on OS threads.

use std::task::Waker;
use std::time::Duration;

/// This is implemented by tokio/rayon/async-std for their executors.
/// The user builds as many workers as desired, and places them onto threads as desired.
trait WorkerBuilder {
    /// Builds a worker that can notify the user of available work by using the provided Waker.
    fn build_worker(waker: Waker) -> Box<dyn Worker>;
}
/// Implemented by tokio/rayon/async-std. This is the executor itself, which polls futures or runs queued tasks.
trait Worker {
    /// Polls the worker, doing work if available.
    /// The time_budget argument indicates the caller's desire for the executor to finish within the duration.
    /// May return a Duration that indicates the worker's desire to be polled again once that duration expires.
    fn poll(&mut self, time_budget: Option<Duration>) -> Option<Duration>;
}

... create Workers and spawn worker_threads...

usage:
// `Parker` is assumed to be a thread-parking primitive such as crossbeam's `sync::Parker`.
fn worker_thread(parker: Parker, mut workers: Vec<Box<dyn Worker>>, worker_time_budget: Duration) {
    loop {
        // in more complex cases, you may want to prioritize work as Wodann said,
        // and only poll the Workers that are most important, for example frame job workers.
        for worker in workers.iter_mut() {
            // should keep track of each worker's wakeup timeout desire and wake as appropriate
            let _next_wakeup = worker.poll(Some(worker_time_budget));
        }
        // The Wakers provided to Workers would unpark this Parker
        parker.park();
    }
}
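To make the time-budget contract concrete, here is a minimal std-only sketch of one way a crate might implement the proposed `Worker` trait. `QueueWorker`, `push` and `pending` are hypothetical names, not part of any existing crate, and the `Waker` notification path is left out for brevity. The budget is only checked between tasks, so a long-running task can still overrun it.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Mirrors the `Worker` trait from the proposal above.
trait Worker {
    fn poll(&mut self, time_budget: Option<Duration>) -> Option<Duration>;
}

/// Hypothetical executor: a FIFO queue of boxed closures.
struct QueueWorker {
    tasks: VecDeque<Box<dyn FnOnce() + Send>>,
}

impl QueueWorker {
    fn new() -> Self {
        QueueWorker { tasks: VecDeque::new() }
    }

    /// Queues a task to be run by a later call to `poll`.
    fn push(&mut self, task: impl FnOnce() + Send + 'static) {
        self.tasks.push_back(Box::new(task));
    }

    /// Number of tasks still waiting to run.
    fn pending(&self) -> usize {
        self.tasks.len()
    }
}

impl Worker for QueueWorker {
    fn poll(&mut self, time_budget: Option<Duration>) -> Option<Duration> {
        let start = Instant::now();
        while let Some(task) = self.tasks.pop_front() {
            task();
            // The budget is checked between tasks only; a running task is never preempted.
            if let Some(budget) = time_budget {
                if start.elapsed() >= budget {
                    break;
                }
            }
        }
        // Ask to be polled again right away if work remains; otherwise express no wish.
        if self.tasks.is_empty() {
            None
        } else {
            Some(Duration::ZERO)
        }
    }
}
```

With a zero budget, a call to `poll` runs exactly one queued task and asks to be polled again; with no budget it drains the queue.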

aclysma commented 4 years ago

After lots of good discussion in Discord and thinking about it a bit, I think this is a much more difficult problem than what raw-window-handle addresses. I do also perceive that there is some risk of ecosystem split, so I'd like to see something done, but I don't think a solution will come easily. It seems like even defining the problem in a way that everyone completely agrees with is difficult.

Case in point, I was thinking of the problem differently than @kabergstrom. As I understand it, his proposal inserts extensibility at a different layer of the stack than what I had in mind. I think both approaches could be useful and are fairly orthogonal.

The problem as I had it in mind was that many crates send their work directly to a thread pool implementation. So for example: [Specs/Shred] -> [Rayon] -> [Hardware Threads]

In this example, specs/shred is strongly coupled to rayon. AFAIK there isn't a way to have the work sent to tokio or some other executor.

@bitshifter mentioned PhysX has a solution for this: https://gameworksdocs.nvidia.com/PhysX/4.1/documentation/physxguide/Manual/Threading.html#cpudispatcher

At first I was thinking we could recommend crates offer an API like this, but this could end up being quite a lot of work for people maintaining them. Crates like rayon are really pleasant and easy to use, allowing code like this: (0..100).into_par_iter().for_each(|x| println!("{:?}", x)) I don't think we would be successful asking people to change from that to rolling their own task delegation layer.

I also think there is potentially a lot of diversity in what kinds of tasks a crate can produce. Tasks could be long/short-running, low/high priority, IO/CPU bound. Sometimes an end-user will want the work generated by an upstream crate to be pinned to a particular thread. Sometimes it's important to allow tasks to stack up to create back pressure and slow down the amount of work an upstream crate is producing. Some tasks are fire-and-forget, and others block code that needs to run immediately after the work is done, possibly using a result from the tasks. Different games might even need to handle work coming from the same upstream crates differently.

So even if upstream crates had a task delegation layer like PhysX, they'd probably have their own small differences, for good reason.

While a utility crate could probably be created to help upstream crates add a task delegation layer, I think it would be difficult to come up with a single interface that expresses every possible usage an upstream crate might need. The communication is actually bidirectional - the crate generating the work has to express what to do, and also be able to listen for a result.
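As a rough illustration of that bidirectional communication, the sketch below pairs a task with a set of scheduling hints and a channel on which the submitting crate can wait for the result. Every name here (`TaskHints`, `Priority`, `Kind`, `Task`, `task_with_result`) is hypothetical, and the hints cover only a few of the dimensions listed above. It is not a proposal for the final interface.

```rust
use std::sync::mpsc::{self, Receiver};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    Low,
    High,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Cpu,
    Io,
}

/// Hypothetical hints an upstream crate attaches to the work it produces.
#[derive(Debug, Clone, Copy)]
struct TaskHints {
    priority: Priority,
    kind: Kind,
    /// Index of the thread the task should be pinned to, if any.
    pinned_thread: Option<usize>,
}

/// What the upstream crate hands to whoever owns the threads.
struct Task {
    hints: TaskHints,
    run: Box<dyn FnOnce() + Send + 'static>,
}

/// Wraps a closure so the submitting crate can later wait for its result.
/// A fire-and-forget caller simply drops the returned receiver.
fn task_with_result<R, F>(hints: TaskHints, work: F) -> (Task, Receiver<R>)
where
    R: Send + 'static,
    F: FnOnce() -> R + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    let run: Box<dyn FnOnce() + Send + 'static> = Box::new(move || {
        // The send fails only if the receiver was dropped, which is fine
        // for fire-and-forget tasks.
        let _ = tx.send(work());
    });
    (Task { hints, run }, rx)
}
```

The owner of the threads decides where and when to call `run` based on `hints`; the submitting crate blocks on the receiver only if it needs the value.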

As I mentioned before, this is different from @kabergstrom's approach. I don't think one is better than the other, and I could see both approaches being used at the same time.

Whatever we do, I think it will need to be prototyped and experimented with, and the process won't be as quick and easy as it was for raw-window-handle.

bitshifter commented 4 years ago

@aclysma I would see this as an internal detail that would not change the user-level API of any crate. For example, PhysX doesn't require you to implement their CPU dispatcher API; they provide a default implementation and it doesn't change the high-level use of the library. I wouldn't expect this kind of interface to change rayon any more than their current ThreadPool interface, which is used behind the scenes. It is possible, however, that each library would have differing requirements, which would make supporting a common interface difficult. To determine that, someone would need to audit the crates that have their own thread pools and see what kind of features those thread pools use.
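A minimal sketch of how such an internal dispatcher layer could look in Rust, assuming a hypothetical `Dispatcher` trait and a hypothetical library type `Physics`. This is loosely in the spirit of the PhysX CPU dispatcher and not a transcription of its API. The library ships a default dispatcher, and the high-level call is the same whichever dispatcher is installed.

```rust
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;

type Task = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical delegation interface that the library calls internally.
trait Dispatcher: Send + Sync {
    fn submit(&self, task: Task);
}

/// Default shipped by the library: one OS thread per task (deliberately naive).
struct SpawnThreadDispatcher;

impl Dispatcher for SpawnThreadDispatcher {
    fn submit(&self, task: Task) {
        thread::spawn(task);
    }
}

/// User-provided alternative: run the task on the calling thread.
struct InlineDispatcher;

impl Dispatcher for InlineDispatcher {
    fn submit(&self, task: Task) {
        task();
    }
}

/// Hypothetical library type that owns a dispatcher.
struct Physics {
    dispatcher: Arc<dyn Dispatcher>,
}

impl Physics {
    /// Uses the library's default dispatcher.
    fn new() -> Self {
        Physics { dispatcher: Arc::new(SpawnThreadDispatcher) }
    }

    /// Lets the user route the library's work onto their own threads.
    fn with_dispatcher(dispatcher: Arc<dyn Dispatcher>) -> Self {
        Physics { dispatcher }
    }

    /// The high-level API is identical regardless of the dispatcher in use.
    fn sum_of_squares(&self, inputs: &[u64]) -> u64 {
        let (tx, rx) = mpsc::channel();
        for &x in inputs {
            let tx = tx.clone();
            self.dispatcher.submit(Box::new(move || {
                let _ = tx.send(x * x);
            }));
        }
        // Drop the original sender so the receiver ends once all tasks finish.
        drop(tx);
        rx.iter().sum()
    }
}
```

A game that wants full control over threading would call `Physics::with_dispatcher` with its own implementation; everyone else calls `Physics::new` and never sees the trait.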

msiglreith commented 4 years ago

Proposal

This is a first approach to pushing context information from the call site down to libraries. The proposal therefore focuses on the library interface only, based on the following assumption:

For the caller of a library function, it is sufficient to provide task-relevant data at this level of abstraction (e.g. a high-level library task won't spawn low-level library tasks).

This allows splitting the issue of providing an API into two parts:

1. Define a guideline on how to define library APIs to pass context-specific data
2. Provide a common trait for specific tasks

Practical Part

IMO the issue of defining a task API is similar to passing custom allocators down to libraries. As a result, point 1 is the same for both issues (task and allocator), while point 2 is specific to the problem.

To tackle the 1st point, the proposal is for the caller to create suballocators and subexecutors (let's call them contexts) and pass these to the library.

Example

let low_task_executor = main_executor.low_priority();
entities.par_iter(&low_task_executor).for_each(|x| { .. });

let linear_allocator = main_allocator.get_linear_allocator(..);
renderer.set_allocator(linear_allocator);

The low_task_executor would implement a common Executor trait and linear_allocator a common Alloc trait.
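The executor half of that example could be sketched with std only as follows. The `Executor` trait, `MainExecutor`, `LowPriority`, `run_all` and `for_each_on` are all hypothetical stand-ins; in particular, `for_each_on` replaces the `par_iter(&executor)` call from the example, and jobs are run on the calling thread to keep the sketch short.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical common trait that libraries would accept.
trait Executor {
    fn spawn(&self, job: Job);
}

#[derive(Default)]
struct Queues {
    normal: VecDeque<Job>,
    low: VecDeque<Job>,
}

/// Hypothetical application-owned executor.
#[derive(Clone, Default)]
struct MainExecutor {
    queues: Arc<Mutex<Queues>>,
}

/// Sub-executor ("context") handed to a library; it shares the main queues.
struct LowPriority {
    queues: Arc<Mutex<Queues>>,
}

impl MainExecutor {
    fn low_priority(&self) -> LowPriority {
        LowPriority { queues: Arc::clone(&self.queues) }
    }

    /// Runs all queued jobs on the calling thread, normal priority first.
    /// Returns the number of jobs that were run.
    fn run_all(&self) -> usize {
        let mut ran = 0;
        loop {
            // Hold the lock only while picking a job, so jobs may spawn more jobs.
            let mut queues = self.queues.lock().unwrap();
            let mut job = queues.normal.pop_front();
            if job.is_none() {
                job = queues.low.pop_front();
            }
            drop(queues);
            match job {
                Some(job) => {
                    job();
                    ran += 1;
                }
                None => return ran,
            }
        }
    }
}

impl Executor for MainExecutor {
    fn spawn(&self, job: Job) {
        self.queues.lock().unwrap().normal.push_back(job);
    }
}

impl Executor for LowPriority {
    fn spawn(&self, job: Job) {
        self.queues.lock().unwrap().low.push_back(job);
    }
}

/// Hypothetical library entry point: it takes its execution context from the caller.
fn for_each_on<T, F>(executor: &dyn Executor, items: Vec<T>, f: F)
where
    T: Send + 'static,
    F: Fn(T) + Send + Sync + 'static,
{
    let f = Arc::new(f);
    for item in items {
        let f = Arc::clone(&f);
        executor.spawn(Box::new(move || (*f)(item)));
    }
}
```

The library only sees `&dyn Executor`, so it cannot tell whether it was given the main executor or a low-priority view of it; that decision stays with the caller.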

Pros/Cons

Pros:

Doesn't require a #[global_executor] or further language support
Quite flexible, but also allows 'simple' interfaces for libraries

Cons:

Possible limitations for library creators
Can be quite verbose to pass these around, and it complicates the API (there are similar issues when designing UI APIs...)

skade commented 4 years ago

Hi! I was pointed to this discussion and was wondering how the async-std team could help here, potentially working against an ecosystem split. We have also been in touch with other groups around special execution needs, e.g. media streaming.

A little known fact about async-std is that it comes in 3 pieces:

async-task, a general purpose task allocator, shipped as a library.
The main API for IO handling
The runtime

If you compile async-std without the "runtime" flag, you basically get a hollow interface. That allows you to ship your own variant of it, better tuned to your use-cases. This runtime could have specialised spawning interfaces, fulfilling your needs better.

async-std is built with the idea that you may need to choose your own execution model, and it also gives you ready-made tools to build your own executor. Its default implementation hides all that and gives you no access to the internal runtime, but that also gives you the ability to move to something more specialised and better geared towards your environment, without breaking dependent libraries.

We'd be very interested in talking about the problem of libraries not abstracting over executors and not being prepared for the presence of multiple executors, and we want to spend time on design work there.

Wodann commented 4 years ago

We already have several proposals that we want to prototype with, but as discussed in the wg meeting it'd be good to know the use cases that the prototype API should test:

Don't leave time slicing to the OS (@kabergstrom)

If any use cases are missing, please list them.

Lokathor commented 4 years ago

There's a new Repo for the prototypes to be collected into: https://github.com/rust-gamedev/thread-pool-api-prototypes

msiglreith commented 4 years ago

Job systems in the wild, with a focus on the executor part (excluding data dependencies, high-level scheduling over multiple frames, etc.), each with a short description:

Name	Reference	Description
Parallelizing the Naughty Dog engine using fibers	Slides	- Uses a fiber-based system (rough scale: OS threads (1-10) -> pool of fibers (10-100) -> jobs (100-1000)). - Requires the jobs to know about the executor, or rather the execution context, due to sync primitives (also an issue with futures in general!). - I/O handled in OS threads. - Should allow spawning jobs inside of jobs and yielding to them. - Jobs separated into 3 queues based on priorities
Multithreading the Entire Destiny Engine	Video	System specific thread pool layout (PS3 <-> PS4 <-> XBOX1 <-> ..)
'Marvel's Spider-Man': A Technical Postmortem	Video at ~2min	2 locked/pinned (?) threads (main and rendering), 4 workers each 3 threads each with different priority (pinned to one core), I/O thread and further ones for audio, physics etc.

(Ideally, the API should not hinder integration of profiling/debugging middleware like RAD Telemetry)

bitshifter commented 4 years ago

I found another API example of what a thread pool API might look like in C++ land. Another piece of physics middleware, this time the FEMFX library from AMD - https://gpuopen.com/gaming-product/femfx/

The interface appears to be a bunch of function pointers - https://github.com/GPUOpen-Effects/FEMFX/blob/master/amd_femfx/inc/FEMFXTaskSystemInterface.h

You can see an implementation that has compile time support for UE4's task scheduler, Intel TBB and TLTaskSystem which appears to be FEMFX's own implementation of a task system (see https://github.com/GPUOpen-Effects/FEMFX/blob/master/samples/sample_task_system/TLTaskSystem.cpp)

I thought this was another good example demonstrating usage in a major AAA game engine, in addition to the PhysX interface I mentioned earlier.

Lokathor commented 4 years ago

https://async.rs/blog/stop-worrying-about-blocking-the-new-async-std-runtime/

This might do too much stuff automatically for it to be considered acceptable by everyone, but it's interesting as a point of reference at least.

Wodann commented 4 years ago

The following blogpost highlights a crate that might cover most of this use case:

https://blog.wnut.pw/2020/02/25/anouncing-async_executors-a-building-block-for-executor-agnostic-libraries/

msiglreith commented 4 years ago

Executor trait interface: https://github.com/bastion-rs/agnostik
