mratsim opened 4 years ago
Rust also coupled IO and task parallelism in the past and decided to decouple them:
https://github.com/aturon/rfcs/blob/remove-runtime/active/0000-remove-runtime.md
https://github.com/rust-lang/rfcs/blob/master/text/0230-remove-runtime.md
An idea on how to play well with asyncdispatch, Chronos, or any future async/await library: they all offer a `poll()` function that runs their event loop.

We can add a field `pollHook*: proc() {.nimcall, gcsafe.}` on each worker. It would be set up by `setPollingFunction(_: typedesc[Weave], poll: proc() {.nimcall, gcsafe.})` before Weave initialization (at first; this restriction can be relaxed later).
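A minimal sketch of what that registration could look like (the `Weave` typedesc and `PollProc` alias here are illustrative placeholders, not Weave's actual internals):

```nim
# Sketch only, not Weave's real internals: a nimcall/gcsafe hook slot
# that every worker checks.
type
  Weave* = object                        # placeholder for the library's exported typedesc
  PollProc = proc() {.nimcall, gcsafe.}

var pollHook*: PollProc                  # nil until a user registers an event loop

proc setPollingFunction*(_: typedesc[Weave], poll: PollProc) =
  ## Must be called before Weave initialization (at first;
  ## this restriction could be relaxed later).
  pollHook = poll

# Hypothetical registration with asyncdispatch's event loop
# (whether asyncdispatch's `poll` can be wrapped gcsafe is an open question):
# import std/asyncdispatch
# Weave.setPollingFunction(proc() {.nimcall, gcsafe.} =
#   if hasPendingOperations(): poll(0))
```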
Then we modify `loadBalance()`, `sync()`, and `syncScope()` to interleave `pollHook` calls before and after executing a task. Note that `loadBalance()` is called in between each `parallelFor` iteration; the branch `if not pollHook.isNil:` is very predictable and should be costless.

Note that worker threads will sleep if they have no tasks, but it does not make sense for them to try to handle IO events without a task.
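Interleaved into the scheduler, the call sites could look like this sketch (`loadBalance` and the `runTask` helper are simplified placeholders for Weave's actual scheduling code, and `pollHook` is the field described above):

```nim
var pollHook: proc() {.nimcall, gcsafe.}   # set via setPollingFunction, nil otherwise

proc loadBalance() =
  # ... existing work: answer steal requests, split loops, share tasks ...
  if not pollHook.isNil:   # branch is highly predictable, essentially free when unset
    pollHook()             # drain ready IO events; must not block

proc runTask(task: proc() {.nimcall, gcsafe.}) =
  if not pollHook.isNil: pollHook()   # poll before the task
  task()
  if not pollHook.isNil: pollHook()   # and after it
```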
A potential issue is that a task can be migrated, or, for a parallel loop, even split and executed on two different threads. Do the async libraries use `{.threadvar.}` to manage some global state? Because that will not work.
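For example, per-thread event-loop state kept like this would break under task migration (the names here are hypothetical, sketching the asyncdispatch-style lazily-created dispatcher pattern):

```nim
# Hypothetical per-thread event-loop state, as async libraries
# commonly keep via {.threadvar.}:
var dispatcher {.threadvar.}: ref RootObj   # lazily created, one per thread

proc currentDispatcher(): ref RootObj =
  if dispatcher.isNil:
    new(dispatcher)
  dispatcher

# A future registered on thread A's dispatcher cannot be completed
# from thread B: if Weave steals the continuation, or splits a
# parallelFor range across threads, the migrated half sees a
# different (freshly created) `dispatcher` instance.
```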
RFC #132 and its implementation in #136, which runs Weave as an independent background service, are probably a better path forward.
Weave / Project Picasso focuses on CPU-bound tasks, i.e. tasks that do not block and where throwing more CPU at the problem yields the result faster.
For IO-bound tasks the idea was to defer to specialized libraries like asyncdispatch and Chronos that use OS primitives (epoll/IOCP/kqueue) to handle IO efficiently.
However, even for compute-bound tasks we will have to deal with IO latencies, for example in a distributed system or cluster. So we need a solution for doing useful work during the downtime without blocking a whole thread.
That means:
Research
Reduced I/O latencies with Futures
Kyle Singer, Kunal Agrawal, I-Ting Angelina Lee
https://arxiv.org/abs/1906.08239
The paper explores coupling a Cilk-like work-stealing runtime with an IO runtime based on Linux epoll and eventfd.
A practical solution to the Cactus Stack Problem
Chaoran Yang, John Mellor-Crummey
http://chaoran.me/assets/pdf/ws-spaa16.pdf
Fibril: https://github.com/chaoran/fibril
While not explicitly mentioning async I/O, the paper and the corresponding Fibril library use coroutine/fiber-like tasks to achieve fast, extremely low-overhead context switching. Coroutines are very efficient building blocks for async IO.
Implementations
The Go scheduler mixes IO and compute, though Go is not particularly known for its compute throughput (probably because goroutines optimize for fairness/latency rather than throughput):
https://assets.ctfassets.net/oxjq45e8ilak/48lwQdnyDJr2O64KUsUB5V/5d8343da0119045c4b26eb65a83e786f/100545_516729073_DMITRII_VIUKOV_Go_scheduler_Implementing_language_with_lightweight_concurrency.pdf
Julia's PARTR mixes IO, via a libuv event loop per thread, with a parallel depth-first scheduler: https://github.com/JuliaLang/julia/blob/f814301bd9503e243276b356d0cdbfcaa5ae0b8a/src/partr.c#L260-L295
boost::fiber has work-stealable fibers
https://www.boost.org/doc/libs/1_71_0/libs/fiber/doc/html/index.html