Project Picasso - a multithreading runtime for Nim

"Good artists borrow, great artists steal." -- Pablo Picasso

Introduction

The Nim destructors and new runtime were introduced to provide a GC-less path forward for Nim libraries and applications where it made sense. One of their explicit use case is making threading easier.

RFC goals

This RFC aims

to present the current challenges and the design space of multithreading runtime.
collect use-cases and discuss goals and non-goals of a multi-threaded runtime.
understand if we need compiler support for some features and if not:
- discuss if we should allow competing runtimes and allow switching just like Nim allows multiple GCs (refcounting, mark-and-sweep, boehm, no gc).
gather some metrics ideas to benchmark runtime systems.
ultimately have people implementing a runtime system or part of (there are plenty of pieces needed)

The problem domain:

The word "thread" had many meanings in the past or words closely related (green threads vs heavy threads, coroutines, fibers, ...).

I.e. threading means how to interleave different routines and their contexts of execution.

This RFC focuses on "heavy" threads as used for computation on multi-core systems.

Why Project Picasso?

The new runtime introduced a borrow-checker and most successful multithreading runtimes uses work-stealing for load balancing. Now re-read the quote :wink:.

Project Picasso - a multithreading runtime for Nim

Reading on Nim related concepts

Araq's blog posts on concurrency (2013)
- https://nim-lang.org/araq/concurrency.html
- https://nim-lang.org/araq/concurrency2.html
Destructors (2017)
- https://nim-lang.org/araq/destructors.html
- https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition
Newruntime, owned refs, borrow-checking (2019)
- https://nim-lang.org/araq/ownedrefs.html
- https://github.com/nim-lang/RFCs/issues/144

Where are we now?

If you want to use multiple cores in Nim you can currently use

Raw threads via createThread (pthreads on Unix, Fibers on Windows)
Threadpool with
- the spawn/^ functions
- The parallel statement
- channels for inter-thread communication
OpenMP with
- The || OpenMP operator for parallel for-loops or task-loops
- Emitting OpenMP blocks

However, I'd argue that

createThread is a too low-level abstraction for most.
The threadpool has contention issues due to using a global queue, and it has no load balancing either.
OpenMP does not supported nested parallelism. The implementation of tasks varies wildly (GCC's uses a global queue as well so load-balancing and contention are issues) and cannot be built upon (for example for task graphs). Furthermore, OpenMP requires going through C/C++ compilation and cannot be used with nlvm or projects that would want to JIT parallel code.

Brief overview of the types of parallelism

There are several kinds of parallelism, some addressed at the hardware level and some addressed at the software level.

Let's start with hardware level not addressed by this RFC:

Instruction-Level Parallelism:

Modern superscalar processors have multiple execution ports and can schedule multiple instructions at the same time if they don't use the same port and there is no data dependency

SIMD: Single Instruction Multiple Data:

Often called vectorization, this is SSE, AVX, etc: one instruction but that applies to a vector of 4x, 8x, 16x integers or floats.

SIMT: Single Instruction Multiple Threads:

That is the threading model of a GPU. Threads are organized at the level of a Warp (Nvidia) or Wavefront (AMD) and they all execute the same instructions (with obvious bad implications for branching code).

SMT: Simultaneous Multi-Threading:

In Intel speak "Hyperthreading". While a superscalar processor can execute multiple instructions in parallel, it will sometimes get idle due to instruction latency or waiting for memory. One way to reclaim performance for a limited increase in chip size is with HyperThreading with each physical cores having 2 (usually) to 4 logical cores siblings (Xeon Phi) that can use the same hardware resources to execute multiple threads.

Further information: https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

What are we interested in?

Exploiting multiple cores:

Recent laptops now ship with 4 cores, even phones ship with 4 cores, we need to provides tools for the devs to use them.

At the software level

Data parallelism:

The easy part, you work on elements and your operation maps to the same operation on all elements. For, incrementing all elements of an array by one.

Task Parallelism:

The complex part, you have tasks (jobs) that are usually different in terms of computation, resources, time required but can be scheduled in parallel. Those can produce new tasks. For example, issuing a parallel search on an unbalanced tree data-structure.

What are we less interested in

Stream Parallelism:

You have a data stream and apply a pipeline of transformations on it, possibly with forks in the stream and joins. An example would be a parallel iterator library or a parallel stream program that takes an input compressed image archive, decompresses it, applies transformations to some images and then recompress those in a new archive.

I believe that stream parallelism is sufficiently similar to data parallelism and task graphs that addressing data and task parallelism will make stream processing much easier.

Use-cases

I will need your help for this section. Some obvious needs are:

spawn computeIntensiveTask() (Task-parallelism)
Array processing in numerical computing (Data parallelism)

In both cases parallelism can be nested if a parallel Nim library calls another parallel Nim library. The system should behave properly if a parallel GUI calls a parallel image library for example.

API

Having good features will draw people, having good APIs will make them stay.

Here is an overview of the design space.

Data parallelism only needs 5 primitives:

parallel section (to setup thread local values)
parallel for
parallel reduce
barrier
critical section

Task parallelism has much more needs:

spawning a new job
Representing a future value with Flowvar
blocking (^) until the child task is finished
alternatively polling with isReady
scheduling continuations
cancel a computation (user changed image on the GUI so compute is cancelled)

As you can see there is a lot of parallel with async/await IO. This is probably a good thing, i.e. use async/await for blocking IO and spawn/^ for non-blocking compute.

For the rest, I will assume that threads are too low-level of an abstraction and that parallel annotation (for data parallelism) and tasks (for task parallelism) are much easier and more natural to manipulate for a developer. A runtime system should figure how to distribute those on the hardware.

Furthermore, data parallel primitives can be expressed in terms of task primitives so I will focus on tasks.

On the non-obvious choices, there is:

How to communicate between threads
- Message passing (i.e. Channels): Share by communicating instead of communicate by sharing (from Rust and Go)
- Shared memory:
- atomics and locks
For channels:
- Have an object shared by producer(s) and consumer(s)
- Have a Sender object and a Receiver object that statically ensure that it's correctly used
How to represent a task:
- An object
- A concept/interface/trait
- A closure (that captures its context)
- A pure function
- Note that the choice may have impact on:
- Nim DLLs
- C interface, which is valuable for Nim as a Python backend or for JIT code to tie back to Nim.
- Hot-code reloading
An error model:
- No exceptions in the runtime, unless we know have thread-safe exceptions
- Error codes
- If yes, we need a spawn that accepts a Flowvar for in-place modification
- Options?
- A richer API like nim-result
- Note that Nim enums can use strings
```
type PicassoError = enum
Ok = "All is well"
ThreadMemError = "Could not allocate memory to create a thread"
TaskMemError = "Could not allocate memory to create a task"
AlreadyCancelledError = "Task was cancelled"
```
  And those can be preformatted for printf
TaskmemError = "Thread %d: could not allocate memory to create a task"
How to ensure composition?
How to transfer ownership between threads?
Are there use cases where lower-level access to the threadpool is desirable?

In terms of robustness:

message passing benefits from CSP (Communicating Sequential Process), which provides a formal verification framework for concurrent system that communicates via channels
Haskell inspired C# with the Continuation Monad. If there is one thing that Haskell does well it's composition, and also having a solid type system.

Load-balancing

Work-stealing won both in theory and in practice. It has been proven asymptotically optimal in terms of performance.

However there are plenty of implementation subtleties that can have heavy influence on workloads:

What to do after spawning work:
- Help-first: continue on the current execution context (also called child-stealing). Breadth-first task creation: on a single-thread context, with a for loop for N tasks, N tasks will be created and live before the thread will do the job one by one.
- Work-first: jump on the freshly spawned work (also called parent-stealing or continuation stealing). This requires compiler support similar to coroutines for restoring stackframes. Breadth-first task creation: on a single-thread context, with a for loop for N tasks, only 1 task will be live resolved before the thread goes to the next.
Steal one tasks vs Steal half tasks
Leapfrogging: work-stealing allows an idle worker to steal from a busy one, but what if a busy worker is blocked by an unresolved Flowvar? Allowing it to continue instead of blocking is called leapfrogging
Loop splitting: some tasks include loops which for efficiency reasons are not split in a task for each element. But when a loop is big, it might be worth it to split it to allow other worker threads to steal it. Except that the operation within a loop might be either very cheap or very costly so the "grain"-size matter, and adaptative splitting would be very nice.
Hierarchical work-stealing: high-end processors like AMD Threadripper or Intel Xeon Bronze/Silver/Gold/Platinum have a Non-Unified Memory Architecture (NUMA). Meaning they have significantly more affinity with the memory directly attached to their cores and accessing "far" memory causes a significant penalty. In that case it is important to only steal work corresponding to the local fast memory.
CPU consumption and latency: when a worker finds no work, does it poll, how frequently, does it yield?
How to select theft victims?
How to detect work termination?

Interested and not feeling overwhelmed yet? I have gathered an extensive litterature in my research repo.

Scheduler implementation

Like the choice of communication between threads, for synchronization as scheduler needs to choose between:

Shared memory
Message passing
Software Transactional Memory (database like commits and rollback based on transaction logs)

While the traditional focus has been shared memory, involving atomics and locks. I read and ported the code of a very inspirational Message Passing based work-stealing scheduler thesis in my experimental repo.

Haskell is the only production grade user of Software Transactional Memory. It has caught C++ interest, here is a good overview of the model and the C++ proposal sponsored by Michael and Scott (from the Michael-Scott concurrent queue fame). One of the main difficulties with STM is that you cannot replay side-effects.

Note that for scheduler implementation all three strategies can be formally verified as the synchronization between threads is done through a very specific data structure:

Shared Memory via Chase-Lev Deque see proof on weak memory model
Message Passing via Channels (see CSP article)
STM via transaction logs

Also all 3 already had hardware support in the past (in either experimental hardware for message passing or buggy hardware for transactional memory).

Which brings us to ...

Hardware

The hardware we choose to target will greatly influence the runtime.

Scheduling for a weak memory model like ARM, strong memory model like x86, a workstation with 2 CPUs or a cluster for distributed computing.

For example, the Cell processor (for Playstation 3) made it impossible to implement efficient concurrent data structure. Or shared memory is impossible for distributed computing or heterogeneous architecture with GPU nodes.

Messaging-passing is often associated with overhead.

Hardware transactional memory is only supported on recent Intel chips and GCC-only and was notoriously buggy for 3 chip generations (Ivy Bridge, Haswell, Broadwell).

Note that in all cases, implementation "details" matter a lot and message passing can be as fast as shared-memory as shown by my proof-of-concept channel-based work stealing scheduler.

Let's talk about the biggest implementation "detail".

Memory

For compute intensive operations the bottleneck is often not the CPU GFlop/s but the memory to keep the processor fed with data to process. This has been captured by the roofline model and the notion of arithmetic intensity (ratio of compute operations / bytes needed to carry it). Only operations with high arithmetic intensity can use the CPU at 100%, most are bottlenecked by memory and can use 10-20% of the compute.

This means that memory locality and efficient memory allocation and reuse is key: memory pools, object pools, stack arrays with alloca, ...

Also for NUMA architecture, a NUMA aware allocator would be helpful.

I.e. concurrent data structures should probably accept an "allocator" argument.

Extras

Some extras that are not in scope but interesting nonetheless

relation with the async/await event loops
fiber/coroutine pools as in Boost::fibers or the Naughty Dogs presentation (video and slides
Task Graphs
Dealing with GC types (as GC will still be useful)
Mapping with GPU: beyond the obvious offloading of for-loops to GPU, Cuda and OpenCL provides a async stream and event API to offload, provide continuations and then block or poll until the computation stream has finished.

Benchmarking

Once we have designed our unicorn™, we need to make sure it fits our performance requirements, its overhead, its scalability and how it fares against other close-to-metal language.

Here are a couple of ideas:

Runtime overhead (Task Parallelism): A recursive fibonacci benchmark will quickly tell how much overhead the framework has because the task is completely trivial. It will also tell us the scalability of the task system as the number of tasks grows at 2^N. Key for performance:
- Memory allocators
- Having distributed task queues/deques to limit contention
High-performance computing (Data Parallelism) I have implemented a matrix multiplication in pure Nim as fast as industry-standard OpenBLAS, which is Assembly + raw pthreads. It requires 2 nested parallel for loop and can also be called from outside parallel regions as it's a basic building block for many scientific and machine learning workloads. Key for performance:
- As long as the matrix multiplication is well implemented it's an easy task as workload is completely balanced (no need for stealing), tasks are long-running (work is much bigger than overhead) and complex enough to maximize compute as long as memory is fast enough.
- Thread pinning will help a lot as it is very memory intensive and optimizations are done to keep data in L1, L2 caches and the TLB
- Being aware of and not using hyperthreading will help because otherwise the physical core will be bottlenecked by memory bandwith to retrieve data from 2 threads operating on different matrix sections. Extra: would be to test on a NUMA machine.
Load balancing (Task parallelism) Tree algorithms creates a lot of tasks but if the tree is unbalanced idle workers will need to find new work. An example use-case is Monte-Carlo Tree Search used in Decision Processes and Reinforcement Learning for games IA and recently in finance. In short, you launch simulation on diffrent branches on a tree, stopping if one is not deemed interesting but searching deeper on interesting branches. The Unbalanced Tree Search benchmarks is described in this paper. Key for performance:
- load balancing
Energy usage (Task parallelism):

When workers find no worker they should not uselessly consume CPU. A backoff mechanism is needed that still preserve latency if new work is suddenly available. A benchmark of energy usage while idle can be done by just checking the cpuTime (not epochTime/wallTime) of a workload with a single long task compared to serial.
Single loop generating tasks (Task Parallelism)

Such a benchmark will challenge the runtime to bundle or potential split work with incoming steal requests. This stresses how many consumers a single producer can sustain, see Nim implementation.
A divide-and-conquer benchmark like parallel sort
Black-and-scholes: The Black-and-Scholes equation is the building block of financial modeling.
Wavefront scheduling (Task Graphs) wavefront is a pattern that often emerges in image processing when after computing pixel [i, j], you can compute pixels [[i+1, j], [i, j+1]], then [[i+2, j], [i+1, j+1], [i, j+2]]. This is also a key optimization for recurrent neural networks (Nvidia optimization blog - step 3).

Community challenges

Let's go back from the nitty-gritty details and look into the challenge for Nim.

Given the breadth of the needs and design space: do we want to allow multiple libraries, do we try our hands at a one-size fits all?
- Example: real-time system and games might want scheduling with a priority queues which are hard to make concurrent and I'm not even sure about work-stealable.
Assuming we allow multiple libraries, how to make sure end-users can use one or the other with minimal cost, does the standard library enforce an interface/concept?
When do we ship it?

I hope you enjoyed the read.

TL;DR: Designing a multithreading runtime involve many choices, probably some conflicting ones in terms of performance, ergonomy, complexity, theoretical properties (formal verification) and hardware support.

Wow, I can see you spent a lot of work into this PR. Yes threading is something we should improve in Nim and this document is great work.

To my experience a good API only bubbles up, when we have a task where we want to use this API. Creating a multithreading API in isolation doesn't work. So my suggestion is that we also need to specify a problem that this API should be able to solve. This problem should be an interesting problem to solve, not these micro benchmarks that sort a list of random integers in parallel., and then we can see how well the patterns work out.

I am saying this, because the road to blender success were the open movie projects. These projects helped the blender developers to focus on what is really important. They were not made for the sake of making a movie, they were made to improve on Blender. So I think we need the equivalent of a Blender Open Moive for Nim to improve upon the Multithreading functionality.

Minor pet peeve - in terms of binomial trees for option pricing, there really are many numerical methods that scale better (e.g. Broadie & Detemple 1996 https://sci-hub.tw/10.1093/rfs/9.4.1211, but there is a whole cottage industry of methods here - BBSR is just super-intuitive/easy to explain to anyone at all familiar with the basic problem).

So, to me, at best that feels more like the Fibonacci benchmark - "a well known but bad way to get an answer", (differing only in that it seems to be much less well known how bad a way, hence this comment).

Great proposal.

Asked this on Gitter but I thought I should restate it here: w/r/t the load-balancing and the hardware section, I'm curious what precisely do you intend here: do you want to approach some sort of compile-time/init-time fine-tuning for the underlying implementation or just an idealized general implementation?

indeed very impressive. A "one-size fits all" will be hard to get. Think you will need very much metadata at compile-time to get the best out of the hardware and the business-case. But now I started to dream about a "RTOS-less" runtime - a lightweight, easy to portable HAL (in Nim?) and Nim....

Edited to add the table of contents and a section of the parallelism options that we have currently in Nim (createThread, threadpool, OpenMP).

@krux02 I do have non-toy uses:

Data Parallelism:
- I can switch all OpenMP uses in Arraymancer and Laser to the Picasso runtime provided the performance is there.
Task Parallelism:
- cryptographic signatures and verifications are our number one bottleneck (~30% processing time) for Ethereum 2 at Status, being able to process multiple in parallel would be very helpful. As they come asynchronously from the network an API similar to async/await but for compute would be a great match.
- DAG and tree algorithms power several state-of-the-art machine learning and reinforcement learning algorithms:
- Gradient Boosted Trees for everything but perception (vision/test/sound) i.e. predicting price, quantity, web visits, subscriptions, sales, ...
- Beam search to improve natural language models
- Computation graphs for neural networks
- Monte-Carlo Tree Search for robotics, game AI and reinforcement learning

Regarding non-toy uses outside of my expertise, I'd say a parallel ray-tracer (see ray-tracing in one weekend) would be pretty nice and it's visual.

@c-blake noted. Your link is dead though.

@awr1 @mikra01: For now, I have an implementation idea that should cover from phones/raspberry Pi to laptops to single-socket workstations. I think a library with a set of APIs that you can import picasso is the most flexible (instead of compiler builtins like spawn and parallel) and when you want something suited for NUMA or distributed computing you can do import picasso_distributed. Regarding compile-time fine-tuning, there are not a lot to fine-tune at compile-time for a mature library. There will be:

the array sizes, for example Nim threadpool and my PoC can hold up to 256 threads, but it's overkill in most cases
maybe the polling/yielding frequency.
the cache line size for padding to prevent false sharing. It's 64 bytes for almost all common platforms except Samsung phones that use 128 bytes
toggling profiling and multithreading asserts
memory allocator related (like use jemalloc, tcmalloc, a custom pool_allocator when dynamic is not allowed, the object pools size)

For research purposes it will be implementation dependent, in my PoC, you can tune: loop split and work stealing strategy, the number of outstanding steal requests.

At runtime there are some things to detect at init time:

an environment variable with the max number of threads to use
find hyperthread siblings, and allow at init a parameter for not using hyperthreads
NUMA (if we go that far)

Due to work-stealing adaptative nature, good defaults should cover from the current dual-core to 22-core single socket machines.

@mratsim - huh, works for me. Taiwan may be getting blocked for you?

For the curious there's also a book 10 years later covering the paper and amazon's preview (for me) covers and mentions the basic idea https://www.amazon.com/dp/158488567X/ref=rdr_ext_tmb { which is simply use the closed form BS formula for the very last time period to smooth convergence to the point that Richardson extrapolation is workable..In a sense two distinct extrapolate-to-zero-stepsize tricks cooperate to squash error(stepsize|N steps) }. That paper also gives pseudocode for linear-in-memory storage -- also too uncommon, making even 1000 level "trees" fully L1-resident. It's by no means the last word in pricing models/algo efficiency, but it shows how almost trivial upgrades from a defining recursion shift scaling in time/space, much as Newton's method vs. binary search { or naive Fibonacci vs memoized/array/matrix/formulaic, etc., but I realize inefficiency may be "the point" as with Fibonacci }.

@c-blake I can't see the link either. I see the left border decoration with sci-hub, but everything on the right is like a broken link.

@mratsim Regarding the ray tracer. I only implemented toy ray tracers so far, and for that the fragment shader is enough. A computation kernel that runs for every pixel on the screen is exactly what you need for a ray tracer. If you want to go for photo realism with light bounces everywhere this approach has it's limits though.

Huh. For me it auto-downloads the PDF (which is the real content, not any HTML). You can also try just http://sci-hub.tw/ and manually enter the DOI (which is "10.1093/rfs/9.4.1211"). You may also need javascript enabled for sci-hub.tw & maybe cyber.sci-hub.tw? It looks like this is a direct link to the pdf https://dacemirror.sci-hub.tw/journal-article/9bc250c4bfa2abe7c3ee89ed32b59609/broadie1996.pdf . It is a nice introductory survey paper for the field (as it was 25 years ago) with charts, graphs, pseudocode, etc. as well as a great practical example of Richardson extrapolation. Anyway, apologies if it's hard to access! I didn't think it would be.

This is a great collection of knowledge. I've let my HPC knowledge decay but I'm still interested in it and as I learn nim I think about how powerful it could be for HPC. If I was going to take on a hobby project to refamiliarize myself I'd try to write nim bindings or a macro for OpenACC- I'm still such a greenhorn I don't even know which would be more appropriate! In short, OpenACC is an accelerator primitive toolkit that works along certain compilers to parallelize C,C++ and I think Fortran code for accelerators and the CPU. It's like scripty CUDA (it even has async/await) but can run just as fast on AMD, NVIDIA and multicore CPU targets. Now is a great time to bring support to Nim as GCC 9 has really stepped up support for it and AMD recently contributed better backends for their newer cards.

As I find some free time I'll try to parse Picasso and think of a demonstration or experiment to build against. If you've got any thoughts in general I'm curious to hear them.

I have two thoughts here: one is that there are many abstractions nowadays, for example, channels, that can use the so called green threads, or fibers, instead of a full os thread. This is much easier for the user/dev, but perhaps you are speaking solely on the underlying implementation to enable these abstractions. 2) If you think of an operating system, let's say you have two hard drives, and copy files to both of them - the files moved to each can be done at the same time, but if it's to the same hard drive, it's better to be sequential. so this is true for IO of all kinds, be it network connections, storage, or memory operations. Is there a need for some "manager" to collect those requests, and build some kind of a dependency graph, to determine which resources can next be executed into action? Maybe the design should include such an over-seer, as a more complete solution for optimal decisions. 3) ok I have a third thought, the world is trying to parallelize, and nim is compiled to c/c++ - surely there is some very optimized library that you can target. I know of libmill for example, surely there are many others.

@csajedi OpenACC is probably relatively easy.

You can reuse the OpenMP codepath for OpenACC for loops which I touched in those 2 PRs: https://github.com/nim-lang/Nim/pull/10891/files and https://github.com/nim-lang/Nim/pull/9493/files. And for pragma directives (without for loops) you can follow the techniques I use in my future HPC backend.

Anyway for HPC and scientific computing, I have something much better than Picasso planned, you can read more in the markdown files in Laser

@kobi2187

So I indeed thought of fibers, especially due to the inspiration from the Naughty Dogs talk and slides and boost::fiber, however there are 2 caveats:
- It makes the runtime more complex. The abstraction is Task -> scheduler -> thread instead of Task -> Fiber -> scheduler -> thread. I'm not sure there are gains for pure compute tasks.
- The more focused the runtime is, the easier it is to make it play nice with the async/await for IO i.e. this is compute => Picasso and this is IO => async/await.
- Historically fiber multiplexing on a threadpool is called M:N threading and was tried by many languages: Rust, Java, Glibc before being abandoned. See:
  - http://www.kegel.com/c10k.html#1:1
  - https://mail.mozilla.org/pipermail/rust-dev/2013-November/006550.html
The overseer is the Task Graph part I mentioned as out-of-scope. Mature C++ threading libraries offer them, like Intel's TBB Flowgraph and Cpp-Taskflow or even OpenMP via the depend clause. I think task graphs can be build on top of a good low-level task abstraction at a later time. Basically, the difference with the current proposal is that in the proposal you eagerly create tasks while with a task graph, you lazily describe your operations then execute the graph. Conceptually it's the same difference as a dynamic language vs statically compiled one.
Yes there are very optimized C/C++ libraries but the main blocker would probably be how to pass Nim closures to those libraries, this probably would require compiler support. With a pure Nim implementation, the implementation can be done as a library. Furthermore, my proof-of-concept Picasso implementation has similar (on my 2-core laptop) to much less (18-core workstation) overhead than Intel TBB or LLVM OpenMP. This also avoids distribution issues of .dll/.so and would also help a WASM backend.

@mratsim I'm pretty psyched about Laser. I am following along a bit out of sync. I was going to suggest you add it to "Are we scientists yet" as that page drew me in to learn more about Nim for my own HPC stuff. I see you're already in the mix there so I'll just wait patiently. WASM is going to be fire for nim once it matures past the V1 design.

Status update:

I've finished porting a proof-of-concept code from @aprell from C to Nim at https://github.com/mratsim/weave/tree/master/e04_channel_based_work_stealing.

The main proc to use with it are:

tasking_init(), tasking_barrier(), tasking_exit() to initialize, sync and exit the runtime.
async/await (equivalent to Nim's spawn/^)
async_for to launch a parallel for loop

Examples:

Performance

Performance is better than OpenMP and Intel TBB on a fibonacci micro benchmark which is probably the best way to measure framework max overhead (by spawning an exponential number of do almost nothing tasks):

https://github.com/mratsim/weave/tree/master/benchmarks/fibonacci And I can reproduce the performance of the C code for the thesis in the following tests https://github.com/mratsim/weave/tree/master/e04_channel_based_work_stealing/tests

So I expect all the performance figures in the thesis can be ported to Nim.

What i didn't finish

Porting my state-of-the art matmul to the PoC. I didn't add closures support.
Porting the Unbalanced Tree Search reference benchmark that all industry-standard runtime are compared on.

PoC conclusion

For me the PoC is a success.

The principal issue that is not answered is how to yield the thread when there is no work so that it doesn't burn 100% CPU while waiting for work. Yielding should not introduce too much latency on short tasks as well.

Next steps:

Unfortunately I will be pretty busy the next 2 months, so update will be quite quiet, but here are some of the next steps I envisioned:

Create a separate folder for the production implementation
Use Nim names: spawn/^/FlowVar instead of async/await/Future
Create generic basic data structures for the runtime:
- SPSC channel (Single-Producer Single Consumer Channel) used for tasks and futures
- MPSC channel (Multi-Producer Single Consumer Channel) used for steal requests
- Task object
Create an object pool allocator, potentially reusing Staccato very efficient allocator implementation. In the PoC, tasks and channels caching code can be simplified a lot.
Debug newruntime
Reimplementing the runtime on top of those low-level data structures.

A reminder of the unique features of the current PoC.

It uses message passing instead of shared-memory for synchronization even for the runtime.

This is interesting:

from a formal verification point of view
from a hardware point of view, shared memory across 22+ processors is very expensive and hardware makers are at the limit of this model.
from a extensibility and scalability point of view. Traditionally, parallelism was done via shared memory on a computer/node and for distributed computing via message passing between the nodes of a cluster. The current design allows efficient message passing even within the computer. Concretely, just replace the single computer channels implementation with a distributed channel implementation and the runtime would work. (We would still need some locality aware partitioning logic but a skeleton already exists in the PoC)
from an optimization point of view. Many scheduling techniques are difficult or impossible to implement with shared memory or requires very specific atomics (double word compare-and-swap, ...)
- Stealing half the tasks and adaptative stealing of 1 or more tasks
- Lazily splitting iterations of a parallel loop depending on the number of idle workers (and not with heuristics like split-half)

@mratsim that is an interesting work, thanks for pioneering.

I'm curious how would you rate the solution outlined in https://github.com/vlang/v/issues/1868#issue-489433130 to the principal problem you've noticed:

The principal issue that is not answered is how to yield the thread when there is no work so that it doesn't burn 100% CPU while waiting for work. Yielding should not introduce too much latency on short tasks as well.

About the linked solution above - sure, having an SPSC queue "everywhere" due to multiplexing & demultiplexing involves more copying, but it pays off.

@dumblob, I admit I don't fully understand your linked solution employing go/v-routines and SPSC channels for communication, but let me try to summarize the problem with the scheduler that @mratsim is describing: there are worker threads passing steal requests around when running out of work to do. A worker is allowed to retreat insofar as it can stop sending steal requests when it fails to receive work from its peers. What a worker should not do, however, is stop responding to messages, or else it will stall the requesting workers. So as a result, idle workers sit in a tight loop and constantly check for messages, burning CPU.

One potential solution to this problem is to make sure that idle workers don't receive steal requests at all, allowing them to yield when backing off from scheduling. Another potential solution, which I've started to explore recently, is to let steal requests flow as before, but have workers "steal" and handle steal requests on behalf of idle workers that rather back off than listen to messages.

@aprell thanks for the wrap up.

I admit I don't fully understand your linked solution employing go/v-routines and SPSC channels for communication

The SPSC channels are used both for computed results passing as well as communication. The rest is basically description of efficient dataflow programming (see e.g. https://noflojs.org/ for visualization of a similar approach).

there are worker threads passing steal requests around when running out of work to do. A worker is allowed to retreat insofar as it can stop sending steal requests when it fails to receive work from its peers. What a worker should not do, however, is stop responding to messages, or else it will stall the requesting workers. So as a result, idle workers sit in a tight loop and constantly check for messages, burning CPU.

Interesting - on the first glance over the (preliminary) implementation my understanding was slightly different. Thanks for clarifying. Few questions to this principle.

How does this solution passes the computation results between workers?
How is scaling achieved (i.e. detecting the bottle neck worker and consequently spawning several new instances of this bottle neck worker)?
How is "preemptiveness" achieved (i.e. how does the work stealing approach guarantees, that a CPU core will not be occupied with just one worker instance for too long - assuming the occupying worker instance has a lot of work ahead)? If it's by having 1:1 relation between worker instances and os threads, then my question would be how does work stealing handle many thousands of worker instances efficiently?

One potential solution to this problem is to make sure that idle workers don't receive steal requests at all, allowing them to yield when backing off from scheduling.

Sounds reasonable and probably efficient. Why would you need a different solution?

Another potential solution, which I've started to explore recently, is to let steal requests flow as before, but have workers "steal" and handle steal requests on behalf of idle workers that rather back off than listen to messages.

That's an interesting idea - keep us posted what you came up with. One thing came to my mind - basically the workers handling steal requests on behalf of idle workers would take over the role of a scheduler in addition to their main job (or maybe I misunderstood). I actually don't know what an "idle" worker is - I thought, that work stealing works by spawning one-way workers and once they're done, they just discard themselves (incl. freeing memory) meaning that there is nothing like "idle" (unlike in the concept of dataflow programming I've linked above). If this is not the case, feel free to elaborate.

Interesting - on the first glance over the (preliminary) implementation my understanding was slightly different. Thanks for clarifying. Few questions to this principle.
1. How does this solution passes the computation results between workers?

There is a Future/Flowvar channel to send the computation result to the requester.

2. How is scaling achieved (i.e. detecting the bottle neck worker and consequently spawning several new instances of this bottle neck worker)?

In work-stealing you don't spawn new threads, you have a threadpool with threads already created and they have no work to do they either steal work in other worker queues (classic shared memory work stealing) or they request more work (channel/message-passing based work stealing that Picasso/@aprell thesis are doing).

3. How is "preemptiveness" achieved (i.e. how does the work stealing approach guarantees, that a CPU core will not be occupied with just one worker instance for too long - assuming the occupying worker instance has a lot of work ahead)? If it's by having 1:1 relation between worker instances and os threads, then my question would be how does work stealing handle many thousands of worker instances efficiently?

If the task is a loop, the scheduler is capable of splitting the loop if it's annotated as parallel. Otherwise it's up to the developer to make the tasks fine-grained enough, no parallel scheduler can schedule a parallel task without spawn/async/parallel_for annotation.

That's an interesting idea - keep us posted what you came up with. One thing came to my mind - basically the workers handling steal requests on behalf of idle workers would take over the role of a scheduler in addition to their main job (or maybe I misunderstood). I actually don't know what an "idle" worker is - I thought, that work stealing works by spawning one-way workers and once they're done, they just discard themselves (incl. freeing memory) meaning that there is nothing like "idle" (unlike in the concept of dataflow programming I've linked above). If this is not the case, feel free to elaborate.

An idle worker is a thread that has no work to do even after trying to steal or request more work.

Regarding the number of worker threads: If I understand your question correctly, this is something that has not been addressed in work-stealing runtimes. There is a fixed number of (long-running) worker threads that schedule tasks. A worker that backs off is not returned to the thread pool because it must be able to spring back into action quickly. The kind of irregular workloads I was looking at would make this a very difficult problem to solve efficiently, I suppose.

Workers are OS threads, yes. Workers load-balance tasks – function objects or closures (which are not supposed to block on I/O) – unlike workers in the Go runtime, which load-balance goroutines – full-blown user-level threads (please correct me if I'm wrong). Task-parallel computations can spawn millions of short-running tasks. This can make a tasking implementation based on user-level threads impractical.

Sounds reasonable and probably efficient. Why would you need a different solution?

One of my main motivations was to avoid shared state in favor of asynchronous messaging. Workers have no easy way of knowing what other workers are up to, except what can be derived from steal requests. So what sounds like a reasonably straightforward solution would require either shared state or global communication, which I'd like to avoid.

If I understand your question correctly, this is something that has not been addressed in work-stealing runtimes.

Thanks for clarification. From my understanding this also means, that in case I have a pool of 8 worker threads (and a tablet/smartphone with big.LITTLE CPU with 8 cores - 2 of which are the slow ones) and 64 tasks, each being a loop over quite computationally expensive stuff (thus not letting their work to get stolen by other worker thread during one iteration of the loop), then 8 out of those 64 tasks will block all the physical cores and not let others to do their work (not speaking about the issue, that 2 tasks out of those 8 will be way slower than the others due to running on the 2 slowest physical cores). Or is there some hidden (not influencable by programmer) mechanism which would step in (like a preemptive scheduler) and solve it? If not, how would a Nim programmer solve such issue in her code?

Workers are OS threads, yes. Workers load-balance tasks – function objects or closures (which are not supposed to block on I/O) – unlike workers in the Go runtime, which load-balance goroutines – full-blown user-level threads (please correct me if I'm wrong).

Goroutines are actually plain old closures, but due to (IMHO a bit weird) golang scoping rules, blindly accessing anything outside of the goroutine context happens through a reference and is thus unsafe (solution is very simple - write goroutines as kind of pure functions - i.e. passing everything a goroutine shall work with through goroutine arguments).

Task-parallel computations can spawn millions of short-running tasks. This can make a tasking implementation based on user-level threads impractical.

Here I'm lost a bit - CSP-like threads as in golang (aka user-level threads if I understood you correctly) don't store more state than work-stealing runtimes if I'm not mistaken. The amount of synchronization required is also about the same for most scenarios if I'm not mistaken. Why would it make tasking of millions of short-running tasks impractical then? Since Go 1.5 the scheduler for goroutines seems really highly efficient (at least on x64) and IMHO kind of approaches the maximum throughput one could get from tiny tasks among tens of physical CPU cores with different speeds.

One of my main motivations was to avoid shared state in favor of asynchronous messaging. Workers have no easy way of knowing what other workers are up to, except what can be derived from steal requests. So what sounds like a reasonably straightforward solution would require either shared state or global communication, which I'd like to avoid.

Now I understand. Does avoiding shared state also mean, that the scheduling behavior over the application lifetime would need to be fully predicted in advance in compile time? This could make deployments on different machines inefficient. Or would the asynchronous messaging involve also some master scheduler, which would though be consulted (way) less frequently than in the synchronous scenario, but still allow some kind of dynamic accommodation up to a degree?

Yes, this is the crux of explicit communication: workers need to handle messages. If a worker is busy with a long-running computation and "forgets" to handle steal requests, other workers may be stalled, depending on how urgent their steal requests are. (It is possible to send steal requests well ahead of actually running out of work.) Last I checked, interrupting workers worked surprisingly well – at least better than I thought it would – but simple polling is hard to beat in terms of overhead. A hybrid approach could be an interesting solution.

Goroutines are actually plain old closures (...)

Are you sure about that? From what I've read, I'd describe goroutines as "coroutines on steroids": goroutines yield implicitly, for instance, when reading from an empty channel, and are allowed to run in parallel with other goroutines, depending on the number of underlying worker threads (GOMAXPROCS). A goroutine can be suspended on worker A, migrate to worker B, and resume execution there. That's not possible with simple tasks, unless tasks are more like user-level threads.

I don't doubt the efficiency of Go's runtime. It's been a few years since I ran some benchmarks, and Go's runtime has likely improved, but I don't think that goroutines are lightweight enough to be a substitute for (compute-only) tasks in all cases. Goroutines enable concurrency first and foremost, and their scheduling reflects that.

Does avoiding shared state also mean that the scheduling behavior over the application lifetime would need to be fully predicted in advance in compile time?

I'm afraid I don't understand what you mean. Work stealing is a dynamic load balancing technique; it's all about moving tasks at runtime. Whether workers communicate and synchronize using shared memory or message passing is a design decision and implementation detail of the runtime system/library.

Note that task-parallel programs – programs that want to benefit from work stealing – are not constrained in their use of shared memory; they can use shared state however they like. It's just the scheduler that I want to be as independent as possible from relying on shared state, for the reasons @mratsim outlined above.

A hybrid approach could be an interesting solution.

Sounds intriguing, will give it a thought.

Are you sure about that? From what I've read, I'd describe goroutines as "coroutines on steroids": goroutines yield implicitly, for instance, when reading from an empty channel, and are allowed to run in parallel with other goroutines, depending on the number of underlying worker threads (GOMAXPROCS). A goroutine can be suspended on worker A, migrate to worker B, and resume execution there. That's not possible with simple tasks, unless tasks are more like user-level threads.

I don't doubt the efficiency of Go's runtime. It's been a few years since I ran some benchmarks, and Go's runtime has likely improved, but I don't think that goroutines are lightweight enough to be a substitute for (compute-only) tasks in all cases. Goroutines enable concurrency first and foremost, and their scheduling reflects that.

The material you've linked is a nice one. It basically says that go routines are closures (see relations between continuations, closures, and go routines), but with even smaller overhead than I personally thought (as it uses 1KByte fragment stacks instead of one memory page). The material also highlights, that go routines are both preemptively (nothing can get other go routines stuck for long time in one operating system thread) as well as concurrently scheduled (to gain efficiency). The interesting thing I didn't know about is, that they are apparently using work stealing among the operating system threads (at least it is called that way in their presentation and the visuals do strenghten this understanding), but there is no notion about the "burning CPU" issue :cry:.

Would be also interesting to see how much memory tasks consume in this Nim implementation (my first guess would be at least one memory page - i.e. actually several times more than go routines).

I'm afraid I don't understand what you mean. Work stealing is a dynamic load balancing technique; it's all about moving tasks at runtime.

What I meant was how to accommodate to changing "resource availability". I.e. the operating system switches to power saving mode by turning off 90% of CPU cores. Work stealing will be left with tens of operating system threads (aka "work stealing workers") and the efficiency ratio of the application would steeply drop due to this (as there will be more than an order of magnitude more context switches needed).

Another issue is with non-SMP systems (e.g. few super slow CPU cores and more super powerful CPU cores like in big.LITTLE) whereas work stealing seems to not have any (efficient) means how to "move a computationally intensive task" from a worker running such already started task on a super slow CPU core to a worker running on a super powerful CPU core.

Third issue (not related to work stealing in any way, but relevant to Nim multithreading runtime) I can see is, that plain work stealing doesn't provide any higher-level means to dynamically spawn additional instances of (reentrant) tasks in case some task slows down the computation dependency graph (and later kill superfluous instances of such tasks).

This everything seems to be needed to be known in advance (i.e. predicted) in compile time.

The material you've linked is a nice one. It basically says that go routines are closures (see relations between continuations, closures, and go routines), but with even smaller overhead than I personally thought (as it uses 1KByte fragment stacks instead of one memory page). The material also highlights, that go routines are both preemptively (nothing can get other go routines stuck for long time in one operating system thread) as well as concurrently scheduled (to gain efficiency). The interesting thing I didn't know about is, that they are apparently using work stealing among the operating system threads (at least it is called that way in their presentation and the visuals do strenghten this understanding), but there is no notion about the "burning CPU" issue 😢.

The burning CPU issue is that when a thread is waiting for work, it can:

spinlocks on an empty queue/channel until it finds work to steal/receives work. This burns CPU.
pthread_yield or the equivalent in Windows, which says to the OS, I have no work to do, use my time-slice for other threads
wait on a condition variable (which is a high level construct around yield)

Regarding closures, all threading frameworks are using them: Intel TBB, Apple Grand Central Dispatch, OpenMP... The main reason is that it makes for a much easier API for developers. Creating a thread always creates a thread local stack, which you can leave with default values or configure which Nim does: https://github.com/nim-lang/Nim/blob/v0.20.2/lib/system/threads.nim#L48-L55

Would be also interesting to see how much memory tasks consume in this Nim implementation (my first guess would be at least one memory page - i.e. actually several times more than go routines).

It consumes 192 bytes per Task

I'm afraid I don't understand what you mean. Work stealing is a dynamic load balancing technique; it's all about moving tasks at runtime.

What I meant was how to accommodate to changing "resource availability". I.e. the operating system switches to power saving mode by turning off 90% of CPU cores. Work stealing will be left with tens of operating system threads (aka "work stealing workers") and the efficiency ratio of the application would steeply drop due to this (as there will be more than an order of magnitude more context switches needed).

Another issue is with non-SMP systems (e.g. few super slow CPU cores and more super powerful CPU cores like in big.LITTLE) whereas work stealing seems to not have any (efficient) means how to "move a computationally intensive task" from a worker running such already started task on a super slow CPU core to a worker running on a super powerful CPU core.

Third issue (not related to work stealing in any way, but relevant to Nim multithreading runtime) I can see is, that plain work stealing doesn't provide any higher-level means to dynamically spawn additional instances of (reentrant) tasks in case some task slows down the computation dependency graph (and later kill superfluous instances of such tasks).

This everything seems to be needed to be known in advance (i.e. predicted) in compile time.

Nothing works at compile-time, work-stealing has been formally proven asymptotically efficient, i.e. there may be multiple algorithms that approaches perfect scheduling on an unknown amount of computing resources possibly varying overtime and work-stealing is one of them. (http://supertech.csail.mit.edu/papers/steal.pdf)

If there is an imbalance between worker speed, the fast workers will finish earlier and pick up the unstarted tasks of the slow workers.

If there is only one task, it's the job of the OS to affect the thread to a fast core (unless threads are pinned to cores).

Work-stealing can deal with heterogenerous architectures, this is an active area of research for CPU+GPU clusters where GPU are 100x more efficient at certain workloads than CPU, they still use work-stealing:

Asymmetry-aware work stealing: http://www.csl.cornell.edu/~moyang/pdfs/torng-aaws-isca2016.pdf
Heterogeneous workstealing resources (Kokkos, XKaapi, StarPU): https://github.com/numforge/laser/blob/master/research/runtime_threads_tasks_allocation_NUMA.md#heterogeneous-arch-scheduling

It consumes 192 bytes per Task

The 1KByte I referred to is the unique stack minimum size of a running goroutine. In case of your implementation this would be the stack size of the routine referred to by fn*: proc (param: pointer) {.nimcall.} # nimcall / closure?). What is the minimal stack size (I'm not referring to thread-local stack size as in here)? (the 192 bytes you're referring to are rather to be compared to the golang scheduler metadata and the size is more or less the same)

If there is an imbalance between worker speed, the fast workers will finish earlier and pick up the unstarted tasks of the slow workers.

I understand - you already pointed it out in your previous answer to preemptiveness.

If there is only one task, it's the job of the OS to affect the thread to a fast core (unless threads are pinned to cores).

Not exactly. Because as I said when the number of cores gets higher and some of them will be suspended (due to power saving or whatever), then there will be too many operating system threads ("workers" in work-stealing terminology) running on one CPU core and thus absolutely unnecessarily burning CPU on context switches and invalidating CPU caches due to context switches. Simple solution would be to cut down the number of workers, but that seems not implemented (maybe not easily possible) in the work-stealing implementations I've seen.

Work-stealing can deal with heterogenerous architectures, this is an active area of research for CPU+GPU clusters where GPU are 100x more efficient at certain workloads than CPU, they still use work-stealing: ...

Yep, I saw your materials and I'm impressed you gathered so many of them. Still, everything I saw was compile time decided (predicted) - i.e. heterogeneous, but not dynamic (i.e. cluster nodes of different performance came and left during runtime or as mentioned above, some CPU cores got powered off to save energy effectively leading to lower efficiency on the remaining cores due to significantly reduced efficiency due to way more frequent context switching and inefficient caching). This seems to be rather implementation-shortcomings than work-stealing issue as I don't see any obstacle in implementing it also for work-stealing (though it could make work-stealing slightly less efficient due to the need to synchronize regularly and decide on moving running tasks to a different worker and then canceling the worker with her operating system thread).

Not exactly. Because as I said when the number of cores gets higher and some of them will be suspended (due to power saving or whatever), then there will be too many operating system threads ("workers" in work-stealing terminology) running on one CPU core and thus absolutely unnecessarily burning CPU on context switches and invalidating CPU caches due to context switches. Simple solution would be to cut down the number of workers, but that seems not implemented (maybe not easily possible) in the work-stealing implementations I've seen.

I get what you mean, but how realistic is something like this in the presence of busy workers, or, slightly less drastic, having the operating system step in and slow down busy workers? It might in fact be more energy efficient to let workers finish their tasks as fast as possible and power down when there's nothing else to do. Anyway, I think you're right that dynamic thread pools are largely unexplored in the context of work stealing.

To summarize, a task, as used in this discussion, is a data structure representing a deferred function call. A task contains a function pointer, function arguments, and maybe other data, like a referencing environment or a channel for sending results. Running a task amounts to an indirect function call; no separate stack is set up. Implementing tasks like this is not without problems: worker stacks can grow very deep. Different implementations have been explored to solve the "cactus-stack problem". If you're interested in some background reading, two papers I recommend are by Lee et al. and Yang et al.

I get what you mean, but how realistic is something like this in the presence of busy workers, or, slightly less drastic, having the operating system step in and slow down busy workers? It might in fact be more energy efficient to let workers finish their tasks as fast as possible and power down when there's nothing else to do. Anyway, I think you're right that dynamic thread pools are largely unexplored in the context of work stealing.

How I got convinced, that it's actually a problem was by reading a question on a technical forum "how to program for a 256 core CPU with hyperthreading with the assumption, that the SW shall support live load-balancing based on current demand (e.g. black friday on e-shop or some sport match or whatever or really just someone disconnecting her tablet with 8 non-SMP cores from an LCD and walking away wanting to serve as much power as possible while working with the app further but in a "low-power" regime) and live migration as a response e.g. to hot-swap or simply to cluster load, maintenance needs (including live update which Nim has built-in support for), etc. We're just not used to think this way, because technology is still extremely constraining us, but we're kind of getting to the inflection point...

Different implementations have been explored to solve the "cactus-stack problem". If you're interested in some background reading, two papers I recommend are by Lee et al. and Yang et al.

Thanks for the suggested papers. Coroutines execution can be also implemented as a "cactus stack", but for whatever reason (didn't find any convincing yet) I could not find any such implementation.

Btw. in this Nim implementation I noticed there are bounded stacks, but didn't look into it further. Will follow Picasso to hear more about how this everything will evolve.

Btw. in this Nim implementation I noticed there are bounded stacks, but didn't look into it further.

Oh, the bounded stack is just an array that workers use to manage their steal requests. Nothing exciting, and unrelated to the worker stacks I mentioned.

I had no time for Picasso in September and October but I have more time now and could work on it the past weekend.

Status

While the original proof of concept was done, the examples were buried in the repo, now you can use this file directly at the root the repo as a template. Note that the keywords async, await, Future were from @aprell thesis and and will be changed to spawn, ^, Flowvar or the corresponding one that @Araq will present in his threading API.

The refactoring of the proof-of-concept happens in the picasso directory. I have refactored or copied most of the supporting data structures:

Channels (MPSC and SPSC stands for Multi-Producer Single-Consumer and Single-Producer Single-Consumer):
- MPSC Channels to receive steal requests (lock-based, bounded by the number_of_threads * max_steal_requests_per_thread)
- SPSC Channels to receive tasks stolen and Flowvar (atomics-based wait-free, only 1 slot)
Memory caching:
- object pools: the SPSC channels for tasks stolen act like a mailbox/address that are sent with each steal request (i.e. "Please send spare tasks you have to this channel) and can be reused without reallocating.
- intrusive stacks: the tasks need to be cached to not overwhelm the memory allocator (Fibonacci would allocate 2^n tasks otherwise). They are already used in an intrusive deque so we can reuse those fields.
work-stealing deques
barriers
metrics: perf profilers and timers to compare to the PoC.

What's next?

Next step is the runtime logic which need a heavy lifting to ease maintenance, ideally:

grouping thread-local variables in a context,
cleaning up the partitioning scheme,
separating domains:
- victims,
- thiefs,
- lazy task splitting,
- work-sharing,
- public API (init, barrier, exit, manual polling)
- termination detection

Want to read on low-level details

If you want to dive into the rabbit holes of low-level details, I've been adding documentation, thoughts, papers on what I implement, why, how it will be used in companion Markdown files.

Original research

Memory and Speed efficient ring-buffer design

I spent a lot of time tuning channels and queues backed by ring-buffers as those are used throughout the codebase, one thing I was very unhappy with space inefficiency or speed. Especially given that I want to use some of those on the stack to improve locality and stack space is more limited.

As far as I know all queues and ring-buffer implementations use either an extra slot to differentiate if a ring buffer is empty or full (due to ring-buffer operating on modulo semantics) or they require power of 2 capacity to have cheap modulo via and masking but leading to lots of wastage.

I have implemented a ring-buffer that does not waste space and also does not use modulo at all as non-power of 2 mod is at least 10x slower to 30x slower than other intrinsics like add, mul, shifts, and, xor, ...

Threadsafe object pool

The current thread-local object pool has a very simple use-case: a low per-thread fixed number of task channels are being reused over and over.

However I am becoming increasingly convinced that I need a multithreaded object pool:

Tasks are cached in a per thread an intrusive stack. Those tasks are never returned to the OS (what if you start a computation that spawns a high number of tasks say a tree algorithm, and then only need a low-number of task for hours or days like an matrix-based algorithm).
Futures/Flowvar allocated on the heap incur at least 3x more framework overhead than futures allocated on the stack via alloca on a fibonacci example. Not that this overhead is completely invisible for tasks that require more than a couple nanoseconds to complete and only visible on workload like naive fibonacci though there might be tree algorithms that degenerates to nanosecond tasks. The main concerns are:
- with stack allocation, we can pass the future to a child proc, but we can't return it.
- with heap allocation, there are theoretical bounds on the number of cache misses is that on long-running process it will lead to memory fragmentation in the OS. A threadsafe object pool would give:
- stack allocation-like performance
- memory locality / reduction of cache misses
- reduced fragmentation
- able to return the future

A sketch design is available here with a thread-local cache based on a circular buffer and a global heap that will be likely inspired:

by Mimalloc (a.k.a. @Araq secret's GC) on free list sharding and eager page reset
and Snmalloc because we want channels all the way down to memory management :wink: when the destroying thread wants to sends the object back to the thread that created it.

What's missing in Nim

For now 3 things come to mind.

Local barriers API

Threadpools define sync but ideally I would like an API that mimics Posix or Windows which allow waiting on a well-defined object and not in a global manner.

# Posix
proc pthread_barrier_wait*(
        barrier: var PthreadBarrier
      ): Errno {.header: "<pthread.h>".}

# Windows
BOOL EnterSynchronizationBarrier(
  LPSYNCHRONIZATION_BARRIER lpBarrier,
  DWORD                     dwFlags
);

Capturing the environment like a closure

To provide a syntax similar to the parallel block or OpenMP #pragma omp parallel for, I would need to be able to analyze the body of a macro and for each variable that is non-local to that macro body, capture its value from the environment and package it with the Task object. Note that Nim closures won't work due to garbage collection, also I would need some extra logic to type erase proc {.nimcall.} and proc {.closure.} (probably via {.union.} types).

There is a hidden feature-gated owner macro that would be a solution (https://github.com/nim-lang/Nim/pull/8253) but it's not recommended for use.

Per-fields object alignment

For efficient multithreading all written shared variables in shared data structures must be on different cache line from the rest. The most efficient way is to request 64 or 128 bytes alignment on a per field basis. The poor man way is to just pad by 64, but that leads more memory use than necessary and aligned access may provide significant optimization opportunities, if only for memcpy. And lastly, it's not elegant at all and Nim is about efficiency, expressiveness and elegance.

So it's been about 6 weeks since the last update.

The first release of Weave should happen today, as soon as win against:

OSX and/or TLS emulation bug (AFAIK this was added for iPhones and we shouldn't have this by default on Mac)
OSX missing aligned_alloc and so not implementing the C11 spec
some potential deadlock/livelock in a lockless channel (yep that happened, twice)

Anyway here was the changes since last time:

The scheduler and runtime have been implemented (i.e. everything that was in what's next)
Weave uses the new {.align:128.} pragma
The threadsafe memory pool has been implemented. It has been paired with a lookaside buffer as some workloads which created trillions of tasks at the start were seeing significant performance degradation (like 20x). Together, the same workloads have seen a 20% per improvement. The excellent news is that memory can be released to the OS on long-running processes. The memory pool is based mostly on Microsoft's Mimalloc (the new kid on the block) and Snmalloc (a message-passing based allocator) but greatly simplified as I only have to manage a single block size. The memory pool provides an allocation "heartbeat" that the lookaside buffers hooks into to trigger expensive memory management that must be amortized on many allocations. This also seems to solve the decades old Cactus Stack issue without any assembly so maybe there is paper potential here.
The API seems solid:
- init(Weave), exit(Weave) to start and stop the runtime. Forgetting this will give you nil pointer exceptions on spawn.
- spawn fnCall(args) which spawns a function that may run on another thread and gives you an awaitable Flowvar handle.
- sync(Flowvar) will await a Flowvar and block until you receive a result.
- sync(Weave) is a global barrier for the main thread on the main task. Allowing nestable barriers for any thread is work-in-progress.
- parallelFor, parallelForStrided, parallelForStaged, parallelForStagedtrided are described above and in the experimental section.
- loadBalance(Weave) gives the runtime the opportunity to distribute work. Insert this within long computation as due to Weave design, it's busy workers hat are also in charge of load balancing. This is done automatically when using parallelFor.
- isSpawned allows you to build speculative algorithm where a thread is spawned only if certain conditions are valid. See the nqueens benchmark for an example.
- getThreadId returns a unique thread ID. The thread ID is in the range 0 ..< number of threads.

10 benchmarks have been written with the following 8 showing excellent performance either equivalent or significantly better than industry-grade runtime like Intel TBB or GCC, Clang and Intel OpenMP:

Name	Parallelism	Notable for stressing	Origin
Black & Scholes Option Pricing (Finance)	Data parallelism		PARSEC (Princeton Application Repository for Shared-Memory Computers)
DFS (Depth-First Search)	Task Parallelism	Scheduler Overhead	Staccato
Fibonacci	Task Parallelism	Scheduler Overhead	Cilk
Heat diffusion (Stencil / Jacobi-iteration - Cache-Oblivious)	Task Parallelism		Cilk
Matrix Multiplication (Cache-Oblivious)	Task Parallelism		Cilk
Matrix Transposition	Nested Data Parallelism	Nested loop	Laser
Nqueens	Task Parallelism	Speculative/Conditional parallelism	Cilk
SPC (Single Task Producer)	Task Parallelism	Load Balancing	Tasking 2.0 (A. Prell Thesis)

The 2 remaining benchmarks are reductions where Weave is significantly slower than sequential.

A CI with Azure Pipelines and Travis has been put in place:
- Travis covers Linux on ARM64 with 32 cores
- Azure covers Linux on 32-bit and 64-bit and C++ compilation.
- MacOS is also tested but allowed to fail (which it does for a very strange reason, it would need direct debugging on MacOS)
Windows supports should be easy, it only requires Synchronization Barrier, the memory subsystem should be compatible with Windows (including aligned allocation)
The bitsets for victim selection where removed in favor of sparse bitsets together with optimization in passing steal requests around (by pointer/ownership instead of by copy) this brought large performance improvement and scaling to beyond 64 cores (bitsets beyond uint64 were costly to copy)

Experimental

A backoff scheme has been implemented to allow workers without tasks to sleep. It's not activated by default though as it's prone to deadlocking the runtime. (A worker sleep and others wait for it indefinitely)
Lazy Flowvar allocations seem to work correctly and is part of the test suite. It optimistically allocates flowvar on the stack via alloca and only allocate them on the heap if they escape their thread. This brings 2x~3x perf improvements on Fibonacci and Depth First Search and very-fine grained tasks in general be removing the allocator bottleneck. However it's currently limited to Flowvar of at most the size of a int/pointer/word.

Some findings

Nim allocator (createShared/allocShared) is extremely slow in a multithreaded environment due to locking.

Some utilities that may be "librarized"

I wrote iterators for implicit binary array trees (i.e. root with ID 0, child 1 and 2, and grandchildren at 2n+1 and 2n+2). The depth first iterator is stackless and requires O(1) space, O(n + log(n)) time. The breadth-first search iterator is queueless and requires O(1) space and O(n) time. https://github.com/mratsim/weave/blob/v0.1.0/weave/datatypes/binary_worker_trees.nim#L91-L161
The lookaside buffer and the threadsafe memory-pool are concise and may be helpful for game developers. They might want to tune the block size and not allowing releasing memory blocks to the OS. https://github.com/mratsim/weave/tree/v0.1.0/weave/memory

Future

My initial goal with Picasso/Weave was to use it as the backend for Laser and in particular allow seamless use of a task API with spawn/sync and data parallelism with nested parallel loop. This has been achieved.

However I still cannot parallelize my matrix multiplication kernel like how I use OpenMP (after finding workaround for OpenMP lack of nested parallelism). This is due to the need of expressing fine-grained dependencies in of parallel nested loops, for example "await until iteration range 0..20 from the previous loop has been processed". An alternative to that is nested barriers.

However I feel like adding and debugging new features to the runtime is becoming difficult due to the control flow which add complexity on top of potential memory and threading bugs. Hence to ease future developments, help visualize the inner workings and maybe allow easier model checking and formal verification, I think it's time for a 3rd rewrite. This time by transforming the core into a state machine. I developed Synthesis, a finite-state-machine generator, yesterday with that purpose.

On the short term, the runtime is usable (not production ready) on Linux and probably OSX so it's being released today.

2 new releases in the past 15 days: https://github.com/mratsim/weave/releases

v0.2.0

Brings Windows support
Formally verified the backoff mechanism, it is now enabled by default (it had several deadlocks and livelocks in the past)

v0.3.0

Weave can now compile with Microsoft Visual Studio Compiler, in C++ mode (there are unsupported atomics with VCC in C mode)
for-loops are now awaitable, note that only the spawning threads will be blocked other will continue on other tasks. Being blocked means that the thread "stay still" in the code but it will still complete tasks pending in its queue while waiting for the blocking one to be resolved (by itself or another worker).
Research flags for stealing early or thief targeting have been added.
Weave now uses Synthesis state machine in several places. I am still unsure on the readability benefits (maybe I'm too familiar with the codebase now) but if visual graphs/description were added to Synthesis that would definitely tip the scale.
The memory pool now has the exact same API has malloc/free (previously freeing required specifying the caller threadID). The scheme to retrieve an unique thread identifier without expensive syscalls is probably worthwhile in Nim: https://github.com/mratsim/weave/blob/v0.3.0/weave/memory/thread_id.nim#L8-L66
The memory pool and lookaside list have been annotated for use with LLVM AddressSanitizer, a memory accesses debugger tool. There are warnings, I didn't check yet if they are spurious or not. Some are due to Nim internals.
Significant performance bugs and improvements were identified on data parallelism (for example not splitting parallel loops in some cases). Weave is now competitive with OpenMP for coarse-grain loops (with a large amount of work) and hopefully doesn't suffer from OpenMP issues on loops that are too small and shouldn't be parallelized as Weave does parallelization lazily on a as-needed basis (to be benchmarked, see https://github.com/zy97140/omp-benchmark-for-pytorch)
The highlight is that on my 18-cores machine a pure Nim matrix multiplication without assembly without OpenMP is now much faster than OpenBLAS and competitive with Intel MKL and Intel MKL-DNN which are 90% assembly, processor-tuned and the result of decades of dedicated development. Actually if we just look at parallelization efficiency (time 18 cores / time 1 core):
- Weave with Backoff achieves the same speedup as Intel MKL + Intel OpenMP at 15.0~15.5x speedup
- Weave is much better than Intel MKL + GCC OpenMP at 14x speedup
- Weave without Backoff achieves a speedup of 16.9x
- Note that this is the only benchmark among the 12 I have where the backoff degrades performance.

One thing of note: measuring performance on a busy system is approximative at best, you need a lot of runs to get a ballpark figure. Furthermore for multithreading runtime, workers often "yield" or "sleep" when they fail to steal work. But in that case, the OS might give the timeslice to other processes (and not to other thread in the runtime). If a process like nimsuggest hogs a core at 100% it will get a lot of those yield and sleep timeslices even though your other 17 threads would have made useful progress. The result is that while nimsuggest is stuck at 100% (or any other application), Weave gets worse than sequential performance and I don't think I can do anything about it.

This RFC is stale because it has been open for 1095 days with no activity. Contribute a fix or comment on the issue, or it will be closed in 7 days.

nim-lang / RFCs