std/tasks: "small functions" optimization

mratsim commented 2 years ago

Currently std/tasks is implemented the following way:

type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}

Any function call with an argument will trigger an allocation. This is something that runtimes that use std/task can't avoid. Given that runtimes also have to allocate a future/flowvar, scheduling a task always incur allocation overhead even if there are just 1 or 2 arguments that needs to be passed.[1]

GCC and Clang have an optimization where std:function can store up to 8 (?) bytes intrusive to the data structure and uses the heap otherwise.

It would significantly reduce memory overhead and memory fragmentation for long-running tasking runtimes if this was implemented:

Tasks are most useful in a multithreaded context and they are moved around threads. Reclaiming memory allocated from another thread is non-trivial for allocators and requires synchronization and advanced techniques to avoid contention like extreme freelist sharding in the Mimalloc allocator
Tasks may be quickly allocated and deallocated leading to fragmentation especially if args are small.
As the number of cores grows, fine-grained tasks become more important to exploit multiple cores, leading to more std/tasks overhead

--

[1]: Runtime actually don't have to allocate a future/flowvar if those don't escape, they can use the caller stackframe if the compiler helps with heap allocation elision: (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2477r0.html#design-and-example, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html)

Proposal

For a to be decided size N, change the type to:

type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[N, byte]

args would point to either the buffer or the heap, change would be minimal and would mainly require changing this:

let scratchObjPtrType = quote do:
  cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))

to something in that vein.

let scratchObjPtrType = quote do:
  when sizeof(`scratchObjType`) <= N:
    cast[ptr `scratchObjType`](task.buffer.addr)
  else:
    cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))

at https://github.com/nim-lang/Nim/blob/d102b2f54c41f19a0545f28109d0a93550b5d886/lib/std/tasks.nim#L187

Discussion of the buffer size

The current task object has a size of 24 bytes.

Buffer of size 8: this rounds up the total size to 32, a nice power of 2. However the majority of function need more than 1 argument.
Buffer of size 16 and 24: many functions have the form proc(a, b): result or proc(r, a, b), this would cover those efficiently.
Buffer of size 40:
- this rounds up the total size to 64, a nice power of 2.
- 64 bytes is the cache-line size on most CPUs (when it's not 128), if a task is aligned, it can be loaded from L1 cache in a single cache read.
- This buffer size allows to store up to 5 ints/pointer arguments or 2 openarrays + 1 pointer/int (for example a callback when doing parallel map or a pointer to a result location) Working on large openarrays is often a very compelling reason to divide work in tasks.
- x86-64 can pass up to 6 arguments through registers and ARM64 up to 8 so there won't be stack spilling with that buffer size.
- Also due to false sharing a runtime that stores tasks contiguously for example in a shared task queue must make sure that they each start 64 bytes (or even 128 bytes due to CPU prefetching 2 cache-lines at a time). Even if tasks were smaller, space would be unused.

Araq commented 2 years ago

This will probably only work with ARC/ORC but then so be it.

mratsim commented 2 years ago

I think std/tasks already requires either non-thread-local GC (so Boehm would work as well) or value types.

Unless you use tasks in a non-multithreading context, but libraries that might want to do that (asyncdispatch/chronos) use closure iterators.

ringabout commented 2 years ago

In the current implementation, task is allocated in the stack and is cleaned up before toTask so that the address of task.buffer is lost.

We should probably pass task.buffer in invoke procs, which requires to change the signature of invoke procs (breaking changes)

proc invoke*(task: var Task) {.inline, gcsafe.} =
  ## Invokes the `task`.
  assert task.callback != nil
  if task.args == nil and task.destroy != nil:
    task.callback(addr task.argsBuffer)
  else:
    task.callback(task.args)

(though Task is always passed by reference internally)

mratsim commented 2 years ago

See discussion on Discord https://discord.com/channels/371759389889003530/768367394547957761/933307741462753300

The changes indeed work https://github.com/status-im/nim-taskpools/commit/76a8593a80a756814205952d331c55883f6f7d02 but only if the spawner doesn't exit its scope before returning.

AKA, structured parallelism: spec https://github.com/nim-lang/RFCs/issues/347#structured-parallelism-optional

Structured parallelism [OPTIONAL]

Structured parallelism ensures that all tasks and their descendants created within a scope are completed before exiting that scope.

template syncScope(ex: MyExecutor, scope: untyped): untyped template syncScope(scope: untyped): untyped

References:

X10 Parallel Programming Language (2004), the Async-Finish construct

Habanero multithreading runtime for Java and Scala, the Async-Finish construct

Habanero C, the Async-Finish construct

More recently, concurrency frameworks also converged to similar "structured concurrency" (https://en.wikipedia.org/wiki/Structured_concurrency).

One way to make it work is in =move: if the task being moved is pointing to its argsBuffer, change to point to the new argsbuffer

proc `=sink`(dst: var Task, src: Task) =
  `=destroy`(dst)
  dst.callback = src.callback
  dst.destroy = src.destroy
  if src.args == cast[pointer](src.argsBuffer.addr):
    # The arguments are stored inline the task
    dst.argsBuffer = src.argsBuffer
    dst.args = cast[pointer](dst.argsBuffer.addr)
  else:
    # The arguments are stored on the heap
    dst.args = src.args

This also implicitly assumes that Tasks are {.byref.} types.

ringabout commented 2 years ago

block:
  var num = 0
  proc hello(a: int) = inc num, a

  let b = toTask hello(13)
  b.invoke()
  echo num
  assert num == 13

doesn't work with POC

0
Error: execution of an external program failed

It generates c code like this:

task is zero memoryed after toTask is called. Then task.argsBuffer become all zeros And b.args still points to a zeros array (task.argsBuffer) that seems why echo num gives 0

nim-lang / RFCs

std/tasks: "small functions" optimization #443

Proposal

Discussion of the buffer size