nim-lang / RFCs

A repository for your Nim proposals.
135 stars 26 forks source link

std/tasks: "small functions" optimization #443

Open mratsim opened 2 years ago

mratsim commented 2 years ago

Currently std/tasks is implemented the following way:

type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}

Any function call with an argument will trigger an allocation. This is something that runtimes that use std/task can't avoid. Given that runtimes also have to allocate a future/flowvar, scheduling a task always incur allocation overhead even if there are just 1 or 2 arguments that needs to be passed.[1]

GCC and Clang have an optimization where std:function can store up to 8 (?) bytes intrusive to the data structure and uses the heap otherwise.

It would significantly reduce memory overhead and memory fragmentation for long-running tasking runtimes if this was implemented:

--

[1]: Runtime actually don't have to allocate a future/flowvar if those don't escape, they can use the caller stackframe if the compiler helps with heap allocation elision: (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p2477r0.html#design-and-example, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0981r0.html)

Proposal

For a to be decided size N, change the type to:

type
  Task* = object ## `Task` contains the callback and its arguments.
    callback: proc (args: pointer) {.nimcall, gcsafe.}
    args: pointer
    destroy: proc (args: pointer) {.nimcall, gcsafe.}
    argsBuffer: array[N, byte]

args would point to either the buffer or the heap, change would be minimal and would mainly require changing this:

let scratchObjPtrType = quote do:
  cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))

to something in that vein.

let scratchObjPtrType = quote do:
  when sizeof(`scratchObjType`) <= N:
    cast[ptr `scratchObjType`](task.buffer.addr)
  else:
    cast[ptr `scratchObjType`](c_calloc(csize_t 1, csize_t sizeof(`scratchObjType`)))

at https://github.com/nim-lang/Nim/blob/d102b2f54c41f19a0545f28109d0a93550b5d886/lib/std/tasks.nim#L187

Discussion of the buffer size

The current task object has a size of 24 bytes.

Araq commented 2 years ago

This will probably only work with ARC/ORC but then so be it.

mratsim commented 2 years ago

I think std/tasks already requires either non-thread-local GC (so Boehm would work as well) or value types.

Unless you use tasks in a non-multithreading context, but libraries that might want to do that (asyncdispatch/chronos) use closure iterators.

ringabout commented 2 years ago

In the current implementation, task is allocated in the stack and is cleaned up before toTask so that the address of task.buffer is lost.

We should probably pass task.buffer in invoke procs, which requires to change the signature of invoke procs (breaking changes)

proc invoke*(task: var Task) {.inline, gcsafe.} =
  ## Invokes the `task`.
  assert task.callback != nil
  if task.args == nil and task.destroy != nil:
    task.callback(addr task.argsBuffer)
  else:
    task.callback(task.args)

(though Task is always passed by reference internally)

mratsim commented 2 years ago

See discussion on Discord https://discord.com/channels/371759389889003530/768367394547957761/933307741462753300

The changes indeed work https://github.com/status-im/nim-taskpools/commit/76a8593a80a756814205952d331c55883f6f7d02 but only if the spawner doesn't exit its scope before returning.

AKA, structured parallelism: spec https://github.com/nim-lang/RFCs/issues/347#structured-parallelism-optional

Structured parallelism [OPTIONAL]

Structured parallelism ensures that all tasks and their descendants created within a scope are completed before exiting that scope.

template syncScope(ex: MyExecutor, scope: untyped): untyped template syncScope(scope: untyped): untyped

References:

  • X10 Parallel Programming Language (2004), the Async-Finish construct
  • Habanero multithreading runtime for Java and Scala, the Async-Finish construct
  • Habanero C, the Async-Finish construct

More recently, concurrency frameworks also converged to similar "structured concurrency" (https://en.wikipedia.org/wiki/Structured_concurrency).

One way to make it work is in =move: if the task being moved is pointing to its argsBuffer, change to point to the new argsbuffer

proc `=sink`(dst: var Task, src: Task) =
  `=destroy`(dst)
  dst.callback = src.callback
  dst.destroy = src.destroy
  if src.args == cast[pointer](src.argsBuffer.addr):
    # The arguments are stored inline the task
    dst.argsBuffer = src.argsBuffer
    dst.args = cast[pointer](dst.argsBuffer.addr)
  else:
    # The arguments are stored on the heap
    dst.args = src.args

This also implicitly assumes that Tasks are {.byref.} types.

ringabout commented 2 years ago
block:
  var num = 0
  proc hello(a: int) = inc num, a

  let b = toTask hello(13)
  b.invoke()
  echo num
  assert num == 13

doesn't work with POC

0
Error: execution of an external program failed

It generates c code like this:

image

task is zero memoryed after toTask is called. Then task.argsBuffer become all zeros And b.args still points to a zeros array (task.argsBuffer) that seems why echo num gives 0