Reduce zero-init overhead

To support destructible and sinkable types, in particular atomic refcounted types, tasks must zero-init their data buffer. This was introduced in #144 to properly support the refcounted FlowEvent.

However, this adds a significant 17% overhead on very short-running tasks, as measured on Fibonacci(40).

Note: significant is relative. Fibonacci(40) spawns hundreds of millions of tasks (the naive call tree grows roughly as 1.618^40, which is well below 2^40), and each task does less work than the zero-initialization itself.

The change: https://github.com/mratsim/weave/pull/144/files#diff-c5d52e34ee454756d2c729faec306b62L113

 proc newTaskFromCache*(): Task =
-  result = workerContext.taskCache.pop()
+  result = workerContext.taskCache.pop0()
   if result.isNil:
-    result = myMemPool().borrow(deref(Task))
+    result = myMemPool().borrow0(deref(Task))
-  # Zeroing is expensive, it's 96 bytes
-  # result.fn = nil # Always overwritten
-  # result.parent = nil # Always overwritten
-  # result.scopedBarrier = nil # Always overwritten
-  result.prev = nil
-  result.next = nil
-  result.start = 0
-  result.cur = 0
-  result.stop = 0
-  result.stride = 0
-  result.futures = nil
-  result.isLoop = false
-  result.hasFuture = false
+  # The task must be fully zero-ed including the data buffer
+  # otherwise datatypes that use custom destructors
+  # and that rely on "myPointer.isNil" to return early
+  # may read recycled garbage data.
+  # "FlowEvent" is such an example
+
+  # TODO: The perf cost to the following is 17% as measured on fib(40)
+
+  # # Zeroing is expensive, it's 96 bytes
+  # # result.fn = nil # Always overwritten
+  # # result.parent = nil # Always overwritten
+  # # result.scopedBarrier = nil # Always overwritten
+  # result.prev = nil
+  # result.next = nil
+  # result.start = 0
+  # result.cur = 0
+  # result.stop = 0
+  # result.stride = 0
+  # result.futures = nil
+  # result.isLoop = false
+  # result.hasFuture = false

The simplest optimization would be to zero-init only the part of the data buffer that will be overwritten by the task's arguments. An alternative would be to zero-init the buffer only for non-trivial types, as detected by supportsCopyMem. A third possibility would be to do both.
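A minimal sketch of the third option (both ideas combined), to make the proposal concrete. This is not Weave's actual code: `TaskSketch`, `TaskDataSize`, `prepareData`, `recycled`, `Plain` and `Counted` are hypothetical names invented for illustration, and the 96-byte buffer size is an assumption. Only `supportsCopyMem` (from `std/typetraits`) and `zeroMem` are real Nim APIs.

```nim
import std/typetraits

const TaskDataSize = 96 # assumed buffer size, for illustration only

type
  TaskSketch = object
    ## Hypothetical stand-in for a task: only the data buffer is modeled.
    data: array[TaskDataSize, byte]
  Plain = object
    ## Trivial type: no destructor, safe to copy with copyMem.
    a, b: int
  Counted = object
    ## Non-trivial type: holds a managed reference.
    payload: ref int

proc prepareData(task: var TaskSketch, T: typedesc) =
  ## Zero only the bytes that a value of type T will occupy,
  ## and only when T is not trivially copyable.
  static: doAssert sizeof(T) <= TaskDataSize
  when not supportsCopyMem(T):
    zeroMem(addr task.data, sizeof(T))

proc recycled(): TaskSketch =
  ## Simulates a recycled task whose buffer still holds garbage.
  for b in result.data.mitems:
    b = 0xFF

var plainTask = recycled()
plainTask.prepareData(Plain)     # trivial type: buffer left untouched

var countedTask = recycled()
countedTask.prepareData(Counted) # non-trivial type: first sizeof(Counted) bytes zeroed
```

Because the `when` branch is resolved at compile time, trivial argument types such as the integers passed to fib would pay no zeroing cost in this sketch, while types like FlowEvent would still start from zeroed memory.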

mratsim / weave

Reduce zero-init overhead #145