lewissbaker / cppcoro

A library of C++ coroutine abstractions for the coroutines TS
MIT License
3.42k stars 468 forks source link

Allow using a custom allocator for allocating the coroutine frame memory #72

Open lewissbaker opened 6 years ago

lewissbaker commented 6 years ago

Allow callers to specify a custom allocator to use for allocating the coroutine frame.

General usage guide:

struct MyAllocator
{
    MyAllocator();
    MyAllocator(const MyAllocator&);
    ~MyAllocator();
    void* allocate(std::size_t sz);
    void free(void* p);
private:
    void* m_state;
};

task<int> bar(std::allocator_arg_t, MyAllocator allocator)
{
    co_return 2;
}

task<int> foo()
{
    MyAllocator a;
    int result = co_await bar(std::allocator_arg, a);
    co_return result + 1;
}

See example implementation here, first presented in cpplang slack chat 2018/02/27: https://godbolt.org/g/gRbDYf

lewissbaker commented 6 years ago

CC: @GorNishanov

burnpanck commented 5 years ago

Please excuse me shamelessly asking this only partially-related question here: In a fork of cppcoro (sorry, not published yet - still far too experimental and volatile), I have been toying around with coroutines especially for small-memory 32bit embedded systems. Of course, allocation in general (heap or otherwise) is a big concern there. HALO seems to fail for me way too often, so I attempted to implement the make_on_stack<n>([=](auto)->task<>{...}) in-your-face approach from Gor's p1365r0 paper. Here, make_on_stack creates a task-wrapper that includes space for the lambda and the promise; that space is then passed to the promise's operator new through that magic (auto) argument of the lambda, where a custom allocator is passed. One thing that bugs me, is that the coroutines frames are huge (I often see something like >128 bytes for a simple double_it task on my 64bit host with clang 8). While there are probably opportunities for improvement, I find it a bit disappointing that this approach will leave a copy of the allocator alive in the coroutine frame for the whole duration of the coroutine, just because it appears in the argument list. Especially because, if it were needed in the deallocation, we would still need an additional copy, as you well explained in your blog. Furthermore, the promise also needs to keep a copy of the this pointer pointing to that lambda, because that's where all the real coroutine arguments live. Now, it seems that coroutine_traits does receive the type of the this pointer, but operator new doesn't receive the actual pointer. If this would also be passed to operator new, we could invent a protocol, where that object does also expose the allocator. So here it comes: Since it seems that the coroutine specification is not yet completely finished, where would be a good place to bring this point to the attention of the design-team of the specification?

GorNishanov commented 5 years ago

Hi Burnpanck:

Are you using clang or MSVC compiler? If clang, what version of clang? Are you compiling debug or retail?

In the currently available compilers, computation of what goes into the coroutine frame is not optimized for space yet. Information of lifetime of variables are not used and no stack packing (putting some variables on top of others based on the lifetime if they don't intersect) is not performed.

If you are not seeing a pointer to "this" passed to on operator new, I would need to know what exact compiler are you using. It should be passed in both. Note that it will pass a reference to a this object. Not a pointer (I think earlier version of MSVC were passing the pointer, until the issue was settled in the coroutines TS wording).

lewissbaker commented 5 years ago

Note that P1681R0 has a discussion of alternative ideas for allocator customisation that should be considered.

@burnpanck Consider the following task<float>:

template<typename Allocator>
task<float> double_it(allocator_arg_t, Allocator alloc, task<float> t) {
  co_return 2.0f * co_await std::move(t);
}

On x64 a typical coroutine frame for a 'task' with customised alloctor in Clang will contain:

By my calculation that's about 88 bytes minimum before you add any local variables.

Depending on how task::operator co_await() is implemented the coroutine may need to store an additional awaiter object which also contains a copy of the coroutine_handle (another 8 bytes).

The compiler is also free to store additional state (spilled registers, etc.).

MSVC has been a bit more eager to stash copies of registers across suspend points which could easily account for an extra 32 bytes.

Clang is a bit better, but still stores more local-variable state in the coroutine frame than it should.

Avoiding the extra allocator copy would only net you an 8 byte win here. If the allocator was stateless and was made a template argument to the task type then we could avoid storing the allocator altogether (in params and in allocation) which would save 24 bytes.

burnpanck commented 5 years ago

Sorry @GorNishanov, I had completely missed your message. I had done the host-based tests using Clang 8, the bare-metal ARM embedded compilation is done by a slightly patched version of Clang 9rc1. I have to admit that it is already some time since I had read the TS (I might possibly not have seen the most recent version). It does seem though that my operator new only gets two arguments, the size and the allocator which was supplied via the only argument of the lambda.

In my case, I have a coroutine as follows:

auto double_it(int x){
  return make_on_stack<256>([=](auto)->task<int>{
    co_return 2*x;
  });
}

make_on_stack then intializes a task_on_stack<256,_some_lambda_type,int> instance that takes the lambda, evaluates it with an allocator allocating within that object's scratch-space, and moves the task_promise into it's own storage. The operator new that calls the allocator has a signature as follows:

  template <typename... Args>
  void *operator new(std::size_t sz, Args&&... args);

However, in the above example, the args only contains the allocator, not the lambda object. I will therefore consider filing a bug-report with clang, if I can find the corresponding wording in the Coroutine TS.

burnpanck commented 5 years ago

@lewissbaker thanks for your detailed answer! I wasn't aware of P1681r0, thanks for the hint. My understanding from a quick reading is that it is about standard-library elements allowing allocator implementations that do not require a parallel reimplementation of the coroutine types. I do like your proposal there, but it does not solve my case for avoiding storing the allocator state in the coroutine frame, when it is only needed during the construction of the frame.

You are of course correct that I could possibly only win one pointer here. Even worse, I'm not even sure that my plan is actually feasible: I have to move the lambda into that task_on_stack object to keep it alive through the coroutine execution, but when I call the lambda to create the coroutine frame, it would still only receive a reference to the lambda, but not the task_on_stack. I would have to assume that the lambda indeed lives within such an object and somehow adjust the pointer to get one pointing to the beginning of the container.

Nonetheless, I would have hoped to get a bit better trimmed coroutine frames than what you described: The allocator for make_on_stack could be stateless (memory does not need to be released), saving us two pointers and padding. On my RAM constrained system, I would further have preferred an implementation using just one pointer to a vtable-like structure, holding the resume and destroy function pointers (possibly even the resumption state) in constant storage. Finally, in the make_on_stack case, the allocator is not needed inside the coroutine, so it wouldn't have to be stored. I would have loved to see a facility in the coroutine transformation protocol that would allow me to pass the allocator without ever appearing in the argument list, there is hope that compiler optimisations eventually learn to recognise that the allocator is not needed and therefore doesn't have to be preserved. Thus, I could imagine a coroutine frame overhead (i.e. not counting the real parameters or temporary state over suspension points) for coroutines on the stack of as little as two pointers (resume_destroy_and_state and the continuation handle) plus the exception-pointer (if exceptions are enabled at all - we're in embedded) and the return value. Well, I'm dreaming here.