Open lewissbaker opened 6 years ago
CC: @GorNishanov
Please excuse me shamelessly asking this only partially-related question here:
In a fork of cppcoro (sorry, not published yet - still far too experimental and volatile), I have been toying around with coroutines especially for small-memory 32bit embedded systems. Of course, allocation in general (heap or otherwise) is a big concern there. HALO seems to fail for me way too often, so I attempted to implement the make_on_stack<n>([=](auto)->task<>{...})
in-your-face approach from Gor's p1365r0 paper. Here, make_on_stack
creates a task-wrapper that includes space for the lambda and the promise; that space is then passed to the promise's operator new
through that magic (auto)
argument of the lambda, where a custom allocator is passed. One thing that bugs me, is that the coroutines frames are huge (I often see something like >128 bytes for a simple double_it
task on my 64bit host with clang 8). While there are probably opportunities for improvement, I find it a bit disappointing that this approach will leave a copy of the allocator alive in the coroutine frame for the whole duration of the coroutine, just because it appears in the argument list. Especially because, if it were needed in the deallocation, we would still need an additional copy, as you well explained in your blog. Furthermore, the promise also needs to keep a copy of the this
pointer pointing to that lambda, because that's where all the real coroutine arguments live. Now, it seems that coroutine_traits
does receive the type of the this
pointer, but operator new
doesn't receive the actual pointer. If this
would also be passed to
operator new
, we could invent a protocol, where that object does also expose the allocator. So here it comes: Since it seems that the coroutine specification is not yet completely finished, where would be a good place to bring this point to the attention of the design-team of the specification?
Hi Burnpanck:
Are you using clang or MSVC compiler? If clang, what version of clang? Are you compiling debug or retail?
In the currently available compilers, computation of what goes into the coroutine frame is not optimized for space yet. Information of lifetime of variables are not used and no stack packing (putting some variables on top of others based on the lifetime if they don't intersect) is not performed.
If you are not seeing a pointer to "this" passed to on operator new, I would need to know what exact compiler are you using. It should be passed in both. Note that it will pass a reference to a this object. Not a pointer (I think earlier version of MSVC were passing the pointer, until the issue was settled in the coroutines TS wording).
Note that P1681R0 has a discussion of alternative ideas for allocator customisation that should be considered.
@burnpanck Consider the following task<float>
:
template<typename Allocator>
task<float> double_it(allocator_arg_t, Allocator alloc, task<float> t) {
co_return 2.0f * co_await std::move(t);
}
On x64 a typical coroutine frame for a 'task' with customised alloctor in Clang will contain:
operator new()
needs to stash some extra data after the allocation for the coroutine-frame.
By my calculation that's about 88 bytes minimum before you add any local variables.
Depending on how task::operator co_await()
is implemented the coroutine may need to store an additional awaiter object which also contains a copy of the coroutine_handle (another 8 bytes).
The compiler is also free to store additional state (spilled registers, etc.).
MSVC has been a bit more eager to stash copies of registers across suspend points which could easily account for an extra 32 bytes.
Clang is a bit better, but still stores more local-variable state in the coroutine frame than it should.
Avoiding the extra allocator copy would only net you an 8 byte win here. If the allocator was stateless and was made a template argument to the task type then we could avoid storing the allocator altogether (in params and in allocation) which would save 24 bytes.
Sorry @GorNishanov, I had completely missed your message. I had done the host-based tests using Clang 8, the bare-metal ARM embedded compilation is done by a slightly patched version of Clang 9rc1. I have to admit that it is already some time since I had read the TS (I might possibly not have seen the most recent version). It does seem though that my operator new
only gets two arguments, the size and the allocator which was supplied via the only argument of the lambda.
In my case, I have a coroutine as follows:
auto double_it(int x){
return make_on_stack<256>([=](auto)->task<int>{
co_return 2*x;
});
}
make_on_stack
then intializes a task_on_stack<256,_some_lambda_type,int>
instance that takes the lambda, evaluates it with an allocator allocating within that object's scratch-space, and moves the task_promise
into it's own storage. The operator new
that calls the allocator has a signature as follows:
template <typename... Args>
void *operator new(std::size_t sz, Args&&... args);
However, in the above example, the args
only contains the allocator, not the lambda object. I will therefore consider filing a bug-report with clang, if I can find the corresponding wording in the Coroutine TS.
@lewissbaker thanks for your detailed answer! I wasn't aware of P1681r0, thanks for the hint. My understanding from a quick reading is that it is about standard-library elements allowing allocator implementations that do not require a parallel reimplementation of the coroutine types. I do like your proposal there, but it does not solve my case for avoiding storing the allocator state in the coroutine frame, when it is only needed during the construction of the frame.
You are of course correct that I could possibly only win one pointer here. Even worse, I'm not even sure that my plan is actually feasible: I have to move the lambda into that task_on_stack
object to keep it alive through the coroutine execution, but when I call the lambda to create the coroutine frame, it would still only receive a reference to the lambda, but not the task_on_stack
. I would have to assume that the lambda indeed lives within such an object and somehow adjust the pointer to get one pointing to the beginning of the container.
Nonetheless, I would have hoped to get a bit better trimmed coroutine frames than what you described: The allocator for make_on_stack
could be stateless (memory does not need to be released), saving us two pointers and padding. On my RAM constrained system, I would further have preferred an implementation using just one pointer to a vtable-like structure, holding the resume
and destroy
function pointers (possibly even the resumption state) in constant storage. Finally, in the make_on_stack
case, the allocator is not needed inside the coroutine, so it wouldn't have to be stored. I would have loved to see a facility in the coroutine transformation protocol that would allow me to pass the allocator without ever appearing in the argument list, there is hope that compiler optimisations eventually learn to recognise that the allocator is not needed and therefore doesn't have to be preserved.
Thus, I could imagine a coroutine frame overhead (i.e. not counting the real parameters or temporary state over suspension points) for coroutines on the stack of as little as two pointers (resume_destroy_and_state
and the continuation handle) plus the exception-pointer (if exceptions are enabled at all - we're in embedded) and the return value. Well, I'm dreaming here.
Allow callers to specify a custom allocator to use for allocating the coroutine frame.
General usage guide:
See example implementation here, first presented in cpplang slack chat 2018/02/27: https://godbolt.org/g/gRbDYf