Closed: eholk closed this issue 2 years ago.
Thanks for the careful thoughts!
Option 1 is probably the best bet. Doing most of the work up front or at insertion time is acceptable. The important aspect in my mind is to keep polling efficient, since it is such a frequent and potentially spurious operation.

Option 2 seems more akin to what most schedulers do. My only gripe is that, as it currently stands, task storage can't be reused, because tasks need to be evicted immediately once they complete to free up the ID. Using a `Slab` for pre-allocation would require insertions and removals to be synchronized, which might be hard.

The only way I'm aware of to truly do no allocations would be to statically allocate wakers, but that has many limitations.

I think option 2 is doable and would work fine, although benchmarks would be nice; admittedly, with the current state of affairs they would mostly measure how good a small-object allocator the system allocator is.
There are still more improvements that could be made, such as doing this in a way that `will_wake` works. See this comment for more details.
Yep, `will_wake` would be a great improvement. Right now I'm forced to clone a waker every time the underlying stream returns `Pending`. =(
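For reference, the pattern `will_wake` enables looks roughly like this; a minimal sketch with a hypothetical `Inner` type, not unicycle's actual code:

```rust
use std::task::{Context, Waker};

// Hypothetical state that needs to remember who to wake later.
struct Inner {
    waker: Option<Waker>,
}

impl Inner {
    // Record the caller's waker, but only clone it when it actually
    // changed: `will_wake` lets us skip the clone (and whatever
    // allocation that clone implies) on repeated `Pending` polls.
    fn register(&mut self, cx: &Context<'_>) {
        match &self.waker {
            Some(w) if w.will_wake(cx.waker()) => {}
            _ => self.waker = Some(cx.waker().clone()),
        }
    }
}
```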
`Internals::clone` currently does a `Box::new` when cloning a waker. It'd be nice to remove this if possible. I've spent some time trying to do this, but I'm not a huge fan of any of the solutions I've come up with, so I thought I'd open a discussion here to see what others think.
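For context, a simplified sketch of the shape being discussed; this is not unicycle's actual code, just the general pattern where the vtable's clone re-boxes the per-task state:

```rust
use std::task::{RawWaker, RawWakerVTable};

// Hypothetical stand-in for unicycle's Internals.
struct Internals {
    index: usize,
}

static VTABLE: RawWakerVTable =
    RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);

// The allocation under discussion: every clone re-boxes the state.
unsafe fn clone(data: *const ()) -> RawWaker {
    let internals = &*(data as *const Internals);
    let boxed = Box::new(Internals { index: internals.index });
    RawWaker::new(Box::into_raw(boxed) as *const (), &VTABLE)
}

unsafe fn wake(data: *const ()) {
    wake_by_ref(data);
    drop_waker(data);
}

unsafe fn wake_by_ref(data: *const ()) {
    let internals = &*(data as *const Internals);
    // Mark `internals.index` as ready and wake the parent waker here.
    let _ = internals.index;
}

unsafe fn drop_waker(data: *const ()) {
    drop(Box::from_raw(data as *mut Internals));
}
```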
Here are some options I had:
## Option 1: Always put `Internals` in an `Arc`

This means `Internals::clone` would just be a call to `Arc::clone`, so we don't do a new allocation. The downside is that creating a new waker would always require a heap allocation, which is strictly worse than the current state of things, where clone seems to be a relatively rare operation.
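To make the tradeoff concrete, here is a minimal sketch using std's `Wake` trait (unicycle uses its own vtable, but the shape is the same): construction pays one `Arc` allocation, and every later clone is just a refcount bump.

```rust
use std::sync::Arc;
use std::task::{Wake, Waker};

// Hypothetical stand-in for unicycle's Internals.
struct Internals {
    index: usize,
}

impl Wake for Internals {
    fn wake(self: Arc<Self>) {
        // Mark `self.index` as ready and wake the parent waker here.
        let _ = self.index;
    }
}

// Creating a waker always allocates (the Arc)...
fn make_waker(index: usize) -> Waker {
    Waker::from(Arc::new(Internals { index }))
}

// ...but `Waker::clone` afterwards is just `Arc::clone`: no allocation.
```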
## Option 2: Pre-allocate wakers

There are a couple of ways this could look, but basically it would mean either adding an extra `waker_slab` field to `Unordered` that holds a waker for each task, or changing the existing slab to hold `(Internals, T)` and keeping the waker there (whether this is safe depends on whether wakers can outlive the task they wake, which hopefully is true, but I'm not sure it has to be). For the whole solution to work, we also need to make sure the `Unordered` outlives all of its wakers as well.

This solution is nice because we just need to allocate space for the wakers once and then we can reuse them. Cloning becomes just a pointer copy.
The downside is that it uses more space. It looks like about two words per future in the set. Maybe that's not terrible, but I know one of the goals for unicycle is reducing overhead. It's a case of trading off space usage versus allocation overhead, and it's not clear which is more important.
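Assuming the `Internals` for each task live in the slab and are guaranteed to outlive every waker that points at them, the vtable could look roughly like this; a sketch of the idea, not a worked-out safe design:

```rust
use std::task::{RawWaker, RawWakerVTable, Waker};

// Hypothetical per-task state stored inside the Unordered's slab.
struct Internals {
    index: usize,
}

static VTABLE: RawWakerVTable =
    RawWakerVTable::new(clone, wake, wake_by_ref, drop_waker);

// Because the storage is pre-allocated and owned by the Unordered,
// cloning is just a pointer copy instead of a Box::new.
unsafe fn clone(data: *const ()) -> RawWaker {
    RawWaker::new(data, &VTABLE)
}

unsafe fn wake(data: *const ()) {
    wake_by_ref(data);
}

unsafe fn wake_by_ref(data: *const ()) {
    let internals = &*(data as *const Internals);
    // Mark `internals.index` as ready and wake the parent waker here.
    let _ = internals.index;
}

unsafe fn drop_waker(_data: *const ()) {
    // Nothing to free: the slab owns the storage.
}

// The open safety question is exactly this contract: the slab entry
// must outlive every waker handed out here.
unsafe fn waker_for(internals: *const Internals) -> Waker {
    Waker::from_raw(RawWaker::new(internals as *const (), &VTABLE))
}
```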
## Option 3: Pointer tricks

This is almost certain not to work. The reason we need an allocation in the first place is that `RawWaker` only has a pointer to data and a vtable, but we need to store both a pointer to the parent waker and the index of the task to wake. The idea here would be to find some way to store the task index in unused bits of the pointer to the parent waker.

On Intel architectures, we could probably get away with this currently, since only 48 address bits are used. That gives us 16 bits to store the task index, plus maybe another 2 to 4 due to alignment. This means we could conceivably support about a million tasks this way.
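To illustrate the bit-twiddling, a sketch assuming a 64-bit target where only the low 48 bits of an address are significant, which is exactly the assumption that makes this fragile:

```rust
// Pack a 16-bit task index into the unused high bits of a pointer.
// True of today's x86-64, but guaranteed nowhere.
const ADDR_BITS: u32 = 48;

fn pack(parent: *const (), index: usize) -> usize {
    // The index must fit in the bits above the address.
    debug_assert!(index < 1usize << (usize::BITS - ADDR_BITS));
    (parent as usize) | (index << ADDR_BITS)
}

fn unpack(packed: usize) -> (*const (), usize) {
    let addr = packed & ((1usize << ADDR_BITS) - 1);
    (addr as *const (), packed >> ADDR_BITS)
}
```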
That said, I don't think we can rely on all architectures having this limit, and I'm sure Intel architectures will eventually expand beyond 48 bits as well. I guess we could do something like `#[repr(align(1048576))]` (an alignment of 2^20) on the `SharedWaker`, which would guarantee 20 free bits because of alignment, but that also seems like a really nasty alignment constraint to add.

Anyway, those are three options I can see for removing this allocation. Do any of these seem worth taking? Option 2 seems most promising to me; I'm just curious, @udoprog, whether you think this is the right tradeoff for Unicycle.