udoprog / unicycle

A futures abstraction that runs a set of futures which may complete in any order.
Apache License 2.0

Is it possible to remove allocation in Internals::clone? #14

Closed eholk closed 2 years ago

eholk commented 2 years ago

Internals::clone currently does a Box::new when cloning a waker. It'd be nice to remove this allocation if possible.

I've spent some time trying to do this but I'm not a huge fan of any of the solutions I've come up with. I thought I'd open a discussion here to see what others think.

Here are some options I had:

Option 1: Always put Internals in an Arc

This means Internals::clone would just be a call to Arc::clone, so we don't do a new allocation. The downside is that creating a new waker would always require a heap allocation, which is strictly worse than the current state of things where clone seems to be a relatively rare operation.

Option 2: Pre-allocate wakers

There are a couple of ways this could look: either add an extra waker_slab field to Unordered that holds a waker for each task, or change the existing slab to hold (Internals, T) and keep the waker there (whether this is safe depends on whether wakers can outlive the task they wake, which hopefully is true, but I'm not sure it has to be). For the whole solution to work, we also need to make sure the Unordered outlives all of its wakers.

This solution is nice because we just need to allocate space for the wakers once and then we can reuse them. Cloning becomes just a pointer copy.

The downside is that it uses more space. It looks like about two words per future in the set. Maybe that's not terrible, but I know one of the goals for unicycle is reducing overhead. It's a case of trading off space usage versus allocation overhead, and it's not clear which is more important.

Option 3: Pointer tricks

This is almost certain not to work. The reason we need an allocation in the first place is that RawWaker only has a pointer to data and a vtable, but we need to store both a pointer to the parent waker as well as the index of the task to wake. The idea here would be to find some way to store the task index in some unused bits in the pointer to the parent waker.

On Intel architectures, we could probably get away with this currently, since only 48 address bits are used. That gives us 16 to store the task index, plus maybe another 2 to 4 due to alignment. This means we could conceivably support up to about a million tasks this way.

That said, I don't think we can rely on all architectures having this limit, and I'm sure Intel architectures will eventually expand beyond 48 bits as well. I guess we could do something like #[repr(align(1 << 20))] on the SharedWaker, which would guarantee 20 free bits because of alignment, but that also seems like a really nasty alignment constraint to add.

Anyway, those are three options I can see for removing this allocation. Do any of these seem worth taking? Option 2 seems most promising to me; I'm just curious, @udoprog, whether you think this is the right tradeoff for Unicycle.

udoprog commented 2 years ago

Thanks for the careful thoughts!

Option 1 is probably the best bet. Doing most of the work up front or at insertion time is acceptable. The important aspect in my mind is to keep polling efficient, since it is such a frequent and potentially spurious operation.

Option 2 seems more akin to what most schedulers do. My only gripe is that, as it currently stands, task storage can't be reused, because tasks need to be evicted immediately once they complete to free up the ID. Using a Slab for pre-allocation would require insertions and removals to be synchronized, which might be hard.

The only way I'm aware of to truly do no allocations would be to statically allocate wakers, but that has many limitations.

I think 2 is doable and would work fine. It would be nice to have benchmarks, though admittedly they would mostly measure how good of a small-object allocator the system allocator is.

eholk commented 2 years ago

#15 implemented a variant of Option 2, so I think we can close this now.

There are still more improvements that could be made, such as doing this in a way that makes will_wake work. See this comment for more details.

loyd commented 1 year ago

Yep, will_wake would be a great improvement. Right now I'm forced to clone a waker every time the underlying stream returns Pending =(