Boxing large futures is unsubstantiated

zetanumbers commented 4 weeks ago

In the context of da62e65e64bcd54e3d427448069f47a9b1c3b6bd, pay attention to this code:

// src/runnable.rs:515

// Allocate large futures on the heap.
let ptr = if mem::size_of::<Fut>() >= 2048 {
    let future = |meta| {
        let future = future(meta);
        Box::pin(future)
    };

    RawTask::<_, Fut::Output, S, M>::allocate(future, schedule, self)
} else {
    RawTask::<Fut, Fut::Output, S, M>::allocate(future, schedule, self)
};

The future is boxed whenever its size is greater or equal to 2048 bytes. Comment tries to elaborate on this condition but it seems not to notice that RawTask::allocate already places the future on the heap next to the header. I don't see any reason to additionally box the future even if it is large by itself.

However one might argue that running future shares space with its output result after the completion, thus boxed future would deallocate the space which that future would have occupied as soon as it completes, possibly saving some space until its Task object is consumed in some way. Then however the right condition would be to compare future's size with its output type's size, to make sure there is a substantial difference between them to save on:

mem::size_of::<Fut>() >= mem::size_of::<Result<Fut::Output, Panic>> + 2048

Even then that approach would be unsubstantiated, due to not having data to show any practical benefit. There could be not enough delay while completed task is held in memory to cause substantial memory usage cost.

It also introduces an additional pointer indirection which cannot be avoided, even if user would have liked to do so. However in case this branch is removed, user would be able to box up the future on their side to reach the old behavior. Given this library's purpose is to use it to "create a custom async executor", I am inclined to believe latter choice would be more appropriate, while this behavior and possible optimization should probably be noticed in the documentation as well.

I've even made some simple benchmarks to check how the code would perform with and without the Box::pin branch:

#[bench]
fn large_task_create(b: &mut Bencher) {
    b.iter(|| {
        let data = [0; 0x40000];
        let _ = async_task::spawn(
            async move {
                black_box(data);
            },
            drop,
        );
    });
}

#[bench]
fn large_task_run(b: &mut Bencher) {
    b.iter(|| {
        let data = [0; 0x40000];
        let (runnable, task) = async_task::spawn(
            async move {
                black_box(data);
            },
            drop,
        );
        runnable.run();
        future::block_on(task);
    });
}

I have found no noticeable difference, while roughly accounting for instability of benchmark results.

This issue also relates to #66, which highlights one of drawbacks of this decision.

taiki-e commented 4 weeks ago

AFAIK, boxing is to avoid stack overflows. However, I think the boundary should be larger in the release mode like tokio does. https://github.com/tokio-rs/tokio/blob/512e9decfb683d22f4a145459142542caa0894c9/tokio/src/runtime/mod.rs#L400

Another approach that would may work is to use #[inline(always)] to avoid intermediate allocations on the stack.

zetanumbers commented 4 weeks ago

AFAIK, boxing is to avoid stack overflows.

That would be good, but I only can see the opposite being true. Particularly if you change data array size from 0x4_0000 to 0x10_0000 from aforementioned benchmarks you'll get the stack overflow until you actually delete that branch. This indicates that the optimizer is able to perform in-place initialization, even if it doesn't use Box type, which does make sense considering there's no significant difference for LLVM. However it also seem like that the optimizer struggles with the added complexity from this if branch.

There could have been a possibility that this change did help to avoid stack overflow, considering it did not create a new closure back then change was introduced, but called Box::pin directly inside of the discussed branch (see d3011bfde9ef119674710f57895c545383eedf7b). Even if that was true (besides that nobody paid attention or got any stack overflows I've encountered), you can make a conclusion this feature is not functional.

smol-rs / async-task

Boxing large futures is unsubstantiated #83