smol-rs / async-executor

Efficient way of batch spawning a large number of tasks #91

Closed james7132 closed 7 months ago

james7132 commented 9 months ago

Bevy's use of async_executor tends to be very bursty, with a single thread that adds tens or hundreds of tasks in short bursts. This might be running up against the executor's need to acquire a lock on the internal state for every task that is spawned or terminated. Is it possible to expose a function on Executor that would allow the caller to grab the lock once and then spawn an arbitrary number of tasks before releasing it?
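For context, the pattern in question looks roughly like this (a minimal sketch, not Bevy's actual code): each spawn call takes and releases the executor's internal lock once, so a burst of n spawns hits the lock n times.

use async_executor::{Executor, Task};

// Bursty spawning as it works today: every iteration of this loop
// acquires and releases the executor's internal state lock once.
fn spawn_burst(ex: &Executor<'_>, n: usize) -> Vec<Task<()>> {
    (0..n).map(|_| ex.spawn(async {})).collect()
}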

notgull commented 9 months ago

The issue is that if a task finished while this "batch spawning" was going on, it would block the executor thread. So we have to be really careful about not holding that lock for too long.

I'm open to potential designs but I would be skeptical of one where the user controls the lock, as that would be easy to misuse.

james7132 commented 9 months ago

I can see that blocking the finalization of tasks might be an issue, though arguably the contention from constantly acquiring and releasing the lock while tasks are finalizing is a potential issue too.

Several potential options here:

notgull commented 9 months ago

I added a benchmark to see what would happen if I spawned 250 futures at once. Here is what I got:

single_thread/executor::spawn_batch
                        time:   [41.946 µs 42.943 µs 43.965 µs]
multi_thread/executor::spawn_batch
                        time:   [270.41 µs 290.70 µs 309.13 µs]
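The shape of that benchmark, roughly (a simplified single-threaded sketch assuming criterion and futures-lite as the harness; names and details assumed, not the exact file in the repo):

use async_executor::Executor;
use criterion::{criterion_group, criterion_main, Criterion};
use futures_lite::future;

// Spawn 250 trivial futures on the executor and wait for all of them,
// measuring the whole burst end to end.
fn spawn_batch(c: &mut Criterion) {
    let ex = Executor::new();
    c.bench_function("executor::spawn_batch", |b| {
        b.iter(|| {
            future::block_on(ex.run(async {
                let tasks: Vec<_> = (0..250).map(|_| ex.spawn(async {})).collect();
                for task in tasks {
                    task.await;
                }
            }));
        });
    });
}

criterion_group!(benches, spawn_batch);
criterion_main!(benches);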

That's just the baseline. I then added a method that takes an IntoIterator<Item = Future>, used that instead, and re-ran the benchmarks:

single_thread/executor::spawn_batch
                        time:   [16.792 µs 16.932 µs 17.087 µs]
                        change: [-65.241% -63.882% -62.424%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  13 (13.00%) high mild
  5 (5.00%) high severe
multi_thread/executor::spawn_batch
                        time:   [27.474 µs 31.257 µs 35.684 µs]
                        change: [-91.316% -89.395% -87.004%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

So this would indeed significantly improve performance when spawning multiple tasks at once. We could theoretically get around the locking issue by releasing the lock every once in a while to ease contention.
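In the abstract, a batch-spawn method that takes an IntoIterator and periodically releases the lock could look like this (a sketch using a plain Mutex<Vec<T>> to stand in for the executor's internal state; the chunk size is an assumed tuning knob, and none of these names exist in the crate):

use std::sync::Mutex;

// Insert items in fixed-size chunks, reacquiring the lock between
// chunks so that other threads (e.g. tasks being finalized) can get in.
fn insert_chunked<T>(state: &Mutex<Vec<T>>, items: impl IntoIterator<Item = T>) {
    const CHUNK: usize = 64; // assumed tuning knob
    let mut iter = items.into_iter().peekable();
    while iter.peek().is_some() {
        let mut guard = state.lock().unwrap();
        for item in iter.by_ref().take(CHUNK) {
            guard.push(item);
        }
        // `guard` is dropped here, releasing the lock between chunks.
    }
}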

Out of curiosity I also wanted to see what would happen if I used sharded-slab. It has qualities that I believe make it impractical for use in the executor, but if the performance wins are good enough I might be able to look past that and find a way to work around it.
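For reference, the appeal of sharded-slab is that inserts and removals go through sharded, mostly lock-free storage rather than one global lock. A minimal usage sketch of the crate itself (unrelated to how it would be wired into the executor):

use sharded_slab::Slab;

fn main() {
    // Concurrent slab: `insert` works through a shared reference, so no
    // external Mutex is needed around the storage.
    let slab: Slab<&str> = Slab::new();
    let key = slab.insert("task state").expect("slab at capacity");
    assert_eq!(*slab.get(key).unwrap(), "task state");
    assert!(slab.remove(key));
}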

single_thread/executor::spawn_one
                        time:   [2.4726 µs 2.5359 µs 2.6246 µs]
                        change: [-41.037% -35.820% -29.761%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) high mild
  9 (9.00%) high severe
single_thread/executor::spawn_batch
                        time:   [71.701 µs 77.016 µs 85.338 µs]
                        change: [+246.59% +285.70% +328.64%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 19 outliers among 100 measurements (19.00%)
  19 (19.00%) high severe
single_thread/executor::spawn_many_local
                        time:   [10.906 ms 11.610 ms 12.421 ms]
                        change: [+62.538% +81.006% +101.72%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  13 (13.00%) high severe
single_thread/executor::spawn_recursively
                        time:   [42.122 ms 43.602 ms 45.543 ms]
                        change: [-20.513% -17.547% -13.147%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

...

multi_thread/executor::spawn_one
                        time:   [4.5598 µs 5.0076 µs 5.4321 µs]
                        change: [+3.3844% +13.267% +24.924%] (p = 0.01 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_batch
                        time:   [210.44 µs 226.03 µs 239.71 µs]
                        change: [+548.76% +681.84% +838.41%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_many_local
                        time:   [18.928 ms 20.031 ms 21.129 ms]
                        change: [-16.292% -9.8637% -2.3774%] (p = 0.01 < 0.05)
                        Performance has improved.
Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.1s, or reduce sample count to 40.
Benchmarking multi_thread/executor::spawn_recursively: Collecting 100 samples in estimated 10.060 s (100 iterations)
error: bench failed, to rerun pass `--bench executor`

Caused by:
  process didn't exit successfully: `/hard_data/cargo_target/release/deps/executor-68f9fd4546a1882d --bench` (signal: 9, SIGKILL: kill)

So it doesn't look like sharded-slab is worth it. Not sure why it got SIGKILL'd.