smol-rs / async-executor


Consider using st3 for work-stealing #32

Open · james7132 opened this issue 1 year ago

james7132 commented 1 year ago

st3 looks to be a fairly promising public crate for implementing tokio's !Send local worker queues, which eliminate most of the atomic operations when the owning worker pops from its own queue.

Right now async_executor uses concurrent_queue for these purposes, but realistically only the work-stealing operations need to be thread-safe; the local worker popping from its own queue does not actually need to be Send.

Some additional scrutiny of its thread-safety model, and some API compatibility additions, are definitely needed (see asynchronics/st3#2).
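For reference, the ownership split that an st3-style queue encodes looks roughly like the sketch below. These are not st3's real types: a `Mutex<VecDeque>` stands in for its lock-free ring buffer, and the names are illustrative. The point is only that the owner handle is unique and `!Send`, so its pop path can be optimized, while the stealer handles are the only part that crosses threads.

```rust
use std::collections::VecDeque;
use std::marker::PhantomData;
use std::sync::{Arc, Mutex};

/// Owner handle: exactly one per executor thread. The raw-pointer marker
/// makes it !Send, encoding "pops only happen on the owning thread".
struct Worker<T> {
    queue: Arc<Mutex<VecDeque<T>>>,
    _not_send: PhantomData<*mut ()>,
}

/// Stealer handle: cheap to clone and freely shared with other threads.
#[derive(Clone)]
struct Stealer<T> {
    queue: Arc<Mutex<VecDeque<T>>>,
}

impl<T> Worker<T> {
    /// Owner-only hot path; in an st3-style queue this is where the
    /// read-modify-write atomics go away.
    fn pop(&mut self) -> Option<T> {
        self.queue.lock().unwrap().pop_back()
    }
}

impl<T> Stealer<T> {
    /// Called from other executor threads; takes from the opposite end so it
    /// rarely contends with the owner's pops.
    fn steal(&self) -> Option<T> {
        self.queue.lock().unwrap().pop_front()
    }
}
```

concurrent_queue, by contrast, exposes a single handle whose push, pop, and steal operations all go through the same fully synchronized path.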

notgull commented 1 year ago

In theory, I'd like to avoid adding things like this to this crate. One of the strengths of this crate (as well as most of the other crates in this repository) is that the code is easy to reason about: it's easier to understand at first glance and easier to trust. While the current work-stealing implementation can be improved (I actually have a branch where I'm working on this), I think keeping it simple is probably the best option. There are other crates for those who want more optimized executors.

That being said, this brings up another issue. The local queues shouldn't really have to be thread-safe that often. Could we replace them with a Mutex<VecDeque> instead of a ConcurrentQueue, now that Mutexes are more finely grained?
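A minimal sketch of that idea, assuming the steal path uses `try_lock` so stealers back off rather than block the owner (the type and method names here are illustrative, not async-executor's internals):

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// A local queue as a plain Mutex<VecDeque>. The owning thread takes the
/// lock for push/pop (usually uncontended), while stealers use try_lock.
struct LocalQueue<T> {
    items: Mutex<VecDeque<T>>,
}

impl<T> LocalQueue<T> {
    fn pop(&self) -> Option<T> {
        self.items.lock().unwrap().pop_front()
    }

    /// Take up to `max` items (at most half the queue) from the back.
    fn steal_batch(&self, max: usize) -> Vec<T> {
        match self.items.try_lock() {
            Ok(mut items) => {
                let n = (items.len() / 2).min(max);
                items.drain(..n).collect()
            }
            // Someone else holds the lock; don't spin, just give up.
            Err(_) => Vec::new(),
        }
    }
}
```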

james7132 commented 1 year ago

> In theory, I'd like to avoid adding things like this to this crate. One of the strengths of this crate (as well as most of the other crates in this repository) is that the code is easy to reason about: it's easier to understand at first glance and easier to trust. While the current work-stealing implementation can be improved (I actually have a branch where I'm working on this), I think keeping it simple is probably the best option. There are other crates for those who want more optimized executors.

I'd argue that the user-facing logic for st3 is as simple as (or even simpler than) concurrent_queue's. The only part that might warrant extra scrutiny is the underlying thread-safety model, which is being tested with both loom and Miri on both ends.

> That being said, this brings up another issue. The local queues shouldn't really have to be thread-safe that often. Could we replace them with a Mutex<VecDeque> instead of a ConcurrentQueue, now that Mutexes are more finely grained?

Isn't this still going to be about the same overhead? The mutex will always incur at least one atomic operation to check the lock status, if not a full syscall when it's actually contended.
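To make the baseline concrete: even a toy lock's uncontended `try_lock` is one atomic compare-and-swap, so a `Mutex<VecDeque>` pop can never be cheaper than that (illustrative sketch only; std's real `Mutex` additionally falls back to a futex or equivalent syscall under contention):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Toy spin-style lock, just to show where the cost comes from.
struct RawLock {
    locked: AtomicBool,
}

impl RawLock {
    fn try_lock(&self) -> bool {
        // Even when nobody else holds the lock, this is a full atomic
        // read-modify-write (compare-and-swap) before the critical section.
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    fn unlock(&self) {
        // ...and a release store afterwards.
        self.locked.store(false, Ordering::Release);
    }
}
```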

notgull commented 1 year ago

> The only part that might warrant extra scrutiny is the underlying thread-safety model, which is being tested with both loom and Miri on both ends.

I'd rather avoid depending on something like this while it's still unstable. I'll leave this issue open until st3 is in a more stable state.

> Isn't this still going to be about the same overhead? The mutex will always incur at least one atomic operation to check the lock status, if not a full syscall when it's actually contended.

In smol-rs/event-listener#34, I found that concurrent-queue was slower than a naïve atomic singly linked list when the list is expected to be empty. Given that contention on the local queues is likely to be very low, I think it's worth at least an experiment.
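For illustration, the "naïve atomic singly linked list" idea reduces the expected-empty check to a single atomic load of one head pointer, as in the toy sketch below (removal is omitted because safe memory reclamation is the hard part; none of this is event-listener's actual code):

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

/// The whole list is one atomic head pointer, with none of a ring buffer's
/// index bookkeeping.
struct PushList<T> {
    head: AtomicPtr<Node<T>>,
}

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

impl<T> PushList<T> {
    fn new() -> Self {
        PushList { head: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Fast path: when nothing is queued, this is a single atomic load.
    fn is_empty(&self) -> bool {
        self.head.load(Ordering::Acquire).is_null()
    }

    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut head = self.head.load(Ordering::Relaxed);
        loop {
            unsafe { (*node).next = head };
            match self.head.compare_exchange_weak(
                head,
                node,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }
}
```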

james7132 commented 2 months ago

I tested this out and saw substantial improvements in the single-threaded cases, but a major regression in some of the multi-threaded cases. The single-threaded improvements make sense, given that popping from a local worker's queue no longer requires any atomic fences. However, it's unclear why there's such a large regression in the spawn_one and spawn_batch benchmarks while the other multi-threaded results are so mixed.

I suspect it has some unusual interactions with the locking we're doing while stealing/spawning tasks. I'll test this out again with the benchmarks in #112 soon.
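For context on what the numbers below measure, a spawn_one-style benchmark has roughly the following shape. This is a guess at the setup using the public Executor API, not the actual code from the repository's benches (which also drive `Executor::run` on several background threads for the multi_thread group):

```rust
use async_executor::Executor;
use criterion::{criterion_group, criterion_main, Criterion};
use futures_lite::future;

fn spawn_one(c: &mut Criterion) {
    let ex = Executor::new();
    c.bench_function("executor::spawn_one", |b| {
        b.iter(|| {
            // Each iteration spawns a single empty task and drives the
            // executor until that task completes.
            future::block_on(ex.run(async { ex.spawn(async {}).await }));
        })
    });
}

criterion_group!(benches, spawn_one);
criterion_main!(benches);
```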

executor::create        time:   [680.82 ns 681.07 ns 681.36 ns]
                        change: [-38.393% -38.318% -38.236%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

single_thread/executor::spawn_one
                        time:   [988.49 ns 1.0130 µs 1.0432 µs]
                        change: [-32.591% -27.985% -22.563%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
single_thread/executor::spawn_batch
                        time:   [35.322 µs 38.898 µs 43.128 µs]
                        change: [-43.524% -36.397% -28.635%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
single_thread/executor::spawn_many_local
                        time:   [4.8108 ms 4.8590 ms 4.9103 ms]
                        change: [-9.3121% -7.8849% -6.5257%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
single_thread/executor::spawn_recursively
                        time:   [39.891 ms 40.090 ms 40.296 ms]
                        change: [-21.519% -20.623% -19.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
single_thread/executor::yield_now
                        time:   [4.5668 ms 4.6005 ms 4.6468 ms]
                        change: [-13.465% -12.791% -11.894%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

multi_thread/executor::spawn_one
                        time:   [15.152 µs 15.592 µs 15.942 µs]
                        change: [+575.08% +618.49% +667.97%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_batch
                        time:   [57.882 µs 62.464 µs 67.913 µs]
                        change: [+59.846% +70.150% +79.827%] (p = 0.00 < 0.05)
                        Performance has regressed.
multi_thread/executor::spawn_many_local
                        time:   [28.761 ms 28.858 ms 28.954 ms]
                        change: [-8.3918% -6.4482% -4.4179%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
Benchmarking multi_thread/executor::spawn_recursively: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.2s, or reduce sample count to 20.
multi_thread/executor::spawn_recursively
                        time:   [180.14 ms 180.45 ms 180.79 ms]
                        change: [-3.0653% -2.7293% -2.4013%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
multi_thread/executor::yield_now
                        time:   [23.286 ms 23.361 ms 23.432 ms]
                        change: [+4.6276% +5.2948% +5.9920%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild