
rt: Block-based bounded Workstealing queue for tokio #5240

Open jschwe opened 1 year ago

jschwe commented 1 year ago

BWoS: Block-based bounded workstealing queue for tokio

We have developed a block-based workstealing queue, based on BBQ, which can considerably improve the performance of the multi-threaded scheduler backend. For example, for hyper-based web frameworks we measured around 20% higher throughput on an x86 server (benchmarking a simple hello-world echo with rewrk). Our queue is intended as an alternative to the existing queue in the multi-threaded scheduler. We are opening this issue to first get some general feedback from the tokio maintainers before opening a pull request.
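For context, the hyper workload we benchmarked is essentially a plain hello-world handler on the default multi-threaded runtime. A minimal sketch of such a server (hyper 0.14-style API; the port and feature flags are assumptions, not our exact setup) could look roughly like this:

```rust
// Sketch only: assumes hyper 0.14 with the "full" feature and tokio with "full".
use std::convert::Infallible;

use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server};

// Minimal hello-world handler, roughly the kind of workload driven by rewrk.
async fn hello(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    Ok(Response::new(Body::from("Hello, World!")))
}

#[tokio::main]
async fn main() {
    let addr = ([127, 0, 0, 1], 8080).into(); // port chosen arbitrarily for the sketch
    let make_svc = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(hello)) });
    // #[tokio::main] uses the multi-threaded runtime by default, which is the
    // scheduler whose workstealing queue this proposal replaces.
    if let Err(e) = Server::bind(&addr).serve(make_svc).await {
        eprintln!("server error: {}", e);
    }
}
```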

Background: The BBQ (Block-based Bounded Queue)

BBQ, presented at ATC'22, is a novel ring-buffer design that splits the buffer into multiple blocks. Its performance is very good, especially in the SPSC scenario: in SPSC micro-benchmarks, BBQ yields 11.3x to 42.4x higher throughput than the ring buffers from the Linux kernel, DPDK, Boost, and Folly. In real-world scenarios, BBQ achieves up to 1.5x, 50.5x, and 11.1x performance improvements in benchmarks of DPDK, Linux io_uring, and Disruptor, respectively.

Limitations of the current queue in tokio

The current queue in tokio is a variant of an SPMC queue. Its performance is limited by the compare-and-swap operation on the consumer/thief side, and a stealing operation can also cause a cache miss on the owner side, similar to the enq-deq interference described in the BBQ paper.
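To illustrate the point, here is a simplified sketch (not tokio's actual code) of CAS-based stealing on a single ring: the thief claims items by advancing the shared head with a compare-and-swap, so every steal contends with the owner on the same cache line.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Simplified sketch of CAS-based stealing on a single ring buffer (not the
/// real tokio implementation). Under contention the compare-and-swap retries
/// and bounces the cache line that the owner's push/pop also touches.
fn steal_claim(head: &AtomicU32, tail: &AtomicU32) -> u32 {
    loop {
        let h = head.load(Ordering::Acquire);
        let t = tail.load(Ordering::Acquire);
        let available = t.wrapping_sub(h);
        if available == 0 {
            return 0; // nothing to steal
        }
        let n = available - available / 2; // claim roughly half of the items
        match head.compare_exchange(h, h.wrapping_add(n), Ordering::AcqRel, Ordering::Acquire) {
            Ok(_) => return n,  // claimed `n` items starting at index `h`
            Err(_) => continue, // lost the race against the owner or another thief
        }
    }
}
```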

About the BWoS queue

The BWoS queue is based on the BBQ (Block-based Bounded Queue) and is specifically designed for the workstealing scenario. The block-based design allows thieves to steal from the middle of the queue and thus prevents stealing from slowing down the owner (producer + consumer). It uses block-level synchronization to ensure that the single consumer always operates on a different block than the thieves, which removes consumer/thief synchronization inside a block. We evaluated the BWoS queue in several micro- and end-to-end benchmarks, which show that this queue design can offer significant improvements over the current queue, e.g. in the scenario of an HTTP server such as hyper.
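As a rough illustration of the layout (a sketch only, not the actual BWoS implementation), each block carries its own cursors, so the consumer and the thieves only interact at block granularity:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::AtomicUsize;

/// Rough layout sketch: every block keeps its own cursors, so the single
/// consumer and the thieves synchronize only when moving between blocks
/// instead of on every push/pop/steal.
struct Block<T> {
    committed: AtomicUsize,                   // producer progress inside this block
    consumed: AtomicUsize,                    // consumer progress inside this block
    stolen: AtomicUsize,                      // thief progress inside this block
    slots: Box<[UnsafeCell<MaybeUninit<T>>]>, // fixed number of task slots
}

struct BwosQueue<T> {
    blocks: Box<[Block<T>]>,
    producer_block: AtomicUsize, // block the owner is currently filling
    consumer_block: AtomicUsize, // block the owner is currently draining
    // Thieves only claim blocks other than `consumer_block`, which is what
    // removes consumer/thief synchronization on the hot path.
}
```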

Practical Verification

We have used practical verification based on GenMC and VSync to verify that the atomic operations of the queue algorithm are sufficiently strong. There is ongoing work to verify the Rust code directly, but that is not ready yet.

Semantic differences between BWoS and the current queue design

We achieve the large performance improvement by requiring that the consumer and the thieves operate on different blocks. However, if only one block in the queue contains data, stealing is not possible. The original queue may therefore be better suited to scenarios with low queue utilization and load imbalance. We propose the BWoS queue as an alternative that users can select based on their usage scenario.

Preliminary Benchmark results

Queue Microbenchmarks: Performance without stealing

In the first microbenchmark we simply enqueue until the queue is full and then dequeue until the queue is empty (no stealing involved). Each element is a u64, and criterion is used for the measurements. The BWoS queue column contains a range because performance varies with the number of elements per block: the lower bound corresponds to a queue of size 256 divided into 8 blocks of 32 elements each, and the upper bound to a queue with 8 blocks of 1024 elements each. The performance of the original tokio queue stays the same regardless of the total queue size.

| Hardware | BWoS queue | Original tokio queue | Speedup |
|---|---|---|---|
| Hi1612 (Cortex-A57) | 75.6 - 79.8 Melem/s | 28.8 Melem/s | 2.6x - 2.7x |
| Kunpeng 920 | 225.0 - 176.6 Melem/s | 54.5 Melem/s | 4.1x - 3.2x |
| Intel Xeon 6278C | 465 - 600 Melem/s | 137 Melem/s | 3.3x - 4.3x |

Note: The performance drop for the BWoS queue on the Kunpeng 920 with 8 blocks of 1024 u64 elements is likely because the L1 cache on the Kunpeng 920 is the same size as the queue.
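For reference, the shape of this microbenchmark is roughly the following criterion harness. `VecDeque` is only a stand-in here for the queue under test, and the sizes correspond to the 8 x 32 configuration above:

```rust
use std::collections::VecDeque;

use criterion::{criterion_group, criterion_main, Criterion, Throughput};

// Illustrative sketch of the fill-then-drain microbenchmark; the real benchmark
// runs against the tokio local run queue and the BWoS queue instead of VecDeque.
const QUEUE_SIZE: usize = 256; // e.g. 8 blocks * 32 elements

fn bench_fill_then_drain(c: &mut Criterion) {
    let mut group = c.benchmark_group("queue");
    group.throughput(Throughput::Elements(QUEUE_SIZE as u64));
    group.bench_function("fill_then_drain", |b| {
        let mut queue: VecDeque<u64> = VecDeque::with_capacity(QUEUE_SIZE);
        b.iter(|| {
            for i in 0..QUEUE_SIZE as u64 {
                queue.push_back(i); // enqueue until the queue is full
            }
            while let Some(v) = queue.pop_front() {
                criterion::black_box(v); // dequeue until the queue is empty
            }
        });
    });
    group.finish();
}

criterion_group!(benches, bench_fill_then_drain);
criterion_main!(benches);
```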

Queue Microbenchmarks: Performance with stealing

In this microbenchmark we evaluate queue performance with a variable percentage of stolen items. The advantage of the BWoS queue grows with the percentage of stolen items. This benchmark was run on a server with an Intel Xeon 6266C, with the BWoS queue configured with 8 blocks and 1024 elements per block.

| Stealing percentage | Original tokio queue throughput (median MOp/s) | BWoS queue throughput (median MOp/s) | Throughput improvement |
|---|---|---|---|
| 0 | 154.5 | 554.2 | 3.6x |
| 1 | 130.9 | 517.7 | 4.0x |
| 2 | 106.5 | 501.7 | 4.7x |
| 5 | 77.2 | 475.2 | 6.2x |
| 10 | 49.2 | 467.8 | 9.5x |
| 20 | 30.6 | 468.1 | 15.3x |

Tokio Microbenchmarks

Some of the tokio benchmarks vary greatly between runs, but the general picture shows a performance improvement.

cargo benchcmp output (Intel Xeon 6278C)

```
name  tokio_old_queue_bench.log ns/iter  tokio_bwos_bench.log ns/iter  diff ns/iter  diff %  speedup
async_read_codec  17,439,173  16,984,672  -454,501  -2.61%  x 1.03
chained_spawn  187,157  196,417  9,260  4.95%  x 0.95
contended_concurrent_multi  21,047  19,056  -1,991  -9.46%  x 1.10
contention_bounded  1,322,048  1,143,263  -178,785  -13.52%  x 1.16
contention_bounded_full  1,772,407  1,674,274  -98,133  -5.54%  x 1.06
ping_pong  1,571,164  1,539,545  -31,619  -2.01%  x 1.02
read_concurrent_uncontended_multi  19,984  20,718  734  3.67%  x 0.96
send_medium  652  669  17  2.61%  x 0.97
spawn_many  6,876,600  7,647,182  770,582  11.21%  x 0.90
threaded_scheduler_spawn  18,532  19,688  1,156  6.24%  x 0.94
threaded_scheduler_spawn10  20,838  20,420  -418  -2.01%  x 1.02
uncontended_concurrent_multi  18,963  20,702  1,739  9.17%  x 0.92
uncontented_unbounded  349,593  337,544  -12,049  -3.45%  x 1.04
```

Hyper Microbenchmarks

We only show results that differ by more than 2%. Again, some of the built-in benchmarks vary greatly between runs.

cargo benchcmp output (Intel Xeon 6278C)

```
name  hyper_original_tokio_bench.log ns/iter  hyper_bwos_tokio_bench.log ns/iter  diff ns/iter  diff %  speedup
aggregate::bytes_10_000_count_1  252 (39682 MB/s)  180 (55555 MB/s)  -72  -28.57%  x 1.40
aggregate::bytes_10_000_count_10  595 (168067 MB/s)  525 (190476 MB/s)  -70  -11.76%  x 1.13
aggregate::bytes_1_000_count_10  595 (16806 MB/s)  525 (19047 MB/s)  -70  -11.76%  x 1.13
aggregate::bytes_1_000_count_2  283 (7067 MB/s)  212 (9433 MB/s)  -71  -25.09%  x 1.33
http1_consecutive_x1_both_100kb  60,980 (3358 MB/s)  58,774 (3484 MB/s)  -2,206  -3.62%  x 1.04
http1_consecutive_x1_empty  23,346  22,812  -534  -2.29%  x 1.02
http1_parallel_x10_req_10mb  46,945,488 (2233 MB/s)  45,836,062 (2287 MB/s)  -1,109,426  -2.36%  x 1.02
http1_parallel_x10_res_1mb  3,708,796 (2827 MB/s)  4,091,324 (2562 MB/s)  382,528  10.31%  x 0.91
http2_consecutive_x1_empty  41,613  40,321  -1,292  -3.10%  x 1.03
http2_consecutive_x1_req_100kb  97,574 (1049 MB/s)  95,488 (1072 MB/s)  -2,086  -2.14%  x 1.02
http2_consecutive_x1_req_10b  49,707  47,258  -2,449  -4.93%  x 1.05
http2_parallel_x10_empty  137,018  133,132  -3,886  -2.84%  x 1.03
manual_into_vec::bytes_10_000_count_1  339 (29498 MB/s)  279 (35842 MB/s)  -60  -17.70%  x 1.22
manual_into_vec::bytes_10_000_count_10  11,610 (8613 MB/s)  12,184 (8207 MB/s)  574  4.94%  x 0.95
manual_into_vec::bytes_1_000_count_10  801 (12484 MB/s)  763 (13106 MB/s)  -38  -4.74%  x 1.05
manual_into_vec::bytes_1_000_count_2  344 (5813 MB/s)  275 (7272 MB/s)  -69  -20.06%  x 1.25
to_bytes::bytes_10_000_count_1  242 (41322 MB/s)  166 (60240 MB/s)  -76  -31.40%  x 1.46
to_bytes::bytes_10_000_count_10  11,793 (8479 MB/s)  12,475 (8016 MB/s)  682  5.78%  x 0.95
to_bytes::bytes_1_000_count_10  842 (11876 MB/s)  776 (12886 MB/s)  -66  -7.84%  x 1.09
to_bytes::bytes_1_000_count_2  366 (5464 MB/s)  280 (7142 MB/s)  -86  -23.50%  x 1.31
```

Rust Web benchmarks

Rust web benchmarks is a tool that automatically benchmarks different Rust web frameworks with the rewrk tool. The table shows the throughput improvement on an Intel Xeon 6278C with 16 cores and a queue size of 256 entries (8 blocks of 32 elements for BWoS) when replacing tokio with our modified tokio using the BWoS queue.

actix-web, astra, ntex, and tide were removed from the table, since they don't use the multi-threaded runtime.

| Web framework | BWoS throughput improvement |
|---|---|
| axum | 1.21x |
| hyper | 1.19x |
| poem | 1.17x |
| rocket | 1.17x |
| salvo | 1.20x |
| viz | 1.20x |
| warp | 1.20x |
Raw summary

## Using original tokio

| Framework Name | Latency.Avg | Latency.Stdev | Latency.Min | Latency.Max | Request.Total | Request.Req/Sec | Transfer.Total | Transfer.Rate | Max. Memory Usage |
|---|---|---|---|---|---|---|---|---|---|
| actix-web | 0.94ms | 0.69ms | 0.03ms | 29.07ms | 15997992 | 533347.69 | 1.94GB | 66.12MB/Sec | 12.7MB |
| astra | 1.00ms | 0.67ms | 0.04ms | 205.34ms | 7093106 | 236470.13 | 723.82MB | 24.13MB/Sec | 5.0MB |
| axum | 1.48ms | 0.96ms | 0.03ms | 15.29ms | 10156677 | 338607.36 | 1.23GB | 41.98MB/Sec | 13.0MB |
| hyper | 1.37ms | 0.97ms | 0.03ms | 17.32ms | 10918061 | 364006.04 | 926.69MB | 30.90MB/Sec | 12.7MB |
| ntex | 1.18ms | 0.76ms | 0.03ms | 21.60ms | 12721268 | 424135.80 | 1.53GB | 52.18MB/Sec | 8.4MB |
| poem | 1.44ms | 0.96ms | 0.03ms | 18.01ms | 10411024 | 347101.63 | 1.26GB | 43.03MB/Sec | 16.4MB |
| rocket | 1.96ms | 1.22ms | 0.05ms | 19.07ms | 7653164 | 255153.88 | 1.77GB | 60.35MB/Sec | 17.8MB |
| salvo | 1.45ms | 0.95ms | 0.03ms | 16.23ms | 10338227 | 344657.08 | 1.25GB | 42.73MB/Sec | 14.9MB |
| thruster | 20.31ms | 5.93ms | 0.58ms | 169.81ms | 737976 | 24601.49 | 0.00B | 0.00B/Sec | 6.6MB |
| tide | 44.61ms | 2.23ms | 0.05ms | 49.78ms | 334801 | 11161.55 | 41.24MB | 1.37MB/Sec | 22.8MB |
| viz | 1.42ms | 0.97ms | 0.03ms | 17.48ms | 10522233 | 350807.36 | 1.27GB | 43.49MB/Sec | 12.5MB |
| warp | 1.40ms | 0.96ms | 0.03ms | 16.53ms | 10702870 | 356820.39 | 1.30GB | 44.24MB/Sec | 14.5MB |

## Using BWoS tokio

| Framework Name | Latency.Avg | Latency.Stdev | Latency.Min | Latency.Max | Request.Total | Request.Req/Sec | Transfer.Total | Transfer.Rate | Max. Memory Usage |
|---|---|---|---|---|---|---|---|---|---|
| actix-web | 1.02ms | 1.04ms | 0.03ms | 21.27ms | 14719146 | 490680.92 | 1.78GB | 60.83MB/Sec | 13.2MB |
| astra | 0.88ms | 0.65ms | 0.03ms | 209.66ms | 8158255 | 271969.12 | 832.51MB | 27.75MB/Sec | 5.1MB |
| axum | 1.22ms | 0.96ms | 0.03ms | 25.11ms | 12303633 | 410204.96 | 1.49GB | 50.86MB/Sec | 13.2MB |
| hyper | 1.15ms | 0.91ms | 0.02ms | 23.77ms | 13011242 | 433753.41 | 1.08GB | 36.82MB/Sec | 13.2MB |
| ntex | 1.21ms | 1.28ms | 0.03ms | 29.19ms | 12406456 | 413609.40 | 1.49GB | 50.88MB/Sec | 8.5MB |
| poem | 1.22ms | 0.98ms | 0.03ms | 22.55ms | 12277539 | 409327.01 | 1.49GB | 50.75MB/Sec | 15.2MB |
| rocket | 1.67ms | 1.51ms | 0.04ms | 32.80ms | 8961411 | 298777.46 | 2.07GB | 70.66MB/Sec | 18.8MB |
| salvo | 1.20ms | 0.97ms | 0.03ms | 23.28ms | 12478917 | 416008.48 | 1.51GB | 51.58MB/Sec | 13.2MB |
| thruster | 23.15ms | 8.08ms | 0.18ms | 275.49ms | 647365 | 21582.67 | 0.00B | 0.00B/Sec | 7.7MB |
| tide | 44.30ms | 2.20ms | 0.06ms | 201.30ms | 336457 | 11215.61 | 41.44MB | 1.38MB/Sec | 24.9MB |
| viz | 1.18ms | 0.94ms | 0.03ms | 20.18ms | 12667001 | 422280.60 | 1.53GB | 52.35MB/Sec | 12.3MB |
| warp | 1.16ms | 0.92ms | 0.03ms | 18.70ms | 12943961 | 431528.45 | 1.57GB | 53.50MB/Sec | 13.5MB |

Future Codesign

For our initial proposal we keep the semantics of our queue as close as possible to the original queue implementation (e.g., producer and consumer on the same core) so that we don't need to change any upper-layer code (e.g., worker.rs). Thanks to our block-based design, we could also provide a run queue where the producer and consumer don't have to be on the same thread/core (by adding only one barrier to the existing BWoS queue) while keeping similar performance. We would like to know whether this could simplify or improve the upper-layer design.

Open Questions

Noah-Kennedy commented 1 year ago

This feels like a good candidate for a new runtime flavor.

Regarding the MSRV, a 1.58 bump would be within our MSRV policy as that version landed in January.

Darksonn commented 1 year ago

It sounds interesting. The fact that stealing can't happen within a block prevents this from becoming the default, but having it be optional using a flavor or some other type of option could be a possibility.

As for MSRV, we generally only bump the MSRV when there's a good reason to do so, even if the bump is allowed by the 6 month policy. Maybe a bump is warranted, but it depends on how cumbersome it is to work around.

jschwe commented 1 year ago

> The fact that stealing can't happen within a block prevents this from becoming the default, but having it be optional using a flavor or some other type of option could be a possibility.

We figured as much. To be a bit more precise, stealing can't happen on the block the consumer is currently on, so this is only visible when the queue has low utilization.

> As for MSRV, we generally only bump the MSRV when there's a good reason to do so, even if the bump is allowed by the 6 month policy. Maybe a bump is warranted, but it depends on how cumbersome it is to work around.

Our queue can be configured by choosing a suitable number of blocks and a number of items per block. With more elements per block, the queue can use the fast path (inside one block) more often, which improves producer/consumer performance. We could just hardcode a value, but that prevents the user from fine-tuning the queue to their use case. Maybe some tokio users would also be interested in being able to configure the queue size in general, so that the global inject queue is used less often?
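To make the idea concrete, the configuration surface could look like the following sketch; the type and field names here are purely hypothetical and do not exist in tokio or in our patch today:

```rust
/// Hypothetical configuration knobs for a BWoS-backed local run queue
/// (illustrative only; no such type exists in tokio).
pub struct BwosQueueConfig {
    /// Number of blocks in each worker's local run queue.
    pub num_blocks: usize,
    /// Number of task slots per block. Larger blocks keep the owner on the
    /// block-local fast path longer, at the cost of coarser-grained stealing.
    pub elements_per_block: usize,
}

impl BwosQueueConfig {
    /// Total capacity of the local queue.
    pub fn capacity(&self) -> usize {
        self.num_blocks * self.elements_per_block
    }
}

// Example: most benchmarks in this issue correspond to
// `BwosQueueConfig { num_blocks: 8, elements_per_block: 32 }`,
// i.e. a capacity of 256, matching the current local queue size.
```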

We currently use a couple of other features that are not stable on 1.51, but if I remember correctly none of them were too important and they can be worked around (I think const asserts were the only other major "nice to have").

If you are interested, I would prepare a draft PR. For the initial review I would propose to just hardcode our queue as a replacement, since that makes it easier for reviewers to benchmark changes by simply swapping out tokio in their project for our patched version.

Darksonn commented 1 year ago

Sure, I'd be happy to see a PR.

Regarding the hard-coding of queue sizes, this only requires const generics if you want the size to be a compile-time constant. Maybe that is important for the performance of the queue — but it also has the disadvantage that the user cannot configure it to a value chosen at runtime (e.g. from a config file).
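As an illustration of that trade-off (the type names here are made up for the example), a compile-time-sized queue and a runtime-sized one might be declared as follows:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;

/// Compile-time sizes (needs const generics, stable since Rust 1.51): the
/// compiler can specialize the index arithmetic, but the sizes cannot be
/// taken from a config file at runtime. Illustrative only.
struct ConstSizedQueue<T, const NUM_BLOCKS: usize, const BLOCK_SIZE: usize> {
    blocks: [[UnsafeCell<MaybeUninit<T>>; BLOCK_SIZE]; NUM_BLOCKS],
}

/// Runtime sizes: configurable at startup (e.g. from a config file), at the
/// cost of heap-allocated storage and dynamic index arithmetic.
struct DynSizedQueue<T> {
    block_size: usize,
    blocks: Vec<Box<[UnsafeCell<MaybeUninit<T>>]>>,
}
```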