
Rust Async Benchmark

This project compares the performance of a manually written event loop vs async/await, in Rust.

To be clear, this is not a comparison of thread pools vs coroutines. It's also not a comparison of async runtimes. It's a comparison of single-threaded manually written event loop code vs single-threaded async/await code, to determine the minimum overhead of adopting async in a Rust application.

Goals

- Determine the minimum overhead of adopting async/await in Rust, relative to a manually written event loop.
- Identify which async costs are unavoidable and which are optional choices, such as boxing tasks or the waker strategy.

How it works

In order to simulate a somewhat realistic application style, the benchmarks implement a fake network server that responds to requests. The fake server accepts "connections", which are bidirectional streams of bytes. For each connection, it reads a line of text as a request, then it writes a line of text as a response. Multiple variations of this fake server are implemented. One variation uses a manually written event loop and the rest use async.
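
As a simplified, synchronous sketch of that request/response behavior (the project's real streams are non-blocking and come in manual and async variants, and the type below is illustrative rather than the project's FakeStream), the per-connection logic amounts to:

```rust
use std::collections::VecDeque;

// A minimal in-memory stand-in for a connection: one buffer per direction.
struct Connection {
    incoming: VecDeque<u8>,
    outgoing: Vec<u8>,
}

impl Connection {
    // Read one line of text (the "request"), consuming the newline.
    fn read_line(&mut self) -> Vec<u8> {
        let mut line = Vec::new();
        while let Some(b) = self.incoming.pop_front() {
            if b == b'\n' {
                break;
            }
            line.push(b);
        }
        line
    }

    // Write one line of text (the "response").
    fn write_line(&mut self, line: &[u8]) {
        self.outgoing.extend_from_slice(line);
        self.outgoing.push(b'\n');
    }
}

fn main() {
    let mut conn = Connection {
        incoming: b"hello\n".iter().copied().collect(),
        outgoing: Vec::new(),
    };

    let req = conn.read_line();
    conn.write_line(&req); // echo the request as the response

    assert_eq!(conn.outgoing, b"hello\n");
}
```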

Each async benchmark uses one of the project's executors, such as the allocation-free ArgExecutor discussed in the analysis below, or a boxing executor that heap-allocates each spawned task.

The benchmarks are: manual, nonbox, callerbox, large+nonbox, box, box+callerbox, large+box, box+rc, box+chkrc, and box+arc (timed results are shown below).
Additionally, there are variations of these benchmarks that make I/O syscalls, suffixed with +syscalls.

Each benchmark performs 256 request/response transactions.

I/O counts

To count I/O calls for the manual event loop and minimal async executor, run cargo run:

$ cargo run
    Finished dev [unoptimized + debuginfo] target(s) in 0.03s
     Running `target/debug/rust-async-bench`
manual: register=257 unregister=257 poll=258 accept=512 read=512 write=512
async: register=257 unregister=257 poll=258 accept=512 read=512 write=512

(Note: since this only counts operations and is not a speed test, the build mode doesn't matter.)

Running cargo test will verify these expected counts for all implementation variations.

Benchmarks

To measure the speed of the manual event loop vs. the various async implementations/configurations, run cargo bench.

Some results running on Linux:

manual                  time:   [26.262 µs 26.270 µs 26.279 µs]
nonbox                  time:   [88.471 µs 88.497 µs 88.529 µs]
callerbox               time:   [91.892 µs 91.907 µs 91.926 µs]
large+nonbox            time:   [189.18 µs 189.24 µs 189.31 µs]
box                     time:   [96.430 µs 96.447 µs 96.469 µs]
box+callerbox           time:   [98.313 µs 98.329 µs 98.347 µs]
large+box               time:   [212.57 µs 212.72 µs 212.92 µs]
box+rc                  time:   [98.380 µs 98.399 µs 98.422 µs]
box+chkrc               time:   [114.84 µs 114.86 µs 114.88 µs]
box+arc                 time:   [103.19 µs 103.21 µs 103.23 µs]
manual+syscalls         time:   [988.31 µs 988.57 µs 988.83 µs]
nonbox+syscalls         time:   [1.0816 ms 1.0821 ms 1.0825 ms]
box+syscalls            time:   [1.0965 ms 1.0967 ms 1.0970 ms]
box+rc+syscalls         time:   [1.0981 ms 1.0984 ms 1.0987 ms]
box+chkrc+syscalls      time:   [1.1174 ms 1.1179 ms 1.1185 ms]
box+arc+syscalls        time:   [1.1086 ms 1.1089 ms 1.1092 ms]

Analysis

Summary

Costs

All of the async benchmarks are slower than the manual benchmark, so async Rust has a measurable cost.

That said, async Rust does not require the use of any conventionally costly operations, such as heap allocations, atomics, mutexes, or thread-local storage, nor does it require introducing extra syscalls, buffer copies, or algorithms with worse big-O complexity, compared to what would be required by a manually written poll loop. Not even a hash map is required. At minimum, the required costs involve shuffling around bools, ints, and references, and using collections with fast O(1) operations such as arrays and linked lists. ArgExecutor demonstrates this.
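
To illustrate the point, here is a minimal sketch (not the project's ArgExecutor) showing that a future can be polled to completion with a waker built from plain function pointers: no heap allocations, atomics, mutexes, or thread-local storage involved.

```rust
use std::future::Future;
use std::pin::pin;
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A waker made of plain function pointers and a null data pointer.
unsafe fn vt_clone(_: *const ()) -> RawWaker {
    RawWaker::new(ptr::null(), &VTABLE)
}

unsafe fn vt_noop(_: *const ()) {}

static VTABLE: RawWakerVTable = RawWakerVTable::new(vt_clone, vt_noop, vt_noop, vt_noop);

fn main() {
    // SAFETY: the vtable functions never dereference the data pointer.
    let waker = unsafe { Waker::from_raw(RawWaker::new(ptr::null(), &VTABLE)) };
    let mut cx = Context::from_waker(&waker);

    // Poll a trivial future to completion; nothing here touches the heap.
    let mut fut = pin!(async { 1 + 1 });
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(2));
}
```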

Boxing tasks has an additional cost (see nonbox vs. box and large+nonbox vs. large+box). This is not really surprising.

Embedding wakers in existing data structures instead of separately allocating them doesn't make a meaningful difference (see box vs. box+rc).

Arc-based wakers are slower than unsafe Rc-based wakers (see box+rc vs. box+arc). However, they are faster than checked Rc-based wakers (see box+chkrc vs. box+arc).

Relative costs

In a real application, task accounting should normally be a very small part of overall compute time. The benchmarks with syscalls help highlight this. These benchmarks make a non-blocking call to libc::read whenever there is an I/O operation. The read is against a pipe that never has data in it, and so the call always returns an error. Adding in these syscalls significantly reduces the time differences between the benchmarks. For example, manual is 70% faster than nonbox, but manual+syscalls is only 8.6% faster than nonbox+syscalls (or reversed, nonbox+syscalls is 9.4% slower).

These no-op syscalls only scratch the surface in terms of what a real application might do. In a real application, I/O syscalls will likely do meaningful things (such as read non-zero amounts of data) and thus cost more. Real applications will also perform real application logic. The benchmarks test 256 requests. If an application were to spend even 10 microseconds per request doing meaningful work, the manual event loop would only be 2.5% faster.

Another way of looking at it: the difference between manual and nonbox is 62.2µs. Divided by 256, that's an overhead of around 243 nanoseconds for async execution per request. In a real app, that's practically free.

Boxing futures

Boxing futures has a small cost (nonbox+syscalls is 1.3% faster than box+syscalls). However, not boxing has drawbacks too: all futures take up the same amount of space, and spawning needs to be done indirectly to avoid circular references. Also, unless the space for all futures is allocated up front, allocations will need to be made at runtime anyway.
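
For illustration, here is a sketch of what boxing buys (the names are illustrative, not the project's): type erasure lets differently sized futures share one collection, at the price of a heap allocation per spawn.

```rust
use std::future::Future;
use std::pin::Pin;

// With boxing, every task has the same pointer-sized type, so futures of
// different shapes can share one queue. Each Box::pin heap-allocates.
type BoxFuture = Pin<Box<dyn Future<Output = ()>>>;

fn main() {
    let mut tasks: Vec<BoxFuture> = Vec::new();

    // A tiny future and a much larger one coexist in the same Vec.
    tasks.push(Box::pin(async {}));
    tasks.push(Box::pin(async {
        let buf = [0u8; 1024];
        std::future::ready(()).await;
        let _ = buf.len(); // keeps buf live across the await point
    }));

    // An executor would now poll each task until completion.
    println!("queued {} tasks", tasks.len());
}
```

Without boxing, every slot in such a collection would have to be as large as the biggest future.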

Given the small relative cost, boxing is well worth it for the flexibility and ergonomics it brings. A high performance network server running on a typical OS should box its futures.

The only time to avoid boxing might be in hard real-time applications, for example games with frame-rendering deadlines, or embedded systems without support for allocations.

Rc vs Arc wakers

In theory, Rc-based wakers have legitimate value, in that they are faster than Arc-based wakers (box+rc+syscalls is 0.9% faster than box+arc+syscalls) and can simply be dropped in where applicable (single-threaded executors). Unfortunately, it is currently not possible to use Rc-based wakers safely without sacrificing their performance gains.

Using unsafe wakers is probably not worth it for such small relative gains, and they would be hard to audit: waker construction could be marked unsafe, but waker proliferation and use would not be.

Arc-based wakers are faster than safe Rc-based wakers, at least when there is no contention on the wakers between multiple threads. And if you're using single-threaded executors, then there should be no contention. This means Arc-based wakers are the safest, fastest choice for either single-threaded or multi-threaded executors.
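
As a point of reference, the standard library's Wake trait yields an Arc-based waker in a few lines. This is a minimal sketch, not the project's implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Wake, Waker};

// An Arc-based waker: wake() just sets a flag an executor could check.
// Cloning the Waker is an atomic refcount bump, uncontended when the
// executor is single-threaded.
struct FlagWaker {
    woken: AtomicBool,
}

impl Wake for FlagWaker {
    fn wake(self: Arc<Self>) {
        self.woken.store(true, Ordering::Release);
    }
}

fn main() {
    let state = Arc::new(FlagWaker { woken: AtomicBool::new(false) });
    let waker = Waker::from(state.clone());

    waker.wake_by_ref();
    assert!(state.woken.load(Ordering::Acquire));
}
```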

Zero cost?

Is async Rust a "zero cost abstraction"?

Let's start by quoting Bjarne Stroustrup's definition: "What you don't use, you don't pay for. And further: What you do use, you couldn't hand code any better."

If you don't use async at all, then you don't pay anything for it. That much is true. However, the manual benchmark beats nonbox, suggesting that a non-blocking event loop can be hand-coded to run faster than its async/await equivalent.

Async Rust has different goals than that manually written code though, such as encapsulation (hiding everything behind a poll() method) and reactor/executor decoupling (wakers). If you've already bought into using the Future trait, then async generators may very well be zero cost. But if your hand written code wouldn't normally have used Future or something like it, then adopting the whole of async will have a cost. Speaking for myself, I've never implemented Future-like encapsulation or decoupling in my own event loop code. Probably this is because the code was always application-specific and not part of a composable framework.

The relative cost of async is low though, and it comes with the huge benefit of being able to write code that is much easier to reason about and to extend. Again speaking for myself, I'll admit I've made compromises in the control flows of my own event loop code, trading correctness for maintainability and comprehension. Async makes it practical to implement complex control flow correctly, in a form that is even easier to maintain and comprehend than the manual alternative.

Implementation details

In general, the code in this project is written in a mostly-safe, mostly-idiomatic manner. Everything is designed to be performant, by selecting good algorithms and avoiding conventionally costly operations. It may be possible to make things faster by merging reactor and executor logic, or by not using wakers, or by writing more unsafe code, but I felt I drew a reasonable line.

This project uses "fake" I/O objects that work in memory. The I/O primitives are FakeListener, FakeStream, and Poll, analogous to TcpListener, TcpStream, and Mio's Poll. There is no client side, and thus no client-side overhead when benchmarking.

There are two kinds of tasks to perform: accepting connections and processing connections. The manual benchmark is implemented as a poll loop with all tasks intermingled. The async benchmarks are implemented using an individual future instance for each task, with the futures executed concurrently.
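
Schematically, and with placeholder types rather than the project's, the manual style looks like this:

```rust
// Placeholder event type; the real code dispatches on readiness events
// delivered by the fake Poll object.
enum Event {
    Acceptable,
    Readable(usize), // connection id
    Writable(usize), // connection id
}

// Manual style: one loop, with accept and per-connection logic
// intermingled at a single dispatch point.
fn manual_loop(events: &[Event]) {
    for event in events {
        match event {
            Event::Acceptable => println!("accept and register a connection"),
            Event::Readable(id) => println!("advance read state of conn {id}"),
            Event::Writable(id) => println!("flush response of conn {id}"),
        }
    }
}

fn main() {
    manual_loop(&[Event::Acceptable, Event::Readable(0), Event::Writable(0)]);
}
```

In the async variants, the logic for each task instead reads top to bottom inside its own future, and the executor plays the role of the dispatch loop.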

All variations can be run with or without syscalls. When syscalls are enabled, libc::read is called on an empty pipe every time there would have been an I/O operation.
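
Assuming a Linux or other Unix-like OS and the libc crate as a dependency, that no-op syscall can be reproduced like this:

```rust
// A non-blocking read on an empty pipe always fails with EAGAIN,
// costing one syscall and doing no other work.
fn main() {
    unsafe {
        let mut fds = [0i32; 2];
        assert_eq!(libc::pipe(fds.as_mut_ptr()), 0);

        // Mark the read end non-blocking.
        let flags = libc::fcntl(fds[0], libc::F_GETFL);
        libc::fcntl(fds[0], libc::F_SETFL, flags | libc::O_NONBLOCK);

        // The pipe is empty, so the read returns -1 with errno EAGAIN.
        let mut buf = [0u8; 1];
        let ret = libc::read(fds[0], buf.as_mut_ptr() as *mut libc::c_void, 1);
        assert_eq!(ret, -1);
        assert_eq!(
            std::io::Error::last_os_error().raw_os_error(),
            Some(libc::EAGAIN)
        );
    }
}
```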

It is relatively straightforward to write a single-threaded poll loop server that doesn't use heap allocations or synchronization primitives. Doing the same with async/await, and doing it without making extra syscalls, is a bit trickier. The following techniques are used: