
Async functions and blocking I/O don't mix #58

Closed · ericseppanen closed this issue 3 years ago

ericseppanen commented 3 years ago

The pageserver code seems to freely mix async functions and blocking I/O. I first noticed this in the rocksdb change (#54), but I don't have to look far to find other examples: restore_relfile and write_wal_file (#43) both do blocking I/O and are called from async functions.

I think this is a big problem.

Mixing blocking and async code together just doesn't work. At its best, we're wasting effort in building async code, only to have the code unable to achieve any concurrency because there is I/O blocking the thread. At worst, the combined binary will have confusing stalls, deadlocks, and performance bugs that are hard to debug.

To make this code async-friendly, we would either need to push the I/O work to standalone threads, or use the async equivalents.

Async file I/O is not too hard: just swap in tokio::fs for std::fs and start propagating .await and async fn as far as needed. But something like rocksdb is a lot harder: as far as I can tell, there's no such thing as an async rocksdb.
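
As a rough sketch of that swap (the function names here are made up, not actual pageserver code):

```rust
use std::path::Path;

// Blocking version: fine on a dedicated thread, but it stalls the executor
// if called from inside an async task.
fn read_relfile_blocking(path: &Path) -> std::io::Result<Vec<u8>> {
    std::fs::read(path)
}

// Async version: swap std::fs for tokio::fs and propagate async/.await
// up the call chain.
async fn read_relfile_async(path: &Path) -> std::io::Result<Vec<u8>> {
    tokio::fs::read(path).await
}
```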

Another alternative would be to give up on the async model, and just do everything with explicit threads.

hlinnaka commented 3 years ago

> The pageserver code seems to freely mix async functions and blocking I/O. I first noticed this in the rocksdb change (#54), but I don't have to look far to find other examples: restore_relfile and write_wal_file (#43) both do blocking I/O and are called from async functions.

Hmm. restore_relfile is called via restore_timeline, which is called from get_or_restore_pagecache, which calls it while holding the Mutex of the PageCache entry. So I think that's rightly in blocking style; we use blocking style while holding the Mutex. The elephant in the room is that we shouldn't do very long-running operations like that while holding the Mutex, but that's another issue. get_page_at_lsn also waits for the WAL redo thread, by waiting on walredo_condvar, which is bad.

write_wal_file is an interesting case. We currently spawn a separate thread for each WAL receiver, and then create a single-threaded tokio runtime in that thread. My mental model has been that each WAL receiver is a single thread, written in blocking style, but it has to use tokio for interacting with the replication protocol. IIRC there are no synchronous versions of the functions in rust-postgres for opening a replication connection.

> I think this is a big problem.
>
> Mixing blocking and async code together just doesn't work. At its best, we're wasting effort in building async code, only to have the code unable to achieve any concurrency because there is I/O blocking the thread. At worst, the combined binary will have confusing stalls, deadlocks, and performance bugs that are hard to debug.

I agree with the general sentiment.

> To make this code async-friendly, we would either need to push the I/O work to standalone threads, or use the async equivalents.
>
> Async file I/O is not too hard: just swap in tokio::fs for std::fs and start propagating .await and async fn as far as needed. But something like rocksdb is a lot harder: as far as I can tell, there's no such thing as an async rocksdb.

Yeah, there are some operations that either simply don't exist in both styles, or are much more natural to do in blocking or non-blocking style. Another example is the code in walredo.rs. It uses async to launch a process and simultaneously write data to its stdin, and read from its stdout and stderr. That's complicated to do without async; I think you'd need to launch helper threads to spool the input/output.
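
Roughly, the async version of that pattern looks like this with tokio::process (the command and payload below are placeholders, not the real 'postgres --wal-redo' invocation):

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::process::Command;

// Spawn a child process and drive its stdin and stdout concurrently from a
// single task, so neither pipe can fill up and deadlock the child.
async fn run_filter(input: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut child = Command::new("cat") // placeholder for 'postgres --wal-redo'
        .stdin(std::process::Stdio::piped())
        .stdout(std::process::Stdio::piped())
        .spawn()?;

    let mut stdin = child.stdin.take().expect("stdin is piped");
    let mut stdout = child.stdout.take().expect("stdout is piped");

    let mut output = Vec::new();
    // Write and read at the same time; each side yields instead of blocking.
    let (write_res, read_res) = tokio::join!(
        async move {
            stdin.write_all(input).await?;
            drop(stdin); // close stdin so the child sees EOF
            Ok::<_, std::io::Error>(())
        },
        stdout.read_to_end(&mut output),
    );
    write_res?;
    read_res?;
    let _status = child.wait().await?;
    Ok(output)
}
```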

I think it's inevitable that we will have both sync and async code, but we need to have some kind of principle on where to use which, what the rules are in each component, and be careful at the boundaries.

- page_cache.rs: I'd prefer threaded model. I'm envisioning the Page Cache as a highly concurrent in-memory data structure, where all the operations are short. Lock-free data structures, atomic ops, mutexes and such are appropriate there.
- page_service.rs: This handles incoming connections and serves the GetPage@LSN requests. There could be thousands of connections, many of which are idle, so I think the async model is a good fit here.
- walreceiver.rs: Currently, it's a separate thread for each receiver. Could change it to use async style, with one tokio Runtime to serve all receivers.
- walredo.rs: Currently, one thread per timeline. Uses async to launch the 'postgres --wal-redo' process, but is otherwise in threaded style.

Things get hairy at the boundaries. Like when a get_page_at_lsn call needs to block because the requested page is not present in the page cache.

> Another alternative would be to give up on the async model, and just do everything with explicit threads.

Yeah. I would like that, I find explicit threads much easier to wrap my head around. And it's easier to debug with 'gdb'. However, I'm sure there will be exceptions where async would still be needed or preferred.

I think most of the trouble is coming from mixing the async code in page_service.rs with the blocking code in page_cache.rs. We could change page_service.rs to not use async, and launch a separate thread for each connection. That's not good if you have thousands of connections, but maybe it's premature to worry about that. Use threads for now, and switch to async later if we need to. What do you think @kelvich?
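
A thread-per-connection page_service would look roughly like this (the address and the echo loop are placeholders for the real GetPage@LSN handling):

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

// Blocking, thread-per-connection server: simple to reason about and easy
// to debug with gdb, at the cost of one OS thread per client.
fn serve() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:6400")?; // placeholder address
    for stream in listener.incoming() {
        let mut stream = stream?;
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            // Echo loop stands in for parsing and answering page requests.
            while let Ok(n) = stream.read(&mut buf) {
                if n == 0 {
                    break;
                }
                if stream.write_all(&buf[..n]).is_err() {
                    break;
                }
            }
        });
    }
    Ok(())
}
```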

ericseppanen commented 3 years ago

> Hmm. restore_relfile is called via restore_timeline, which is called from get_or_restore_pagecache, which calls it while holding the Mutex of the PageCache entry. So I think that's rightly in blocking style; we use blocking style while holding the Mutex.

Hmm, that sounds wrong to me. Maybe I'm just misunderstanding you...

Doing I/O under a Mutex is even worse: it blocks this thread while doing I/O and may block other threads waiting for the lock.*

In general, anything that gets called from an async fn should do no blocking operations at all. This includes all file I/O, network I/O, and taking pthread-based locks (std::sync::Mutex, RwLock, etc.). System calls that may suspend the thread should not be used. This is why you find entire libraries rewritten for async support, and why tokio has fs and net modules that are re-implementations of the std library. And there is tokio::sync::Mutex and friends as well (which are async, despite copying the module name from std::sync).

* A problem if those other threads are also operated by the async executor, like if you're using the tokio multithreaded runtime.
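
For illustration, this is what the tokio-flavored version of that pattern looks like (the Entry type is made up, not the real PageCache structure):

```rust
use std::sync::Arc;

// A made-up cache entry type, just to illustrate the locking pattern.
struct Entry {
    bytes: Vec<u8>,
}

// Holding a std::sync::Mutex across blocking file I/O inside an async fn stalls
// the executor thread and every task waiting for the lock. The tokio equivalents
// keep the executor free: the lock is awaited, and the I/O yields instead of blocking.
async fn fill_entry(
    entry: Arc<tokio::sync::Mutex<Entry>>,
    path: &std::path::Path,
) -> std::io::Result<()> {
    let mut guard = entry.lock().await;          // async lock: yields while waiting
    guard.bytes = tokio::fs::read(path).await?;  // async I/O: yields instead of blocking
    Ok(())
}
```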

ericseppanen commented 3 years ago

Some related discussions that might be worth reading:

- The Golden Rule - Don't Block the Event Loop
- Detect and prevent blocking functions in async code
- Warning when calling a blocking function in an async context
- Rust async is colored, and that's not a big deal

Tools to help prevent blocking in an async context don't seem to exist yet, though the rust async-foundations working group seems interested in improving that. Rust's async only became stable a bit over a year ago and Tokio 1.0 only arrived in December, so it's still early days for a lot of projects and users.

ericseppanen commented 3 years ago

I did a quick survey of messaging channels, to see if any of them can interoperate between sync and async code. There are quite a few differences...

std::sync::mpsc: Contains "synchronous" and "asynchronous" versions. "asynchronous" here actually means unbounded buffering, so the sender will never block; "synchronous" means bounded size, so the sender may block. Receivers may block for both flavors. There is a try_recv function that fails immediately rather than blocking, but that doesn't help very much. No async/await support.

tokio::sync::mpsc: Has both blocking and async functions for both send and receive. The tokio documentation says it's OK to use this to communicate across the sync/async boundary. Unfortunately the tokio oneshot and broadcast channels are missing the blocking receive functionality.

async_std::channel: Has only async functions.

crossbeam::channel: Has a blocking send/recv and a non-blocking try_send/try_recv. No async versions.

Conclusion: if you need to do messaging between sync and async code, use tokio::sync::mpsc.
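
A minimal sketch of that bridge (channel capacity and payload type are arbitrary): a plain OS thread sends with blocking_send, while an async task awaits recv:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Bounded channel; both ends have blocking and async methods, so it can
    // bridge a plain thread and a tokio task.
    let (tx, mut rx) = mpsc::channel::<u64>(16);

    // A plain OS thread (think: a WAL receiver in blocking style) sends
    // without touching the runtime.
    let producer = std::thread::spawn(move || {
        for lsn in 0..3 {
            tx.blocking_send(lsn).expect("receiver dropped");
        }
    });

    // The async side receives with .await.
    while let Some(lsn) = rx.recv().await {
        println!("got lsn {lsn}");
    }
    producer.join().unwrap();
}
```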

hlinnaka commented 3 years ago

In https://github.com/zenithdb/zenith/pull/60, which I just created, I refactored the communication between page_cache.rs and walredo.rs to use tokio::sync::mpsc and tokio::sync::oneshot. It used crossbeam before. That moved the code slightly in the async direction, but that was just coincidental, not really the point of the PR. That said, it did fix one case of blocking in supposedly async code: previously, get_page_at_lsn() would block waiting on a condition variable, which was bad because it's an async function, but now it uses tokio::sync::oneshot for the waiting.
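
Stripped to its essentials, the waiting pattern looks like this (the request/reply types are illustrative, not the actual types from the PR):

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative request type: the caller attaches a oneshot sender and the
// WAL-redo side replies on it when the page has been reconstructed.
struct RedoRequest {
    page_no: u32,
    reply: oneshot::Sender<Vec<u8>>,
}

async fn get_page_sketch(redo_tx: &mpsc::Sender<RedoRequest>, page_no: u32) -> Option<Vec<u8>> {
    let (reply_tx, reply_rx) = oneshot::channel();
    redo_tx
        .send(RedoRequest { page_no, reply: reply_tx })
        .await
        .ok()?;
    // Awaiting the oneshot yields to the executor instead of blocking on a condvar.
    reply_rx.await.ok()
}
```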

knizhnik commented 3 years ago

Let me share my point of view. First of all, I have to say that I am a novice in Rust, and this is actually the first time I am writing code using async/await. But I am certainly familiar with the idea of cooperative multitasking, and I even implemented my own portable cooperative multitasking library for C++ based on setjmp/longjmp about 30 years ago :)

I have always considered the async model present in most modern programming languages to be something of a step backward. The problem it solves is clear: native threads are too "heavy", switching between them requires a system call and adds too much overhead. Actually, the performance of many systems is now limited by the number of syscalls the server can perform (and it is not so large, about ~100k/sec). But I do not understand why the designers of OSes and programming languages put this issue on application developers. 30 years ago there was Sun/Solaris, which had hybrid threads: you just specified how many "real" (OS) threads you needed and ran any number of lightweight threads on top of them using the standard pthreads API. As far as I know, the same approach is used for Go threads. And the actively developed Project Loom (lightweight threads) for Java also makes me think that I am not the only one who considers the async "fashion" to be ending. The main drawbacks of the async model (IMHO) are the trouble with debugging (even with all the modern IDEs like IntelliJ) and the necessity of writing some "redundant" constructions (async, await). And what are the advantages compared with lightweight threads? Maybe a slightly smaller memory footprint per task, but I am not sure...

But this is just "philosophical" talk. Let's return to our Zenith problems, and concentrate concretely on the pageserver. In the case of Rust, we have to use Tokio if our server has to serve thousands of connections; handling each connection in a separate thread is too costly. If the page server only receives the WAL stream from computation master nodes or wal_acceptors, then the number of connections is not expected to be large. Even in a multi-tenancy model, it is unlikely that one page server will serve thousands of different databases. Most likely there will be 10-100 databases per pageserver, and normal (OS) threads will work perfectly in this case.

But unfortunately the pageserver is not only receiving WAL. Its main goal is to serve pages to computation nodes, and the number of such connections is proportional to the number of backends (i.e. users), so it can be very large (thousands). Looks like we cannot live without async in this case? I am not sure... Maintaining a large number of connections between client backends and the pageserver is a bad thing in any case, no matter how they are handled by the pageserver. We could multiplex them on the computation node side and have only one connection from each computation node to the page server.

As far as I understand, the main motivation for using tokio in the pageserver was that the Postgres Rust API also uses the async model. Certainly it is better to use existing components rather than reinvent the wheel, but I am not sure that the price is justified in this case...

From my point of view, mixing the async model with threads is a very bad and error-prone idea (although Stas doesn't think so). It is like mixing signals and threads in Unix applications. I would prefer that our code use either native threads or tokio tasks (with a pool of OS threads). But I am not sure that is possible.

My current experiments with Zenith performance make me think that the pageserver will not be a bottleneck. As with vanilla Postgres, the speed of a typical OLTP workload will be limited by fsyncs, i.e. the wal_acceptors. Right now Zenith with wal_acceptors shows almost the same OLTP speed as standalone Postgres with fsync=on. And the speed of Zenith with page servers directly connected to the master (w/o wal_acceptors) is several times faster, even with the current very inefficient redo approach (a pipe to the wal-redo-postgres process).

Postgres is able to flush the changes of multiple transactions with one fsync, which allows better throughput when there are many concurrent transactions. Another way to improve performance is to enable asynchronous commit. In this case we can lose some of the last transactions (already confirmed to the users), but speed is significantly increased. Unfortunately, we cannot use asynchronous commit in Zenith, because with synchronous replication a transaction may be blocked for a long time, so performance with a small number of clients will be awful. And if there is a large number of active clients, then... transaction WAL writes will be efficiently combined even without async commit... Another interesting question is whether we could offer an asynchronous option if the user is ready to lose the last N transactions...

So from my point of view, the key to improving Zenith performance will be to somehow increase WAL writing speed, either by sharding the WAL (in which case WAL writing would be performed in parallel) or by more efficient grouping of flushed WAL records. But that seems too far from our async discussion...

Concerning the advantages of the AIO model, especially the new Linux io_uring API: I do not have an opinion or experience with it yet. There have been some attempts to add AIO support to Postgres, but as far as I know without any essential success. I haven't heard of benchmarks demonstrating advantages of AIO on a typical database workload. Moreover, right now most databases fit in memory, so IO speed mostly matters for writing WAL (which may not be true for cloud databases). Also, modern storage devices (like PMEM) are no longer block oriented, so the classical IO model based on buffers/pages seems obsolete and inefficient for such hardware. In any case, I think it is too early to discuss such aspects (AIO vs. sync IO) at this stage of the project.

hlinnaka commented 3 years ago

So, we have basically two options:

1. Switch to blocking model

Change all current code that uses async to blocking style, with the following exceptions:

2. Switch to async everywhere

Remove the current calls that create explicit threads, replacing them with tokio::spawn instead. Have just one Tokio runtime. Be careful to not use blocking calls anywhere, or if you must, use spawn_blocking().
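
Under this option, a blocking call would be wrapped roughly like this (std::fs::read stands in for a rocksdb lookup or any other call with no async equivalent):

```rust
use std::path::PathBuf;

// Any call without an async equivalent gets pushed onto tokio's blocking
// thread pool, so it never stalls the async executor threads.
async fn read_blocking_resource(path: PathBuf) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || {
        // std::fs::read stands in for a rocksdb lookup or similar blocking call.
        std::fs::read(&path)
    })
    .await
    .expect("blocking task panicked")
}
```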

My opinion

In general, I like blocking style more. Async is just harder. It's harder to debug, and you have to be careful to not call functions that block, or you get weird deadlock or starvation issues. You can run into deadlock issues with explicit threads, too, but it's easier to reason about.

We might need to use async for performance reasons, in two places in particular:

1. In page_service, to serve a large number of connections. Having one thread for each connection will eat a lot of resources, if most of them just sit idle.
2. If we have a lot of I/Os in flight.

However, we're nowhere near the scale where that matters. Right now, I think explicit threads would be better. Maybe we'll need to optimize later, if/when those things become a bottleneck. How hard is it to change the model, or introduce async to the performance-critical parts later? Would it be easier to do everything in async fashion from the start, even if it means harder-to-debug code and slows us down somewhat, or use explicit threads now and rewrite later if required? I'm not sure. My gut feeling is to use explicit threads, but I don't feel strongly about it.

hlinnaka commented 3 years ago

> I have always considered the async model present in most modern programming languages to be something of a step backward.

I agree with that sentiment. Async puts the burden on the application developer. If threads were free, no one would use async. (Well, maybe there are exceptions where it really is a nicer model, like in walredo.rs).

> My current experiments with Zenith performance make me think that the pageserver will not be a bottleneck. As with vanilla Postgres, the speed of a typical OLTP workload will be limited by fsyncs, i.e. the wal_acceptors. Right now Zenith with wal_acceptors shows almost the same OLTP speed as standalone Postgres with fsync=on. And the speed of Zenith with page servers directly connected to the master (w/o wal_acceptors) is several times faster, even with the current very inefficient redo approach (a pipe to the wal-redo-postgres process).

In a multi-tenant environment, we will keep putting load on each page server until it becomes the bottleneck. If a page server has free capacity, we will scale down and merge it with another page server, so that all page servers are busy again (with some margin for sudden spikes). So it's not really a question of whether the page server is a bottleneck, but rather how many page servers we will need to deploy.

(Although it's possible that the page server software is efficient enough that it's bottlenecked by the page server's I/O hardware.)

hlinnaka commented 3 years ago

This was addressed by commit 3b9e7fc5e6715a694b27e059ee7eedd1f6f44735. I'm sure we will bump into async vs sync issues again in the future, and perhaps we should also change walkeeper to use explicit threads for consistency. But I consider this fixed in the pageserver for now. If there are any further issues related to this, let's open new tickets for them.