tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

io_uring #2411

Open Darksonn opened 4 years ago

Darksonn commented 4 years ago

The purpose of this issue is to collect technical notes on how io_uring can be implemented in Tokio in the future.

Parent issue: #2692

Tech Notes

Adding items to the queue. Submission is done by appending requests to a ring queue. Every request includes a u64 field called userdata, which is used to identify which IO operation just completed. I don't know if it can be avoided by being clever, but the io_uring crate has a mutex around pushing to this queue. The queue length must be a power of two. Atomics are needed in any case to synchronize with the kernel updating the indexes in the ring buffer.

Receiving items. The submitted requests complete in arbitrary order, and the kernel writes the completions to a ring buffer with twice as many slots as the submission queue. The userdata u64 is returned with the completion.

Submitting items. The kernel does not independently decide to go look in the queue for any new submissions (unless you turn on a root-only flag). There is a syscall called io_uring_enter, which allows you to make the kernel notice all newly submitted requests. You can optionally ask it to block until some number of completions have been written into the completion queue. Putting a timeout on this blocking syscall is done by submitting a timeout IO operation to the queue.
Note that the API doesn't mind if you have more in-flight operations than the number of slots in either queue.
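
Putting the notes above together, here is a minimal sketch of the submit/complete cycle using the io_uring crate (the API shown is as of recent crate versions and may differ across releases, so treat it as illustrative):

use io_uring::{opcode, types, IoUring};
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(8)?; // queue length must be a power of two
    let file = std::fs::File::open("/etc/hostname")?;
    let mut buf = vec![0u8; 1024];

    // Build a read request tagged with a userdata value of our choosing.
    let read_e = opcode::Read::new(
        types::Fd(file.as_raw_fd()),
        buf.as_mut_ptr(),
        buf.len() as u32,
    )
    .build()
    .user_data(0x42);

    // Safety: buf and file must stay alive until the completion is reaped.
    unsafe { ring.submission().push(&read_e).expect("queue full") };

    // io_uring_enter: make the kernel notice the submission and wait for
    // one entry to appear in the completion queue.
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion");
    assert_eq!(cqe.user_data(), 0x42); // matched back up via userdata
    println!("read {} bytes", cqe.result());
    Ok(())
}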

Waking up. To wake up a thread currently blocking on io_uring_enter, you can simply submit a no-op IO operation.

Submitting while another thread is sleeping. Just because another thread is currently blocking on io_uring_enter doesn't mean the kernel will detect new submissions automatically. You have to call io_uring_enter again to do that. (you can choose not to block)

Linking requests. Links are created by setting a certain flag, which links the submission to the next operation in the queue. This means that all such chains must be submitted in one contiguous chunk (and they can't be trees, I guess), which throws somewhat of a wrench into a discussion earlier in this thread.
Note that any operations can be linked. This means you can do stuff like "read into this buffer from tcp, then write the same buffer to this other tcp stream"
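
A hedged sketch of that read-then-write chain, using the io_uring crate's IO_LINK flag. The file descriptors in_fd/out_fd are assumed to be existing sockets, and a real implementation would also have to handle short reads, since the linked write is submitted with the full length:

use io_uring::{opcode, squeue::Flags, types, IoUring};

fn forward_once(ring: &mut IoUring, in_fd: i32, out_fd: i32) -> std::io::Result<()> {
    let mut buf = vec![0u8; 4096];
    let recv = opcode::Recv::new(types::Fd(in_fd), buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .flags(Flags::IO_LINK) // link this entry to the next one in the queue
        .user_data(1);
    let send = opcode::Send::new(types::Fd(out_fd), buf.as_ptr(), buf.len() as u32)
        .build()
        .user_data(2);
    // Linked chains must be pushed contiguously, as one chunk.
    unsafe {
        let mut sq = ring.submission();
        sq.push(&recv).expect("queue full");
        sq.push(&send).expect("queue full");
    }
    ring.submit_and_wait(2)?; // wait for both completions before buf drops
    Ok(())
}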

Multiple rings. See this comment.

Trait ideas

trait AsyncRead {
    /// Submit a buffer to be filled
    fn submit_buffer(&mut self, buffer: BytesMut);

    /// Receive the next buffer filled with data received from the stream
    fn poll_read(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<Option<BytesMut>>>;
}

trait AsyncWrite {
    /// Submit a buffer to be written to the underlying stream
    ///
    /// How do you get the buffer back?
    fn submit_write(&mut self, buffer: Bytes);

    /// Check if all submitted writes have been "completed"
    fn poll_flush(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<()>>;
}

Open questions

async fn read_offset(&mut self, buf: &mut [u8], off: usize) -> io::Result<usize> {
    self.read(&mut buf[off..]).await
}
Ralith commented 4 years ago

How to conditionally compile? Only new kernels support it.

Conditional compilation probably isn't useful here, unless you expect downstream crates to opt into producing a binary that will only start on very new systems. The usual approach here is to make syscalls manually (rather than relying on a bleeding-edge glibc being linked), which avoids a link/load-time dependency, and check the kernel version/feature support explicitly before doing so.
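
A minimal sketch of that runtime check, assuming the libc crate on Linux: probe io_uring_setup directly via syscall(2), so no bleeding-edge glibc wrapper is needed at link time, and treat ENOSYS as "not supported":

#[cfg(target_os = "linux")]
fn kernel_supports_io_uring() -> bool {
    // struct io_uring_params is 120 bytes of plain old data; a zeroed value
    // is a valid "no special flags" request.
    let mut params = [0u8; 120];
    let fd = unsafe {
        libc::syscall(libc::SYS_io_uring_setup, 1u32, params.as_mut_ptr())
    };
    if fd >= 0 {
        unsafe { libc::close(fd as libc::c_int) };
        return true;
    }
    // ENOSYS means the kernel predates io_uring; any other error still means
    // the syscall itself exists.
    std::io::Error::last_os_error().raw_os_error() != Some(libc::ENOSYS)
}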

Matthias247 commented 4 years ago

I think before diving into the depths of all the implementation questions, it might be a worthwhile exercise to define some goals and expectations around this effort. The goal shouldn't be "Let's support io_uring, because we heard it's cool" 😀

Here are some ideas:

Performance goals

Reduce the overhead for file system operations

Currently FS operations follow one of 2 strategies:

  1. They block the current thread and migrate the runtime to another thread. The downside of this approach is that it doesn't work for a purely single-threaded runtime, and blocking a runtime thread also leads to other issues, like problems with shutdown and structured concurrency.
  2. They queue the IO operation on a threadpool. Transferring the operation to a threadpool, executing it there, and yielding the result back to the executor thread incurs a certain amount of overhead. In addition, we need one threadpool thread per pending IO operation. It would be great if we could avoid this overhead.

Since io_uring allows for async FS operations, it could avoid both issues. FS operations would be enqueued via io_uring's submit queue and get processed there asynchronously. When an operation completes, the executor resumes the task that issued it. If the completion queue drives the executor thread we don't need any thread switch. If the completion queue runs on an extra thread there would be one thread switch - instead of 2 with the threadpool solution.

Reduce the number of system calls

System calls get more and more expensive due to security mitigations (Spectre, Meltdown). Therefore it is desirable to minimize the number of system calls. With the current IO primitives we need at least one system call per IO operation (e.g. a read() or write()), plus extra calls if the IO primitive is not ready. io_uring should provide the ability to avoid these system calls, since their equivalents just get enqueued on the submission queue. The system call which polls the completion queue is shared between all IO operations.
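
To make the amortization concrete, here is a sketch with the io_uring crate (again, illustrative): pushing onto the submission ring is a plain memory write, and a single io_uring_enter (via submit_and_wait) flushes and waits for the whole batch:

use io_uring::{opcode, IoUring};

fn main() -> std::io::Result<()> {
    let mut ring = IoUring::new(64)?;
    let n = 32u64;
    {
        let mut sq = ring.submission();
        for i in 0..n {
            // No syscall happens here; we only append to the shared ring.
            let nop = opcode::Nop::new().build().user_data(i);
            unsafe { sq.push(&nop).expect("submission queue full") };
        }
    } // dropping `sq` publishes the new tail to the kernel

    // One syscall submits all 32 operations and waits for all completions.
    ring.submit_and_wait(n as usize)?;
    assert_eq!(ring.completion().count(), n as usize);
    Ok(())
}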

Integration goals

Should be usable for "common" Tokio applications

I think whatever support is added, it should be in a form which makes it usable for common operations that are built on top of tokio - and should avoid requiring them to restructure their whole application logic around a certain paradigm.

E.g. currently we have applications built around AsyncRead/AsyncWrite, as well as their owned Stream<Bytes> and Sink<Bytes> cousins. Adding a third kind of interface doesn't sound very desirable. If there has to be one, it should at least supersede one of the existing interfaces.

One example application that many people might build on top of tokio is a webserver pipeline, which might look like:

Socket <--> TLS <--> HTTP (/1.1 or /2) <--> Content-Encoding (e.g. gzip) <--> File

We should figure out how completion-based IO operations can speed up such a pipeline without having to rewrite the whole application around them. E.g. I don't think it's likely we will get a fully IO-completion-based TLS stack - certainly not short term and probably not even medium term. It will have its internal buffers, and might still do best by exposing itself as an AsyncRead/Write interface to the next stage in the pipeline. However, it might use certain buffers underneath that we could write to and read from the socket in a completion-based fashion - e.g. using buffered readers/writers in front of the TLS stack.

This should, however, not be the only application model we care about. The approach should also make sense for an RPC stack (e.g. Thrift), a messaging system, etc.
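
To make the buffered-adapter idea concrete, here is a hypothetical sketch (all names invented, assuming tokio 1.x traits): a small type that owns its read buffer, pulls filled buffers from a completion-based source, and exposes plain AsyncRead to whatever sits above it, e.g. a TLS stack:

use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use bytes::BytesMut;
use tokio::io::{AsyncRead, ReadBuf};

/// Hypothetical completion-based source: yields filled, owned buffers.
trait CompletionRead {
    fn poll_next_buf(&mut self, cx: &mut Context<'_>) -> Poll<io::Result<BytesMut>>;
}

struct BufferedAdapter<S> {
    source: S,
    pending: BytesMut, // received bytes not yet handed to the consumer
}

impl<S: CompletionRead + Unpin> AsyncRead for BufferedAdapter<S> {
    fn poll_read(
        mut self: Pin<&mut Self>,
        cx: &mut Context<'_>,
        buf: &mut ReadBuf<'_>,
    ) -> Poll<io::Result<()>> {
        let this = &mut *self;
        if this.pending.is_empty() {
            // Refill from the io_uring-backed source; an empty buffer = EOF.
            this.pending = match this.source.poll_next_buf(cx) {
                Poll::Ready(Ok(b)) => b,
                Poll::Ready(Err(e)) => return Poll::Ready(Err(e)),
                Poll::Pending => return Poll::Pending,
            };
        }
        let n = this.pending.len().min(buf.remaining());
        buf.put_slice(&this.pending.split_to(n));
        Poll::Ready(Ok(()))
    }
}

The copy out of `pending` is the price of keeping the poll-based interface; the completion-based machinery stays confined below the buffer boundary.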

Should not enforce an allocation model

I would heavily prefer that we keep all buffer allocations a job of the application. Applications probably know best what to do with their buffers (get a new one for every operation, always reuse the same buffer, pool them, etc). Moving to an API which lets the IO primitives allocate buffers has a high potential to lead to buffer churn (continuous allocations/deallocations), to running out of buffers, or to over-allocation.

By keeping allocations a responsibility of the application we can avoid this.

This rules out APIs along the following lines, which produce buffers out of nowhere:

async fn read(&mut self) -> Result<Bytes, io::Error>
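
By contrast, an ownership-passing shape keeps allocation with the caller - the application supplies the buffer and gets it back together with the result. Something like the following (signature illustrative only):

async fn read(&mut self, buf: BytesMut) -> (io::Result<usize>, BytesMut);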

Should also work with other IO completion backends and shimmed implementations

Tokio should continue to work on all the platforms it currently works on, even if APIs are changed and new APIs are introduced. Even more so: it should not take a performance hit on the existing platforms. Lots of users will not be able to update to kernel 5.5+ in the short term, but they might have existing tokio code that should continue to run and needs to be further evolved.

Additionally, there exist platforms besides Linux which offer support for IO-completion-based operations. The most common one is Windows, which has provided IOCP support for decades. New APIs which support io_uring should also work on top of IOCP.

We can assume more operating systems will follow in the future.

Non goals

Expose all features of io_uring

I don't think it's a necessity for tokio to make use of all uring features. A lot of them might be niche features, which only make sense for a certain kind of application. They might require the whole application to be structured around the feature. Since we want to avoid building applications directly around uring, we should avoid using or exposing those.

As an example, I am not sure whether we would need to expose the support for linking operations to applications.

MikailBag commented 4 years ago

It seems to me the API could look like this:

impl tokio::runtime::Builder {
    fn with_buf_alloc<A: std::alloc::Alloc>(&mut self, a: A);
}
// ABox is like a usual Box, but allocated in A instead of `std::alloc::System`.
fn read(path: &Path) -> impl Future<Output = ABox<[u8]>>;
fn write(path: &Path, contents: ABox<[u8]>) -> impl Future<Output=()>;

Of course, it is quite complicated. On the other hand, I think the Alloc trait is in general really similar to this buffer-management problem.

Matthias247 commented 4 years ago

@MikailBag We can already achieve similar things using the Bytes/BytesMut types. Those can already wrap byte arrays which have been obtained through a variety of allocators.

And we could pass allocators like the following through the application:

trait BytesAllocator {
    // Allocate a buffer, incl. having the ability to signal out-of-memory conditions
    fn allocate(&self, min_size: usize) -> Option<BytesMut>;
}

We can even do "sleeping allocators" that wait until a buffer is free, by making the allocation function an async function (or adding a wait_for_buffer() -> Notify method to make it possible to wait before doing the allocation).
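
As a sketch of such a sleeping allocator (not a proposed API, names invented), one could gate allocation on a tokio semaphore so allocate() suspends instead of failing while all buffers are in use:

use std::sync::Arc;

use bytes::BytesMut;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

struct PooledBuf {
    buf: BytesMut,
    _permit: OwnedSemaphorePermit, // releases the pool slot when dropped
}

struct BufferPool {
    slots: Arc<Semaphore>,
}

impl BufferPool {
    fn new(max_buffers: usize) -> Self {
        BufferPool { slots: Arc::new(Semaphore::new(max_buffers)) }
    }

    /// Suspends the calling task until a pool slot is free.
    async fn allocate(&self, min_size: usize) -> PooledBuf {
        let permit = self.slots.clone().acquire_owned().await.expect("pool closed");
        PooledBuf { buf: BytesMut::with_capacity(min_size), _permit: permit }
    }
}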

Now Bytes might not be the perfect buffer type yet, and we might want to enhance it before using it extensively in new IO APIs. But I think its type-erased nature would be the right way to go, since it is a lot cleaner than adding custom box and allocator types everywhere.

mystor commented 4 years ago

It would be nice to support some form of generic type as the buffer argument, for situations where it might be desirable to share ownership of part of a more complex data structure.

For example, it would be nice to be able to write the uniquely-owned data field of a struct allocated in an Arc without copying the data out of the Arc to hand ownership to io_uring. If async_write_buf (or whatever the name ends up being) supported a generic type implementing bytes::Buf or a similar trait, it could be possible to write an adapter type to support this use-case:

struct Message {
    data: Box<[u8]>,
    // ... metadata, etc.
}

struct MessageDataRef {
    msg: Arc<Message>,
    offset: usize,
}
impl bytes::Buf for MessageDataRef {
    // (bytes 1.x trait methods, filled in for illustration)
    fn remaining(&self) -> usize { self.msg.data.len() - self.offset }
    fn chunk(&self) -> &[u8] { &self.msg.data[self.offset..] }
    fn advance(&mut self, cnt: usize) { self.offset += cnt; }
}

let msg: Arc<Message> = ...;
let _ = fd.async_write_buf(MessageDataRef {
    msg: msg.clone(),
    offset: 0,
}).await?;

I don't think a bytes::Bytes argument would handle this case directly without copying or changing the type of Message::data.

Matthias247 commented 4 years ago

bytes::Bytes will be able to handle this just fine as soon as the vtable-based constructor for Bytes is public (see https://github.com/tokio-rs/bytes/issues/310 for example).

mystor commented 4 years ago

Oh, nice! Nevermind then, sounds like bytes::Bytes would work perfectly :-)

kaimast commented 3 years ago

I'm curious if anyone is working on this? Would be great to have at some point.

Darksonn commented 3 years ago

We have indeed been doing some experiments on how to approach it.

Kestrer commented 3 years ago

For the record, the experiments have been made public here: https://github.com/tokio-rs/tokio-uring.
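
For reference, tokio-uring settled on exactly the ownership-passing style discussed above - the buffer moves into the call and is handed back with the result. Adapted from the tokio-uring README:

use tokio_uring::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    tokio_uring::start(async {
        let file = File::open("hello.txt").await?;
        let buf = vec![0u8; 4096];
        // `buf` is moved into the runtime for the duration of the operation
        // and returned together with the result.
        let (res, buf) = file.read_at(buf, 0).await;
        let n = res?;
        println!("read {} bytes: {:?}", n, &buf[..n]);
        Ok(())
    })
}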

ibotty commented 2 years ago

Is there a (possibly remote) plan to incorporate io_uring into tokio proper when the kernel is recent enough, so that applications can fall back to synchronous IO when it's not available?

Darksonn commented 2 years ago

Sure, we want to eventually add some sort of io_uring support, especially for files.

Noah-Kennedy commented 1 year ago

At this point, I think that proper async file IO is the least of what it can do. It's rapidly becoming an extremely powerful and flexible API for asynchronously doing any IO operation. Especially with zero-copy, multishot, a lot of general optimizations not possible with epoll, solid synergy with reactor-per-core, and kernel-managed buffer groups, it has become a truly incredible API for networking alone.

Frostie314159 commented 1 year ago

Theoretically, conditional compilation could be achieved through an environment variable, passed via a build script, which checks for the presence of the necessary io_uring functionality.
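
A sketch of that idea (illustrative only; note it probes the build machine's kernel, not the eventual host, which is one reason the runtime check discussed earlier is usually preferred):

// build.rs
use std::process::Command;

fn main() {
    let release = Command::new("uname")
        .arg("-r")
        .output()
        .map(|o| String::from_utf8_lossy(&o.stdout).into_owned())
        .unwrap_or_default();
    let mut parts = release.trim().split('.');
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    // io_uring landed in 5.1; pick whatever baseline the needed opcodes require.
    if (major, minor) >= (5, 1) {
        println!("cargo:rustc-cfg=io_uring_available");
    }
}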