Reduce task-switching overhead on I/O path

Ralith commented 3 years ago

In the high-level crate, all I/O on a connection travels across a mpsc between connection and endpoint drivers. This has a number of costs:

I/O buffers must constantly be reallocated
Inbound data must be copied
Many round trips are made through the scheduler (both userspace and, in multithreaded environments, kernel)

We could avoid these by synchronously invoking connection driver logic in the endpoint, e.g. by replacing mpsc::UnboundedSender<ConnectionEvent> with Arc<Mutex<Connection>>. This will allow for I/O to operate directly on a single set of shared buffers without excess copying or allocator traffic, and bypass the scheduler entirely for purely internal I/O such as ACKs and flow control updates.

The obvious downside is loss of parallelism in the Connection logic, where the overwhelming majority of the intended computational cost resides. Following discussion with @Matthias247, I suspect intra-Endpoint parallelism is a red herring due to the mandatory synchronization involved in sharing a single socket, which exists at the kernel level if nowhere else. QUIC is designed to scale out across multiple endpoints using load balancers, which allows for unlimited horizontal scalability (at least below the application layer). By sacrificing transport-layer parallelism within an endpoint to reduce overhead for single-threaded configurations, we therefore also reduce overhead for arbitrarily parallel deployments, even on a single host, while improving performance for the overwhelmingly more common case of few connections.

Load balancers are a common feature of massive-scale internet service deployments, but may be unfamiliar to smaller-scale service operators accustomed to getting by with a sufficiently powerful single host with failover. For future work, we should explore options for lightweight, self-contained single-host load balancing to support this pattern. QUIC's "preferred address" mechanism is an interesting option, though it requires a cooperative client.

djc commented 3 years ago

Do we have some sense of the relative costs of the three things you mentioned? I wonder if we could alleviate the first two by doing smarter buffer management (I am thinking some kind of reference-counted ring buffer segments).

The loss of parallelism in the Connection logic does seem like a big downside.

Do we have a sense of how other implementations are solving these issues?

Ralith commented 3 years ago

Do we have some sense of the relative costs of the three things you mentioned?

I understand that @Matthias247's profiling indicates task switching is a major cost center for us, though he should confirm. He mentions 10% time spent in memcpy at 330MB/s, which is significant. Not sure about malloc.

I wonder if we could alleviate the first two by doing smarter buffer management (I am thinking some kind of reference-counted ring buffer segments).

This could work okay for inbound data, though it's obviously not free. Outbound data is more awkward because the connections are the ones allocating it.

Do we have a sense of how other implementations are solving these issues?

Good question. I went and spoke with Nick Banks about msquic's approach, of interest as they report >6Gbps. They do achieve intra-endpoint parallelism, as follows:

An endpoint's connections are partitioned across the available cores. Each core has two threads: one for UDP I/O, and one to drive the actual transport logic.
Receive-side scaling, a Windows kernel feature, is used to route inbound UDP packets directly to the correct core in the common case. In exceptional cases, a packet is received on the wrong core and is passed on to the correct core for consumption by the transport logic.
Packet buffers are pooled. After the transport logic is done with a buffer, it returns it to the I/O layer for reuse.

I think that puts them in a roughly similar position to us for potentially hitting the kernel scheduler on every ACK, so clearly there's lower hanging fruit. At the very least, getting a sensible ACK delay set up should do wonders there.

The inbound buffer pooling for inbound data is an easy win, and we should pursue that. I'm not sure about refcounting vs. copying. Outbound memory could also be pooled per-connection, though that could be fairly memory-wasteful.

The explicit partitioning is interesting, because that will vastly reduce contention without sacrificing any parallelism, and allows outbound buffers to be pooled per-core instead. Historically I've been averse to this sort of thing in the hopes of remaining within tokio idioms, but in https://github.com/quinn-rs/quinn/issues/915 we're already considering spawning our own threads anyway. A key question is whether we can get Linux to route packets usefully at the kernel level. If not, we could still consider following msquic's lead in, on Linux, having a single UDP thread and a transport logic thread managing a bundle of connections per core.

Ralith commented 3 years ago

To summarize, I think the main takeaway from msquic is "bundle connections together into coarse units, pool resources within bundles, and route inbound packets directly to bundles if possible." The routing part may be difficult in non-appliance contexts on Linux.

Matthias247 commented 3 years ago

Do we have some sense of the relative costs of the three things you mentioned? I wonder if we could alleviate the first two by doing smarter buffer management (I am thinking some kind of reference-counted ring buffer segments).

You can get a feeling about this by letting Quinn and the benchmark run on the multi-threaded tokio runtime instead of the single-threaded one, which will make it more likely to trigger races and task switches during packet handling. In my experiments performance dropped by more than 30% when doing this.

Therefore it is actually rather surprising that the windows RIO implementation that I did which always require thread switches (since only IO happens on a separate IO task) to still perform that good.

One feeling or concern that I have is that the difference will be even more visible on a highly loaded server. For those eventloop delays of >= 10ms or even >= 100ms for application level eventloops are unfortunately a thing which isn't completely unheard of. If quic packet processing and enqueuing ACKs gets delayed that much the impact on performance and reliability could be rather big. Maybe a new benchmark which measures performance when doing lots of connection establishments could be interesting.

memcpy indeed has a cost, and as cited by @Ralith it was at around 10% at 330MB/s. The result is actually interesting in both ways:

It isn't the biggest bottleneck in the current architecture. Copying every packet once more on the ingress and egress part is noticable but might not have a direct impact on peak performance.
However once other things have been improved it will be one of the bigger contributors.

This could work okay for inbound data, though it's obviously not free. Outbound data is more awkward because the connections are the ones allocating it.

It can work for both. It mainly requires the connections poll_transmit method to allocate the buffer from the per-endpoint connection pool (about which it needs to be aware) and return the reference to it. That pool could be a buffer of buffers shared with the kernel for efficient RIO/io_uring/XDP transfers. The downside of this is that you then have a more specialized buffer type in quinn-proto, and that the buffer pool might potentially need to be thread-safe.

The loss of parallelism in the Connection logic does seem like a big downside.

There is indeed a loss of parallelism, and it shows up for 2 items:

The crypto operations for decrypting and encrypting packets would no longer happen in the connections context, but on the packet processing thread. These accounted for around 15% of CPU time at 330MB/s.
Handshakes which require even more expensive private key operations would block the UDP thread potentially even longer.

Do we have a sense of how other implementations are solving these issues?

Good question. I went and spoke with Nick Banks about msquic's approach, of interest as they report >6Gbps. They do achieve intra-endpoint parallelism, as follows:

For server use-cases achieving parallelism through multiple endpoints will likely be the way to go. If you have a modern 16-64 core server, there is no way to saturate a NIC which uses multiple queues with just a single UDP socket served by a single core. The data-flow can be splitted up earlier on so that multiple sockets all serve a subset of connections. This definitely requires some kind of load balancing support, which can get more or less fancy depending on requirements. That part is however then more of a "QUIC system integration" than "QUIC library" question.

Ralith commented 3 years ago

there is no way to saturate a NIC which uses multiple queues with just a single UDP socket served by a single core

I'm not sure about this. msquic seems to be employing multiqueue on a single socket to great effect.

Matthias247 commented 3 years ago

I'm not sure about this. msquic seems to be employing multiqueue on a single socket to great effect.

That's for 6Gbps - right? I'm more interested on how to utilize most out of 25Gbps to 100Gbps NICs. At those speeds typically already multiple queues are used to get packets from NICs into CPUs, and each queue is worked on by a dedicated core - because otherwise things wouldn't be fast enough anymore. It's hard to believe that you could then take all those packets and forward them onto a single socket served by a single thread and still get good performance :-)

Matthias247 commented 3 years ago

FWIW: I downloaded quiche and did some minimal modifications of their examples to do throughput benchmarking. For a similar setup as with Quinns benchmark (transferring 100MB of data on a single stream), I retrieve the following results:

[2020-11-15T03:35:17.010623300Z INFO quiche_apps] 1/1 response(s) of size 104857600 received in 298.737ms (throughput: 334.74259967797764MB/s), closing...

This is a fair but faster than Quinn (250MB/s on this host), without making any use of intra-endpoint parallelism or fancier IO strategies. It does a single send/recv at a time.

Ralith commented 3 years ago

That's for 6Gbps - right? I'm more interested on how to utilize most out of 25Gbps to 100Gbps NICs.

Right. Certainly hitting those levels would be spectacular.

It's hard to believe that you could then take all those packets and forward them onto a single socket served by a single thread and still get good performance :-)

But they're not; that's the point. Each queue terminates in a different thread.

miquels commented 3 years ago

Drive-by comment: you can use SO_REUSEPORT in Linux to bind multiple sockets to the same local address/port. The kernel will then do the loadbalancing for you. Other OSes may have something similar; in FreeBSD it's called SO_REUSEPORT_LB. Works for both UDP and TCP. I'm doing this in my NNTP server; I start N (N = number of CPUs) tokio basic (single-threaded) runtimes, each bound to a single CPU, and each then creates a socket, sets SO_REUSEPORT on it, binds it to 0.0.0.0:119, and starts listening.

For optimum performance, you (or "the sysadmin") will still have to bind the IRQs of the queues on the NIC to seperate cores or enable software RSS. Unfortunately that is not enabled by default.

Ralith commented 3 years ago

Unfortunately it's not that simple; load balancing that doesn't direct the right packet to the right thread is worse than nothing.

Ralith commented 3 years ago

Additional data point in favor of unifying endpoint/connection tasks: @Matthias247's testing shows 6.6% CPU time in channels, and that's likely an undercount. The added buffering of the inter-task queues also interacts in complex ways with congestion control.

DemiMarie commented 2 years ago

Unfortunately it's not that simple; load balancing that doesn't direct the right packet to the right thread is worse than nothing.

IIRC Google explicitly sends the packet to the correct thread in this case, and found that this is rare enough to not significantly impact performance.

rrichardson commented 1 year ago

If I may restate the issue (on Linux, specifically) to clarify. Tell me if I'm wrong here:

To illustrate by contrast, TCP uses 1 port for listening, and 1 port for each connected session. A TCP server that uses SO_REUSEPORT can enable multiple threads load-balancing the the accept() and connection creation process in a way that requires no synchronization. In addition, the state(machine) that manages the TCP connection lives in 1 place: the kernel. Therefore, a userspace service which uses TCP and SO_REUSEPORT has no burden to synchronize management of the central state machine for that connection. Any thread is free to send or receive data and has no worry of synchronization with other threads (at the transport/session level)

However, QUIC has neither of those advantages mentioned. All of the requests and all packets for all connections go to the same port. So load-balancing happens at the Packet level, not the connection/session level. In addition, the connection state lives in userspace. The state machine for each connection is now the responsibility of every thread/core that could send or receive a packet on behalf of that connection.

Interfaces with applications must still be organized into connections (and streams) otherwise there would be chaos, so some sort of synchronization is necessary.

The prevailing choices for spreading out load in QUIC on Linux are:

Job Queue Approach: Have a thread dedicated to managing the connection and the state machine for the quic protocol. Then have N threads available for responding to the top-level QUIC events (accept_bi, read, write) etc.
Synchronized Connection State Approach: Use SO_REUSEPORT, so that any participating thread can receive a packet for any connection. Any thread would then be responsible for updating the Connection state, but the state would be locked by a mutex, or some other synchronization mechanism.
Single Thread Per Connection and Worker Approach: In this approach, the construction of Connections would be

Then there are some hybrid options:

Job Queue, with trivial messaging managed by the Connection Thread when possible. To minimize round-trips on the job queue, any messages, such as Acks, that can be managed by the connection should be done by the Connection thread and not the endpoint threads. (I think this is already the current scheme)

rrichardson commented 1 year ago

It seems to me that for QUIC to achieve high throughput, we need a deus ex machina solution that consistently distributes packets based on the source:dest:port triple. So that the connection and the endpoint/worker can exist in the same thread without the need for synchronization. This would also work better for current-thread executor schemes (like io_uring)

It appears that Facebook's Proxygen does something like this with bpf_sk_reuseport

Ralith commented 1 year ago

The current architecture has an endpoint task, which distributes work to connection tasks, which distribute work to application-level tasks via the Quinn API. The immediate next step here is to get https://github.com/quinn-rs/quinn/pull/1219 merged, which removes the intervening connection tasks. I believe it's in a good state, but it's blocked on @djc's review. That review could perhaps be facilitated by breaking up one of the key commits into smaller, more incremental pieces.

A TCP server [...] requires no synchronization. [...] However, QUIC has neither of those advantages mentioned.

To be clear, TCP has a similar problem, but the requisite synchronization is done in kernel space. QUIC doesn't introduce specific new challenges here, it just moves the problem into user space instead. At least in principle, we should be able to solve it just as well as the kernel TCP stack does.

Synchronized Connection State Approach

I think this is a promising path, on the basis that I've heard Linux will distribute packets among SO_REUSEPORT instances based on 4-tuple hashes, i.e. delivery will tend to be to consistent threads. If true, that means no special OS-level measures are necessary. We could therefore achieve high single-endpoint scalability by running one endpoint driver per core, sharing the same endpoint state, with fine-grained locking of connections within that state. These fine-grained locks will have little to no contention, and hence will not bottleneck. Some care will be needed around the handling of new incoming connections.

At the extreme, QUIC scales best by using large numbers of independent endpoints on multiple hosts with QUIC-aware packet-layer load balancers in front of them. This type of deployment is complex, however, so I am interested in supporting the mid-scale case of a single endpoint driven by many threads on a single host.

vlovich commented 1 year ago

One thing I’ve been thinking about is having separate QUIC connections per CPU and making it the responsibility of the application logic to manage sharding through a pool of connections. Is that something that’s already well-supported (ie zero memcpy) or is that a sufficiently unique design idea that it’s unlikely to be well supported by a generic stack like QUINN?

Ralith commented 1 year ago

Using a separate endpoint per CPU should perform well without special effort if it suits your application. Connections on the same endpoint must regularly synchronize, though some parallelism benefit can still be had. We're interested in incrementally improving on that, but it's an open-ended research problem.

DemiMarie commented 1 year ago

@Ralith @vlovich The solution Google came up with is:

Run 1 event loop per core.
Use receive-side scaling or multiple NIC queues to direct packets to the correct core.
If a packet does wind up on the wrong core (which can happen due to connection migration) , send that packet to the correct core via explicit message passing in userspace. This is considered a slowpath because connection migration is relatively rare.
Use kernel bypass (for RX, IIRC) to improve performance.

From an application perspective, the simplest way to get good performance would be for Quinn to handle all of this itself. 2 and 4 require administrative configuration, but Quinn should be able to do 1 and 3. An alternative to 3 is to use eBPF, but this requires superuser privileges and only works on Linux.

Cloudflare’s quiche takes a different approach: quiche types are !Send and !Sync, so no internal synchronization is required. Instead, quiche requires the user to manually get the packet to the correct connection. I suspect this is because quiche does no I/O, and it is not possible to do packet dispatch in a way that is both fast and I/O-system-agnostic. (It is easy to do inefficient packet dispatch, such as what quinn does today, but that isn’t quiche’s goal.)

Ralith commented 1 year ago

At this point I'm not interested in architectures that don't support work stealing, as workloads that are guaranteed to be uniform enough for shared-nothing to be robust seem quite rare. I haven't studied quiche in depth but I understand that work-stealing is a key component of Cloudflare's architecture as well.

DemiMarie commented 1 year ago

I understand that work-stealing is a key component of Cloudflare's architecture as well.

Can you explain? If a single connection is using even 5% of a single CPU core, then 1000 will eat 50 CPU cores, and that is not acceptable at Cloudflare’s scale. I expect that Cloudflare simply drops connections that use so much CPU.

vlovich commented 1 year ago

Work stealing isn’t about how you handle CPU overload. You are correct that back pressure is how you need to handle that.

Work stealing is about making sure that when there’s a schedulable task (eg was waiting on packets that have arrived) and available CPU cycles somewhere, the task makes some forward progress. This is compared with a single threaded runtime which essentially only makes forward progress if there’s available CPU capacity in that thread.

It’s tricky to guess what makes sense at Cloudflare scale and there’s often trade offs. For example, tail latencies on work stealing in theory should be better. However, I suspect that they’re only better in the case where the machine is overloaded and you’re dropping connections anyway and in all other cases the lack of synchronization and heap allocations all over the place works in the single threaded favor. Where it gets tricky is that single threaded runtimes require a lot of specialized knowledge of how to design your application to use them and tune the runtimes properly whereas work stealing runtimes perform fairly well on arbitrary workloads and are “easier”

DemiMarie commented 1 year ago

Where it gets tricky is that single threaded runtimes require a lot of specialized knowledge of how to design your application to use them and tune the runtimes properly whereas work stealing runtimes perform fairly well on arbitrary workloads and are “easier”

Can you elaborate on this @vlovich?

quinn-rs / quinn

Reduce task-switching overhead on I/O path #914