Ralith opened 1 year ago
On review, handling multiple connections on a single task might actually be undesirable as it prevents tokio's work-stealing from redistributing work when e.g. a single connection involves disproportionately large amounts of work. Splitting up the endpoint task is still probably valuable, but maybe we don't want to inline connection tasks after all.
Also note tokio's multithreaded runtime relies on a single global epoll loop to drive I/O across all threads, which reportedly allows for more efficient work stealing than epoll-per-thread would. Unclear if this will be a significant bottleneck for Quinn.
Hi, I've encountered a related issue: when handling thousands of connections, a quinn server cannot be scaled horizontally, and the CPU cannot be fully utilized.
Quinn, like any flexible QUIC implementation, supports horizontal scaling across multiple endpoints through the use of custom connection ID generators and QUIC-aware load-balancing front-ends. However, the requirement for a third-party load balancer and custom logic makes this difficult to leverage. For the overwhelmingly common case of applications that fit comfortably on a single host, Quinn should allow a single endpoint to scale the number of concurrent connections near-linearly with respect to available CPU cores.
In the current architecture, separate connection drivers already allow significant cryptographic work, and all application-layer work, for independent connections to happen in parallel. A bottleneck remains at the endpoint driver, an async task responsible for driving all network I/O and timers. We can do better.
The Linux `SO_REUSEPORT` option will distribute incoming packets on a single port across multiple sockets. Packets are routed to sockets based on a 4-tuple hash, so we can rely on connections moving between drivers only in the event of migrations, minimizing contention outside of connection setup/teardown. Windows "Receive Side Scaling" may be similar. On at least these platforms, we can therefore run up to one endpoint driver per core.
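As a concrete sketch of what the socket side could look like (hypothetical, not part of Quinn), the `socket2` crate exposes `SO_REUSEPORT`; each resulting socket would then back one endpoint driver. The function name and driver count below are illustrative only:

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

use socket2::{Domain, Protocol, Socket, Type};

/// Bind `drivers` UDP sockets to the same address. With SO_REUSEPORT set, the
/// Linux kernel shards incoming datagrams across them by 4-tuple hash, so each
/// socket (and the driver that owns it) sees a stable subset of connections.
fn bind_reuseport(addr: SocketAddr, drivers: usize) -> io::Result<Vec<UdpSocket>> {
    let mut sockets = Vec::with_capacity(drivers);
    for _ in 0..drivers {
        let socket = Socket::new(Domain::for_address(addr), Type::DGRAM, Some(Protocol::UDP))?;
        socket.set_reuse_port(true)?; // Unix-only; this whole approach is per-platform
        socket.bind(&addr.into())?;
        sockets.push(socket.into());
    }
    Ok(sockets)
}
```

Since the kernel keys on the 4-tuple, an established connection keeps hitting the same socket, so cross-driver traffic should be limited to migrations and connection setup/teardown as described above.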
Some architectural changes are required to prevent catastrophic contention.

Key `quinn_proto::Endpoint` methods take `&mut self`, preventing meaningful parallelization of datagram processing. We should embrace interior mutability to convert these to `&self` methods which will not contend on the hot path of handling datagrams of established connections. In particular, `Connection` states must be concurrently mutable. Morally this could be as simple as `RwLock<Slab<Mutex<Connection>>>`, though std APIs for nested locks are awkward. Similar changes will be needed at the `quinn` layer.

Tokio presently lacks a mechanism to ensure certain tasks run on separate threads, which may complicate improving parallelism for high-level users. We could address this by spawning our own threads, or by working with upstream to develop new APIs. It's also possible that simply spawning N drivers and letting work stealing do its thing might work out well enough in practice.
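To illustrate the `RwLock<Slab<Mutex<Connection>>>` shape suggested above, here is a minimal sketch (hypothetical types and method names, not Quinn's actual API) of a `&self` datagram path that only takes the per-connection lock:

```rust
use std::sync::{Mutex, RwLock};

use slab::Slab;

struct Connection {
    // Per-connection QUIC state (packet spaces, streams, crypto, ...).
}

struct EndpointShared {
    // Read-locked on the datagram hot path; write-locked only when
    // connections are inserted or removed.
    connections: RwLock<Slab<Mutex<Connection>>>,
}

impl EndpointShared {
    /// Hot path: takes `&self`, so any number of driver threads can call it
    /// concurrently. Only the targeted connection is locked exclusively.
    fn handle_datagram(&self, index: usize, _datagram: &[u8]) -> bool {
        let connections = self.connections.read().unwrap();
        match connections.get(index) {
            Some(conn) => {
                let _state = conn.lock().unwrap();
                // ... decrypt `_datagram` and drive this connection's state here ...
                true
            }
            None => false,
        }
    }

    /// Connection setup takes the write lock, but is comparatively rare.
    fn insert(&self, conn: Connection) -> usize {
        self.connections.write().unwrap().insert(Mutex::new(conn))
    }
}
```

Datagrams for different established connections then proceed in parallel, and contention is confined to setup/teardown and to concurrent packets for the same connection.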
Unified Drivers
This refactoring has previously been associated with the unification of endpoint and connection drivers (e.g. #1219). I believe they can be separated, though moving `quinn_proto::Connection` ownership into `quinn_proto::Endpoint` may still be desirable for API simplicity and to reduce potentially costly inter-thread communication. To avoid undermining our current parallelism, this should only be pursued after endpoints become horizontally scalable.

Flattening connection tasks complicates timer handling, since we don't want to poll each connection's timer on every endpoint wakeup. My `timer-queue` crate provides a solution. Each endpoint driver could maintain timers for connections involved in traffic passing through that driver. By discarding timeouts for a connection that has seen activity more recently on another driver, we can ensure timers do not reintroduce contention.
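One way to make the "discard timeouts for a connection that has seen newer activity elsewhere" rule concrete, without presuming anything about the `timer-queue` API, is to stamp each connection with an activity counter and record that stamp in every queued timeout (all names here are hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Shared per-connection counter, bumped by whichever driver last saw traffic.
struct ConnectionActivity {
    epoch: AtomicU64,
}

/// A timeout sitting in one driver's local timer queue.
struct QueuedTimeout {
    deadline: Instant,
    armed_at_epoch: u64,
}

impl ConnectionActivity {
    /// Called on every datagram handled for this connection.
    fn record_activity(&self) -> u64 {
        self.epoch.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Arm a timeout in the current driver's queue, remembering the epoch.
    fn arm(&self, deadline: Instant) -> QueuedTimeout {
        QueuedTimeout {
            deadline,
            armed_at_epoch: self.epoch.load(Ordering::Relaxed),
        }
    }

    /// When a driver pops an expired timeout, fire it only if no other driver
    /// has handled newer traffic for this connection in the meantime.
    fn is_stale(&self, timeout: &QueuedTimeout) -> bool {
        self.epoch.load(Ordering::Relaxed) > timeout.armed_at_epoch
    }
}
```

Stale entries are simply dropped when they pop, so drivers never need to inspect or cancel entries in each other's queues.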
Future Work
Opportunity also exists for fine-grained parallelism between streams on a single connection, similar in spirit to connections in a single endpoint.