Ralith opened 1 year ago
On review, handling multiple connections on a single task might actually be undesirable as it prevents tokio's work-stealing from redistributing work when e.g. a single connection involves disproportionately large amounts of work. Splitting up the endpoint task is still probably valuable, but maybe we don't want to inline connection tasks after all.
Also note tokio's multithreaded runtime relies on a single global epoll loop to drive I/O across all threads, which reportedly allows for more efficient work stealing than epoll-per-thread would. Unclear if this will be a significant bottleneck for Quinn.
Hi, I've encountered a related issue: when handling thousands of connections, a quinn server cannot be scaled horizontally, and the CPU cannot be fully utilized.
Quinn, like any flexible QUIC implementation, supports horizontal scaling across multiple endpoints through the use of custom connection ID generators and QUIC-aware load-balancing front-ends. However, the requirement for a third-party load balancer and custom logic makes this difficult to leverage. For the overwhelmingly common case of applications that fit comfortably on a single host, Quinn should allow a single endpoint to scale the number of concurrent connections near-linearly with respect to available CPU cores.
In the current architecture, separate connection drivers already allow significant cryptographic work, and all application-layer work, for independent connections to happen in parallel. A bottleneck remains at the endpoint driver, an async task responsible for driving all network I/O and timers. We can do better.
The Linux `SO_REUSEPORT` option will distribute incoming packets on a single port across multiple sockets. Packets are routed to sockets based on a 4-tuple hash, so we can rely on connections moving between drivers only in the event of migrations, minimizing contention outside of connection setup/teardown. Windows "Receive Side Scaling" may be similar. On at least these platforms, we can therefore run up to one endpoint driver per core.
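As a concrete sketch of what the socket side could look like (hypothetical, not part of Quinn), the `socket2` crate exposes `SO_REUSEPORT`; each resulting socket would then back one endpoint driver. The function name and driver count below are illustrative only:

```rust
use std::io;
use std::net::{SocketAddr, UdpSocket};

use socket2::{Domain, Protocol, Socket, Type};

/// Bind `drivers` UDP sockets to the same address. With SO_REUSEPORT set, the
/// Linux kernel shards incoming datagrams across them by 4-tuple hash, so each
/// socket (and the driver that owns it) sees a stable subset of connections.
fn bind_reuseport(addr: SocketAddr, drivers: usize) -> io::Result<Vec<UdpSocket>> {
    let mut sockets = Vec::with_capacity(drivers);
    for _ in 0..drivers {
        let socket = Socket::new(Domain::for_address(addr), Type::DGRAM, Some(Protocol::UDP))?;
        socket.set_reuse_port(true)?; // Unix-only; this whole approach is per-platform
        socket.bind(&addr.into())?;
        sockets.push(socket.into());
    }
    Ok(sockets)
}
```

Since the kernel keys on the 4-tuple, an established connection keeps hitting the same socket, so cross-driver traffic should be limited to migrations and connection setup/teardown as described above.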
Some architectural changes are required to prevent catastrophic contention.

Key `quinn_proto::Endpoint` methods take `&mut self`, preventing meaningful parallelization of datagram processing. We should embrace interior mutability to convert these to `&self` methods which will not contend on the hot path of handling datagrams of established connections. In particular, `Connection` states must be concurrently mutable. Morally this could be as simple as `RwLock<Slab<Mutex<Connection>>>`, though std APIs for nested locks are awkward. Similar changes will be needed at the `quinn` layer.

Tokio presently lacks a mechanism to ensure certain tasks run on separate threads, which may complicate improving parallelism for high-level users. We could address this by spawning our own threads, or by working with upstream to develop new APIs. It's also possible that simply spawning N drivers and letting work stealing do its thing might work out well enough in practice.
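To illustrate the `RwLock<Slab<Mutex<Connection>>>` shape suggested above, here is a minimal sketch (hypothetical types and method names, not Quinn's actual API) of a `&self` datagram path that only takes the per-connection lock:

```rust
use std::sync::{Mutex, RwLock};

use slab::Slab;

struct Connection {
    // Per-connection QUIC state (packet spaces, streams, crypto, ...).
}

struct EndpointShared {
    // Read-locked on the datagram hot path; write-locked only when
    // connections are inserted or removed.
    connections: RwLock<Slab<Mutex<Connection>>>,
}

impl EndpointShared {
    /// Hot path: takes `&self`, so any number of driver threads can call it
    /// concurrently. Only the targeted connection is locked exclusively.
    fn handle_datagram(&self, index: usize, _datagram: &[u8]) -> bool {
        let connections = self.connections.read().unwrap();
        match connections.get(index) {
            Some(conn) => {
                let _state = conn.lock().unwrap();
                // ... decrypt `_datagram` and drive this connection's state here ...
                true
            }
            None => false,
        }
    }

    /// Connection setup takes the write lock, but is comparatively rare.
    fn insert(&self, conn: Connection) -> usize {
        self.connections.write().unwrap().insert(Mutex::new(conn))
    }
}
```

Datagrams for different established connections then proceed in parallel, and contention is confined to setup/teardown and to concurrent packets for the same connection.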
Unified Drivers
This refactoring has previously been associated with the unification of endpoint and connection drivers (e.g. #1219). I believe they can be separated, though moving `quinn_proto::Connection` ownership into `quinn_proto::Endpoint` may still be desirable for API simplicity and to reduce potentially costly inter-thread communication. To avoid undermining our current parallelism, this should only be pursued after endpoints become horizontally scalable.

Flattening connection tasks complicates timer handling, since we don't want to poll each connection's timer on every endpoint wakeup. My `timer-queue` crate provides a solution. Each endpoint driver could maintain timers for connections involved in traffic passing through that driver. By discarding timeouts for a connection that has seen activity more recently on another driver, we can ensure timers do not reintroduce contention.
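One way to make the "discard timeouts for a connection that has seen newer activity elsewhere" rule concrete, without presuming anything about the `timer-queue` API, is to stamp each connection with an activity counter and record that stamp in every queued timeout (all names here are hypothetical):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Shared per-connection counter, bumped by whichever driver last saw traffic.
struct ConnectionActivity {
    epoch: AtomicU64,
}

/// A timeout sitting in one driver's local timer queue.
struct QueuedTimeout {
    deadline: Instant,
    armed_at_epoch: u64,
}

impl ConnectionActivity {
    /// Called on every datagram handled for this connection.
    fn record_activity(&self) -> u64 {
        self.epoch.fetch_add(1, Ordering::Relaxed) + 1
    }

    /// Arm a timeout in the current driver's queue, remembering the epoch.
    fn arm(&self, deadline: Instant) -> QueuedTimeout {
        QueuedTimeout {
            deadline,
            armed_at_epoch: self.epoch.load(Ordering::Relaxed),
        }
    }

    /// When a driver pops an expired timeout, fire it only if no other driver
    /// has handled newer traffic for this connection in the meantime.
    fn is_stale(&self, timeout: &QueuedTimeout) -> bool {
        self.epoch.load(Ordering::Relaxed) > timeout.armed_at_epoch
    }
}
```

Stale entries are simply dropped when they pop, so drivers never need to inspect or cancel entries in each other's queues.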
Future Work
Opportunity also exists for fine-grained parallelism between streams on a single connection, similar in spirit to connections in a single endpoint.