Noah-Kennedy commented 9 months ago

The problem

When building shared-nothing systems at scale, load balancing new connections is a significant challenge.

To summarize that blog post, SO_REUSEPORT can often introduce new sources of tail latency because it splits new connections into per-socket queues regardless of whether the workers owning those sockets are currently busy. A worker parked on epoll_wait is generally a much better candidate for a new connection than a worker currently handling existing connections, since a worker that is already looking for work clearly has the capacity to accept it. Note that even eBPF-based SO_REUSEPORT load balancing isn't ideal here, as an eBPF program can't really be that smart. Therefore, it tends to be better for latency to load balance with epoll. Unfortunately, this can't be done with the vanilla set of options: normally, if multiple epoll instances watch the same socket, all of them get the notification, leading to a thundering herd.

Fortunately, the EPOLLEXCLUSIVE flag resolves this issue: of the waiting epoll instances that registered the socket with the flag set, the kernel wakes one or more rather than all of them. EPOLLEXCLUSIVE is, as a result, extraordinarily useful for at-scale shared-nothing systems. It isn't always the best approach, depending on how sensitive a system is to even load balancing vs TTFB, but it's an important element of any shared-nothing toolbox.
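For readers who have not used the flag, here is a minimal sketch of the registration each worker would perform. It talks to epoll directly through hand-declared FFI bindings rather than through tokio or Mio, is Linux-only (kernel 4.5 or newer), and assumes a recent toolchain (the `unsafe extern` block needs Rust 1.82 or newer). The helper names `worker_epoll` and `wait_ready` are made up for illustration. It registers one shared listener on two private epoll instances with EPOLLIN | EPOLLEXCLUSIVE (level-triggered) and then polls one of them.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};
use std::os::unix::io::{AsRawFd, RawFd};

// Values from <sys/epoll.h> on Linux, declared by hand so the sketch needs no crates.
const EPOLL_CTL_ADD: i32 = 1;
const EPOLLIN: u32 = 0x001;
const EPOLLEXCLUSIVE: u32 = 1 << 28;

// Mirrors `struct epoll_event`, which is packed on x86_64.
#[repr(C)]
#[cfg_attr(target_arch = "x86_64", repr(packed))]
struct EpollEvent {
    events: u32,
    data: u64,
}

unsafe extern "C" {
    fn epoll_create1(flags: i32) -> i32;
    fn epoll_ctl(epfd: i32, op: i32, fd: i32, event: *mut EpollEvent) -> i32;
    fn epoll_wait(epfd: i32, events: *mut EpollEvent, maxevents: i32, timeout_ms: i32) -> i32;
    fn close(fd: i32) -> i32;
}

/// Hypothetical helper: create one worker's private epoll instance and
/// register the shared listener on it with EPOLLIN | EPOLLEXCLUSIVE.
fn worker_epoll(listener: &TcpListener) -> io::Result<RawFd> {
    let epfd = unsafe { epoll_create1(0) };
    if epfd < 0 {
        return Err(io::Error::last_os_error());
    }
    let fd = listener.as_raw_fd();
    let mut ev = EpollEvent { events: EPOLLIN | EPOLLEXCLUSIVE, data: fd as u64 };
    if unsafe { epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &mut ev) } < 0 {
        let err = io::Error::last_os_error();
        unsafe { close(epfd) };
        return Err(err);
    }
    Ok(epfd)
}

/// Hypothetical helper: wait for at most one event and return how many arrived.
fn wait_ready(epfd: RawFd, timeout_ms: i32) -> io::Result<usize> {
    let mut ev = EpollEvent { events: 0, data: 0 };
    let n = unsafe { epoll_wait(epfd, &mut ev, 1, timeout_ms) };
    if n < 0 {
        return Err(io::Error::last_os_error());
    }
    Ok(n as usize)
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    listener.set_nonblocking(true)?;

    // Two "workers", each with its own epoll instance watching the same socket.
    let workers = [worker_epoll(&listener)?, worker_epoll(&listener)?];

    let _client = TcpStream::connect(listener.local_addr()?)?;

    let ready = wait_ready(workers[0], 1000)?;
    println!("worker 0 saw {ready} ready event(s)");
    if ready == 1 {
        // Level-triggered: one accept per wakeup is enough.
        let (_conn, peer) = listener.accept()?;
        println!("accepted connection from {peer}");
    }

    for epfd in workers {
        unsafe { close(epfd) };
    }
    Ok(())
}
```

Because neither worker is blocked in epoll_wait when the connection arrives, this sketch only shows the registration and does not exercise the wakeup steering itself. That only comes into play once each worker thread is blocked on its own instance.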

At Cloudflare, we have services which use tokio in both shared-nothing and work-stealing configurations and make extensive use of EPOLLEXCLUSIVE and other atypical epoll flags. Based on our experience serving diverse types of traffic at scale, we think that allowing users to leverage custom epoll flags would make tokio a significantly more powerful toolkit for users working on shared-nothing systems.

The solution

I have a POC patch, which I can push later, that adds a new from_std variant to several types (currently just the TCP and AF_UNIX stream listeners). The variant lets the caller specify the exact set of epoll flags to use when registering the socket with our epoll descriptor. If we made this fallible, it wouldn't block the use of io_uring or similar in the future, as we could just document that this only works if you are using epoll. We could potentially do that only with AsyncFd, or with the listener types as I implemented in the POC, or both.
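To illustrate why fallibility keeps the door open for other drivers, here is a purely hypothetical sketch. Nothing in it is tokio's or the POC's actual API: `Backend`, `Registration`, and `register_with_epoll_flags` are invented names, and the "registration" is a stand-in that only records the flags.

```rust
use std::io;

// Flag values from <sys/epoll.h> on Linux.
const EPOLLIN: u32 = 0x001;
const EPOLLEXCLUSIVE: u32 = 1 << 28;

/// Hypothetical stand-in for whichever I/O driver the runtime is using.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backend {
    Epoll,
    IoUring,
}

/// Hypothetical handle returned by a successful registration.
#[derive(Debug)]
struct Registration {
    raw_flags: u32,
}

/// Hypothetical fallible constructor: raw epoll flags are only meaningful
/// when the driver is epoll, so every other backend reports an error
/// instead of silently ignoring the flags.
fn register_with_epoll_flags(backend: Backend, raw_flags: u32) -> io::Result<Registration> {
    match backend {
        Backend::Epoll => Ok(Registration { raw_flags }),
        Backend::IoUring => Err(io::Error::new(
            io::ErrorKind::Unsupported,
            "raw epoll flags require the epoll driver",
        )),
    }
}

fn main() {
    let flags = EPOLLIN | EPOLLEXCLUSIVE;

    match register_with_epoll_flags(Backend::Epoll, flags) {
        Ok(reg) => println!("epoll backend: registered with flags {:#x}", reg.raw_flags),
        Err(e) => println!("epoll backend: {e}"),
    }

    if let Err(e) = register_with_epoll_flags(Backend::IoUring, flags) {
        println!("io_uring backend: {e}");
    }
}
```

The point of the sketch is only the error path: a caller that asks for epoll-specific flags on a non-epoll driver gets an explicit error it can handle, which is what documenting "this only works if you are using epoll" would amount to in code.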

We could also try and add in EPOLLEXCLUSIVE as a new IO interest that users can specify, but this has all of the issues of the POC approach I took, while being more complicated for us to implement and less flexible for users. For that reason, I'd recommend something along the lines of option number one.

If this RFC is accepted, I can take responsibility for the implementation of this.

Because Mio exposes the raw fd of the epoll instance, it can be bypassed entirely for the purposes of implementing this functionality in Tokio. As a result, Mio support is not a prerequisite for Tokio having this functionality.

Darksonn commented 9 months ago

This seems reasonable enough. I think the main question here is what the new from_std API should look like. It would make sense to choose an API that is extensible in the future, e.g. for passing flags to io_uring, kqueue, or Windows AFD.

Nerdy5k commented 9 months ago

Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.

Noah-Kennedy commented 9 months ago

> Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.

@Nerdy5k what are your goals, and how does this impact your ability to use this library?

Nerdy5k commented 9 months ago

> > Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.
>
> @Nerdy5k what are your goals, and how does this impact your ability to use this library?

I want to keep the metal io approach as much as possible without delegating to separate api workers.

Noah-Kennedy commented 9 months ago

> > > Pushing against the POC Patch approach as this does not align with my future goals purpose of this library.
> >
> > @Nerdy5k what are your goals, and how does this impact your ability to use this library?
>
> I want to keep the metal io approach as much as possible without delegating to separate api workers.

This doesn't force you to change how you use tokio or mio. It just opens up new options for others who are currently using shared-nothing.

Noah-Kennedy commented 9 months ago

@Nerdy5k could you elaborate on what you mean here?

We aren't talking about changing the innards of tokio in any way which modifies existing behavior, merely adding a new way to construct registered sockets. This doesn't impact the current IO approach; it just allows a new way to interface with it.

I'm not sure what you mean by "separate API workers". I suspect this to be the result of confusion?

Noah-Kennedy commented 9 months ago

I put up the POC here: https://github.com/tokio-rs/tokio/pull/6089

carllerche commented 8 months ago

The blog post uses level-triggered notification, which allows the code to perform 1 accept() per epoll_wait. Tokio uses edge-triggered, which means users must call accept() until EWOULDBLOCK, which somewhat defeats the load balancing aspect.

Can you address this?

Noah-Kennedy commented 8 months ago

Sure!

You bring up a good point here: while there are valid reasons to use EPOLLET | EPOLLEXCLUSIVE, you generally want to use level-triggered mode for accept, not just because of load balancing, but also because of short-term starvation issues under a burst of new connections. There are situations where you want this flag combination, but they are a minority of cases.

This slipped my mind earlier in the convo, but it was one of the reasons that I crafted the patch this way, with users controlling their own flags, including interests. Thanks for reminding me of this; I need to add some notes to the documentation regarding this case.
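To make the difference between the two accept styles concrete, here is a minimal sketch using only the standard library. It does not use epoll at all: it just contrasts the drain loop that edge-triggered readiness requires with the single accept that level-triggered readiness allows. The function names are made up for illustration.

```rust
use std::io;
use std::net::{TcpListener, TcpStream};

/// Edge-triggered style: after one readiness notification the worker must
/// keep accepting until the kernel reports WouldBlock, so a single worker
/// can end up taking a whole burst of connections.
fn accept_until_would_block(listener: &TcpListener) -> io::Result<Vec<TcpStream>> {
    let mut conns = Vec::new();
    loop {
        match listener.accept() {
            Ok((stream, _peer)) => conns.push(stream),
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => return Ok(conns),
            Err(e) => return Err(e),
        }
    }
}

/// Level-triggered style: accept at most one connection per wakeup and leave
/// the rest in the queue, where they can wake this or another waiting worker.
fn accept_one(listener: &TcpListener) -> io::Result<Option<TcpStream>> {
    match listener.accept() {
        Ok((stream, _peer)) => Ok(Some(stream)),
        Err(e) if e.kind() == io::ErrorKind::WouldBlock => Ok(None),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    listener.set_nonblocking(true)?;
    let addr = listener.local_addr()?;

    // Simulate a small burst of new connections.
    let _clients: Vec<TcpStream> = (0..3)
        .map(|_| TcpStream::connect(addr))
        .collect::<io::Result<_>>()?;

    // One wakeup, level-triggered style: take a single connection.
    let first = accept_one(&listener)?;
    println!("accept_one took a connection: {}", first.is_some());

    // One wakeup, edge-triggered style: take everything that is queued.
    let rest = accept_until_would_block(&listener)?;
    println!("accept_until_would_block took {} connection(s)", rest.len());
    Ok(())
}
```

With the drain loop, whichever worker is woken first absorbs the entire burst, which is the load-balancing concern raised above. With one accept per wakeup, the remaining connections stay visible to other waiting workers.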

tokio-rs / tokio

RFC: User-specified epoll flags #6084
