oxen-io / oxen-mq

Communications layer used for both the Oxen storage server and oxend
https://oxen.io

Add epoll support for Linux (huge proxy thread CPU reduction) #87

Closed · jagerman closed this 11 months ago

jagerman commented 11 months ago

Each call to zmq::poll is painfully slow when we have many open zmq sockets, such as when we have 1800 outbound connections (i.e. connected to every other service node, as service nodes sometimes are and the Session push notification server always is).

In testing on my local Ryzen 5950 system, each return to zmq::poll incurs about 1.5ms of (mostly system) CPU time with 2000 open outbound sockets, so if we're being pelted with a nearly constant stream of requests (as happens with the Session push notification server) we incur a massive CPU cost every time we finish processing messages and go back to wait (via zmq::poll) for more.

In testing a simple ZMQ (no OxenMQ) client/server that establishes 2000 connections to a server and then has the server send a message back on a random connection every 1ms, we get atrocious CPU usage: the proxy thread sits at a constant 100% of a core. Virtually all of that time is spent inside the poll call itself, though, so we aren't really bottlenecked by how much can go through the proxy thread: in such a scenario the poll call burns its CPU and then returns right away, we process the queue of messages, and return to another poll call. If lots of messages arrived in that time (because messages are coming fast and the poll was slow) then we process a large batch all at once before going back to the poll, so the main consequences here are that:

1) We use a huge amount of CPU.
2) We introduce latency in a busy situation because the CPU has to make the poll call (e.g. 1.5ms) before the next message can be processed.
3) If traffic is very bursty then the latency can manifest another problem: in the time it takes to poll, we can accumulate enough incoming messages to overfill our internal per-category job queue, which was happening in the SPNS.

(I also tested with 20k connections, and the poll time scaled linearly: we still processed everything, but in larger chunks, because every poll call took about 15ms, so we'd have about 15 messages at a time to process, with added latency of up to 15ms.)
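
For illustration, here is a minimal sketch of the kind of poll loop described above (not the actual OxenMQ proxy loop or the test script; it assumes cppzmq, and the socket setup is elided). The point is that every pass through the loop hands the kernel the full list of ~2000 sockets again, even when only one of them has anything to read:

```cpp
#include <zmq.hpp>
#include <vector>

int main() {
    zmq::context_t ctx;
    std::vector<zmq::socket_t> sockets;
    // ... create and connect ~2000 outbound sockets here (elided) ...

    // Build the pollitem list once; zmq::poll still has to hand the whole
    // list to the kernel inside every call.
    std::vector<zmq::pollitem_t> items;
    for (auto& s : sockets)
        items.push_back({s.handle(), 0, ZMQ_POLLIN, 0});

    while (true) {
        // With 2000 sockets this call alone costs ~1.5ms of (mostly system)
        // CPU time on every iteration, even if only one socket has data.
        zmq::poll(items.data(), items.size(), -1);

        for (size_t i = 0; i < items.size(); i++) {
            if (items[i].revents & ZMQ_POLLIN) {
                zmq::message_t msg;
                while (sockets[i].recv(msg, zmq::recv_flags::dontwait)) {
                    // ... process msg ...
                }
            }
        }
    }
}
```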

Switching to epoll drastically reduces the CPU usage in two ways:

1) It's massively faster by design: there's a single setup step that communicates all the polling details to the kernel, and we only have to redo it when our set of zmq sockets changes (which is relatively rare).
2) We can further reduce CPU time because epoll tells us which sockets need attention, so if only 1 connection out of the 2000 sent us something we only have to check that single socket for messages. (In theory we could do the same with zmq::poll by querying each socket for available events, but in practice that doesn't improve anything over just trying to read from them all.) A sketch of this pattern follows below.
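
As a rough illustration of that pattern (not the code in this PR): zmq sockets expose a file descriptor via the ZMQ_FD socket option, which can be registered with epoll once; after epoll_wait we check ZMQ_EVENTS on just the flagged sockets and drain them. The sketch below assumes cppzmq and Linux, and elides the socket setup:

```cpp
#include <zmq.hpp>
#include <sys/epoll.h>
#include <vector>

int main() {
    zmq::context_t ctx;
    std::vector<zmq::socket_t> sockets;
    // ... create and connect the outbound sockets here (elided) ...

    // One-time registration: only needs to be redone when the socket set changes.
    int epfd = epoll_create1(0);
    for (size_t i = 0; i < sockets.size(); i++) {
        epoll_event ev{};
        ev.events = EPOLLIN;
        ev.data.u64 = i;  // remember which socket this fd belongs to
        int fd = sockets[i].get(zmq::sockopt::fd);
        epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    }

    std::vector<epoll_event> ready(sockets.size());
    while (true) {
        // Blocks until some socket's fd signals activity; the kernel already
        // knows the full socket list, so the per-iteration cost is tiny.
        int n = epoll_wait(epfd, ready.data(), ready.size(), -1);
        for (int j = 0; j < n; j++) {
            auto& sock = sockets[ready[j].data.u64];
            // ZMQ_FD is edge-triggered-ish: keep reading until ZMQ_EVENTS no
            // longer reports ZMQ_POLLIN, otherwise messages can be missed.
            while (sock.get(zmq::sockopt::events) & ZMQ_POLLIN) {
                zmq::message_t msg;
                if (!sock.recv(msg, zmq::recv_flags::dontwait)) break;
                // ... process msg ...
            }
        }
    }
}
```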

In my straight zmq test script, using epoll instead reduced CPU usage in the sends-every-1ms scenario from a constant, pegged 100% of a core to an average of 2-3% of a single core. (Moreover, this CPU usage level didn't noticeably change when using 20k connections instead of 2k.)