quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Performance under tokio multi_thread vs current_thread runtime #1433

Open shaun-cox opened 1 year ago

shaun-cox commented 1 year ago

We've built a router of sorts using Quinn, and so far, have been thrilled with how easy it's been to get up and going with Quinn's nice API.

Now we're in the phase of performance measuring and our metric of interest is "megabits per core", or how much data can we move through our router per unit of CPU consumption.

Our use case is not a typical "bulk throughput" scenario where we have all the data up front, need to flow it all quickly, and measure goodput. Our scenario is small packets sent by clients at paced intervals (e.g. 160-byte chunks sent every 2.5ms, such as an audio codec's output). The router receives many streams of this nature and simply relays the chunks out on some other QUIC send streams.

The surprising anomaly we've discovered is that, for an identical offered load to the router, switching from #[tokio::main(flavor = "multi_thread", worker_threads = 4)] to #[tokio::main(flavor = "current_thread")] reduces CPU consumption by about 71% (from 1172 millicores to 335 millicores).
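For concreteness, the two configurations being compared, expressed via tokio's runtime builder (run_router here is just a placeholder for our router's entry point):

```rust
use tokio::runtime::Builder;

fn main() {
    // Equivalent of #[tokio::main(flavor = "multi_thread", worker_threads = 4)]:
    let multi = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();

    // Equivalent of #[tokio::main(flavor = "current_thread")]:
    let single = Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    // Run the same router future on one runtime or the other and compare
    // CPU consumption for identical offered load, e.g.:
    // multi.block_on(run_router());
    // single.block_on(run_router());
    let _ = (multi, single);
}
```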

At first, we thought this was mainly due to the appearance of Mutex::lock_contended as the top-most hit showing in perf, but upon closer inspection, we've also noticed that when using the multi-threaded tokio runtime, Quinn ends up generating about 3x the number of outgoing UDP datagrams from the router.

We're currently able to achieve only about 466 Mbits/core in this single-threaded runtime scenario, which seems pretty low. We'd envisioned running pods with a 4-core CPU limit and getting higher throughput, but if we do, we achieve only 133 Mbits/core, which is far less efficient and more costly.

We're wondering if anyone can offer an explanation for the 3x increase in outgoing UDP datagrams just from switching runtimes. We assume it has to do with fewer opportunities for internal batching, i.e. fewer chances to notice that multiple stream frames or QUIC packets can be coalesced into the same UDP datagram?

We're also curious about the choice of having to acquire that Mutex protecting state in quinn::ConnectionInner for every poll_read and poll_write on any stream associated with a connection. Isn't this an immediate scalability killer if the tasks that read/write from Quinn streams get spread out across the multi-threaded runtime's thread pool?

Thanks.

Ralith commented 1 year ago

have been thrilled with how easy it's been to get up and going with Quinn's nice API.

Glad to hear it!

Our scenario is small packets sent by clients at paced intervals (e.g. 160 byte chunks sent every 2.5ms, such as by an Audio codec's output)

Under these conditions, per-packet overhead is likely to dominate, so a more relevant metric might be packets per second per core. For example, you'll likely observe that the bulk benchmark in this repository demonstrates much higher bandwidth due to larger packets.

We're wondering if anyone can offer an explanation for the 3x increase in outgoing UDP datagrams just by switching runtimes.

Your inference that this is due to reduced batching is likely correct. Quinn does not presently employ any logic analogous to Nagle's algorithm: as soon as the driver task is aware of data to be sent, it will be sent. If a system tends to perform many small writes, then in a multithreaded context the driver will, to a first approximation, be woken immediately after each write, whereas in a single-threaded context the driver must wait to be polled, which tends to allow greater batching. Supporting evidence for this hypothesis would be a difference in packet sizes. You might also want to try git HEAD (due to become 0.9 Real Soon Now™️), which has substantially reduced the frequency of ACKs; that is another reason you might see more packets sent regardless of application behavior.

Quinn could have mechanisms to better control this. For example, perhaps we could inject a configurable latency into driver-task wake-ups triggered by writes. However, expected performance is higher on single-threaded runtimes regardless, due to reduced contention, and such architectures scale better to multiple hosts. An ideal large-scale QUIC deployment involves a QUIC-aware load balancer and/or the preferred-address mechanism balancing incoming connections across a large number of endpoints, which may be distributed across any number of hosts. A 4-CPU system might host 4 endpoints, for example.
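To illustrate the shape of that, a minimal sketch of endpoint-per-core, assuming one single-threaded runtime and one endpoint per CPU; the server config and run_endpoint loop are placeholders, not Quinn API:

```rust
use std::thread;

use tokio::runtime::Builder;

fn main() {
    // One OS thread per core, each with its own single-threaded runtime.
    // A QUIC-aware load balancer (or the preferred-address mechanism) would
    // spread incoming connections across the endpoints' sockets.
    let handles: Vec<_> = (0..4)
        .map(|i| {
            thread::spawn(move || {
                let rt = Builder::new_current_thread()
                    .enable_all()
                    .build()
                    .unwrap();
                rt.block_on(async move {
                    // Placeholder: bind one endpoint per thread, e.g.
                    // let endpoint = quinn::Endpoint::server(config, ([0, 0, 0, 0], 4433 + i).into())?;
                    // run_endpoint(endpoint).await;
                    let _ = i;
                });
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```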

That said, I understand that smaller deployments might not want to deal with that level of complexity. There is some low hanging fruit for improving Quinn's ability to parallelize processing of many active connections on a single endpoint.

We're also curious about the choice of having to acquire that Mutex protecting state in quinn::ConnectionInner for every poll_read and poll_write on any stream associated with a connection.

Many different threads writing on the same connection is indeed a contention hazard. This architecture was motivated by simplicity and the expectation that highly scalable systems will tend to be endpoint-per-core anyway.

I think finer-grained, stream-scoped synchronization could make sense. This would likely require pushing synchronization down into quinn-proto rather than quinn, but I've been thinking about doing that anyway for the improved connection-granularity parallelism mentioned above.

Matthias247 commented 1 year ago

Quinn ends up generating about 3x the number of outgoing UDP datagrams from the router.

Are those "useful" packets (carrying more user data, like non-retransmitted stream frames) or just more packets (e.g. stream frames that would otherwise be coalesced into a single packet now being transmitted separately, or simply more ACK packets)?

Without that knowledge, it's really hard to say whether 3x is a good thing or a bad thing.

we see a reduction in CPU consumed by about 71% (from 1172 millicores to 335 millicores)

That is the total consumed CPU? That's surprising! If you send more packets for the same amount of user data, it should actually require much more CPU, because the most costly operations are networking syscalls. If your metric is instead the average load per core, it might be lower, since the multithreaded runtime is able to spread load a bit between cores (and "a bit" isn't actually much: it will mostly make some crypto operations run on a different thread than networking, but since networking dominates, it won't scale much further).

Kiddinglife commented 1 year ago

We're wondering if anyone can offer an explanation for the 3x increase in outgoing UDP datagrams just by switching runtimes.

Hey @shaun-cox, if your app is throughput-centric, is it possible for you to accumulate the small writes at the app layer before passing them to the Quinn API?
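For illustration, a hedged sketch of what such accumulation could look like: buffer incoming chunks and flush to a quinn::SendStream once a size threshold is reached or after a short idle deadline (the thresholds here are made up for the example, not recommendations):

```rust
use std::time::Duration;

use tokio::sync::mpsc;

async fn coalesce_and_send(
    stream: &mut quinn::SendStream,
    mut chunks: mpsc::Receiver<Vec<u8>>,
) -> Result<(), quinn::WriteError> {
    const FLUSH_BYTES: usize = 1200; // roughly one datagram's worth of payload
    const FLUSH_AFTER: Duration = Duration::from_millis(5);

    let mut buf: Vec<u8> = Vec::with_capacity(FLUSH_BYTES);
    loop {
        tokio::select! {
            maybe = chunks.recv() => match maybe {
                Some(chunk) => {
                    buf.extend_from_slice(&chunk);
                    if buf.len() >= FLUSH_BYTES {
                        stream.write_all(&buf).await?;
                        buf.clear();
                    }
                }
                None => break, // producer dropped; flush what's left and exit
            },
            // The timer is recreated each iteration, so this flushes after
            // FLUSH_AFTER of inactivity rather than at a fixed deadline.
            _ = tokio::time::sleep(FLUSH_AFTER), if !buf.is_empty() => {
                stream.write_all(&buf).await?;
                buf.clear();
            }
        }
    }
    if !buf.is_empty() {
        stream.write_all(&buf).await?;
    }
    Ok(())
}
```

Of course this trades latency for fewer packets, which may or may not be acceptable.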

shaun-cox commented 1 year ago

Are those "useful" packets

I don't suspect the presence of retransmitted frames, though I can double-check at some point when I get back to perf testing. We're on a pretty clean internal network with >=10GbE adapters/switches, and the offered load is itself paced, so I doubt there is any bursting large enough to overflow a switch buffer. The only difference between the runs is the choice of single- or multi-threaded tokio runtime, so if that were the cause of retransmitted frames, that would also be curious. I would suspect more ACK frames, though.

That is the total consumed CPU? It's surprising!

Yes, that is total CPU consumption.

if your app is throughput-centric, is it possible for you to accumulate the small writes at the app layer before passing them to the Quinn API?

Our app is a router. As such, it's driven by whatever is received from the network (small messages), which it sends somewhere else. We are very latency-sensitive, so adding artificial delay in hopes of accumulating a larger batch is fundamentally at odds with our goals. Whatever is received on a given QUIC stream is generally routed to the same place(s), so to the extent we could keep these receives affinitized to the same core/queue, and not work-stolen by any available core, it would help (I think, based on prior experience) to let the batch form there before we lump-send everything available in that poll to the routed send queues.

Matthias247 commented 1 year ago

You might want to grab the connection's stats and periodically emit them as metrics or in some other way. That will tell you whether more frames are transmitted, and of which types they are.
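For example, a minimal sketch of such periodic sampling, assuming a quinn version that exposes Connection::stats(); the field names (udp_tx, frame_tx, ...) follow the 0.9-era ConnectionStats layout and may differ in other versions:

```rust
use std::time::Duration;

async fn log_stats(conn: quinn::Connection) {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        let stats = conn.stats();
        // Counters are cumulative; diff successive samples to get rates.
        println!(
            "udp datagrams tx: {}, ACK frames tx: {}, STREAM frames tx: {}",
            stats.udp_tx.datagrams,
            stats.frame_tx.acks,
            stats.frame_tx.stream,
        );
    }
}
```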

Kiddinglife commented 1 year ago

Is there a plan to better support the multi-threaded tokio runtime? Sometimes we need more threads to handle workloads other than Quinn networking, e.g. a game server based on a multi-threaded tokio runtime handling networking, collision detection, physics updates, battle logic, data propagation, and more. For collision detection, we have to use multiple threads, and there is no good way to balance such workloads because those calculations require all player position data in the same memory with the lowest latency. If only one-Quinn-per-core is supported, we have to use multiple runtimes, which is sometimes quite complex and unreasonable. Did you consider binding to a specific thread or an async channel to avoid the use of locks when integrating with tokio? I mean, the goal is to not degrade throughput even with a multi-threaded tokio runtime. Thx

Ralith commented 1 year ago

Quinn supports the multithreaded runtime well, and already uses channels to communicate between tasks. Depending on exact workload, you may see better performance in different configurations. If you'd like help optimizing a specific workload, please open a new issue with details.