quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Transmit directly from connection tasks #1729

Closed Ralith closed 6 months ago

Ralith commented 11 months ago

This allows outgoing data to parallelize perfectly. Initial informal testing suggests a performance improvement for bulk data, likely due to reduced allocation and cross-task messaging. A larger performance benefit should be expected for endpoints hosting numerous connections.

I'd left the original userspace-multiplexed transmit strategy in place for a long time because I assumed the UDP socket had a mutex-guarded send buffer, so leaning into task-parallelism for sending would just lead to contention. This turns out to be complete nonsense, at least on Linux: outgoing UDP datagrams are buffered by the kernel with dynamically allocated memory, and the primitive NIC queuing operation is apparently scalable, as borne out by empirical testing. The following minimal test scales almost linearly with physical parallelism on Linux:

use std::{
    net::{SocketAddr, UdpSocket},
    thread,
    time::Instant,
};

fn main() {
    let sock = UdpSocket::bind("[::]:0").unwrap();
    let target: SocketAddr = "[::1]:1234".parse().unwrap();
    let start = Instant::now();
    const DATAGRAMS: usize = 1_000_000;
    thread::scope(|scope| {
        let threads = thread::available_parallelism().unwrap().get();
        dbg!(threads);
        let each = DATAGRAMS / threads;
        let sock = &sock;
        for _ in 0..threads {
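            // Each thread hammers the same shared socket; near-linear scaling
            // suggests the kernel's UDP send path doesn't serialize senders.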
            scope.spawn(move || {
                for _ in 0..each {
                    sock.send_to(b"hello, world", target).unwrap();
                }
            });
        }
    });
    let seconds = start.elapsed().as_secs_f32();
    println!("{} pps", DATAGRAMS as f32 / seconds);
}

It'd be interesting to see how this compares on other major platforms, but Linux's drastic improvement and the simplification of Quinn's internals are enough to satisfy me that we should go this way.

Ralith commented 11 months ago

Hmm, this needs a mechanism to broadcast wakeups when the UDP socket is backpressured...

edit: solved by duplicating (try_clone-ing) the socket
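
For reference, here is a minimal sketch of that interim idea (not the code that ended up in this PR; the function and addresses are hypothetical, and it assumes tokio with its net and rt features): each connection task wraps its own clone of the OS socket, so every task registers its own writability interest with the runtime instead of waiting on a broadcast wakeup from a single shared socket.

use std::net::UdpSocket as StdUdpSocket;

use tokio::net::UdpSocket;

async fn spawn_senders(sock: StdUdpSocket, tasks: usize) -> std::io::Result<()> {
    // Tokio requires the socket to be in non-blocking mode before wrapping it.
    sock.set_nonblocking(true)?;
    for _ in 0..tasks {
        // Duplicate the underlying socket handle for this task.
        let udp = UdpSocket::from_std(sock.try_clone()?)?;
        tokio::spawn(async move {
            // ... drive one connection here, transmitting via udp ...
            let _ = udp.send_to(b"ping", "[::1]:1234").await;
        });
    }
    Ok(())
}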

Ralith commented 11 months ago

I've replaced the original try_clone_box mechanism with the runtime's built-in concurrent I/O support, which exists in both tokio and async-std. This required a somewhat fiddly extension of the runtime abstraction layer, because for both runtimes, concurrent I/O is only possible if you use async fn interfaces, probably due to self-reference somewhere in the critical futures. Luckily, the fiddly bits are isolated almost exclusively within runtime.rs, with both implementations and callers remaining straightforward. See discussion at https://github.com/tokio-rs/tokio/pull/6226 for context.
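
To see why the async fn shape matters (illustration only; this shows the runtime capability being relied on, not Quinn's actual abstraction): tokio's UdpSocket::send_to is an async fn taking &self, so any number of tasks can share one socket and send concurrently with no user-space locking around the send path.

use std::sync::Arc;

use tokio::net::UdpSocket;

async fn concurrent_sends() -> std::io::Result<()> {
    let sock = Arc::new(UdpSocket::bind("[::]:0").await?);
    let mut handles = Vec::new();
    for _ in 0..4 {
        let sock = sock.clone();
        // Every task sends through the same socket; the runtime tracks
        // readiness and wakeups for each of them independently.
        handles.push(tokio::spawn(async move {
            sock.send_to(b"hello", "[::1]:1234").await
        }));
    }
    for handle in handles {
        handle.await.unwrap()?;
    }
    Ok(())
}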

flub commented 11 months ago

Hi, curious to try this out on various other platforms as well. How have you been running benchmarks to compare this?

I guess any perf loss due to losing sendmmsg is offset by the perf gain of not doing the userspace-multiplexing. sendmmsg does give some small benefit when you manage to fill the pipe to many peers in one syscall, but I'd have to spend more time figuring out how often quinn really manages to do that.

Ralith commented 11 months ago

I haven't benchmarked this very rigorously yet, since I don't have convenient access to a machine that's both running Linux (for our most optimized UDP backend) and not a laptop (subject to unpredictable throttling). We could use more data on that, if you're interested. I do plan to do some benchmarking on Windows, but we have neither sendmmsg nor GSO there just yet, so it might be a bit unfair to the status quo.

For a single connection, sendmmsg is strictly worse than GSO, and for multiple connections I think the near-linear speedup from parallelism will be a much bigger win in any case, so I'm not too concerned about regressions, though it's always possible we might miss something silly.

Ralith commented 10 months ago

Rebased to resolve conflicts with recent quinn-udp improvements.

PureWhiteWu commented 10 months ago

Wow, I tried this PR, and it perfectly solves our problem! I benchmarked on our Linux server (256 cores, with the Quinn server pinned via taskset to 16 cores), and the server's send bandwidth across 4500 connections rose from about 300 MB/s to 900 MB/s (the desired rate)! Great work, thank you very much! Really looking forward to this PR being merged!

PureWhiteWu commented 10 months ago

Also, just out of curiosity: is there any way to bring sendmmsg back alongside the perfect parallelization?

CPU usage rose from about 400% to 680% with this PR (at a fixed sending rate of 900 MB/s), and the share of time spent in syscalls (previously sendmmsg, now sendmsg) rose from 25% to 45%.

If there's no way to achieve this, then I think the parallelization is more important.

PureWhiteWu commented 10 months ago

It seems that at 9000 connections (pinned via taskset to 48 cores) there are still some problems.

Here's the flamegraph for 9000 connections:

flame.html.zip

Ralith commented 10 months ago

> Wow, I tried this PR, and it perfectly solves our problem!

Awesome, thanks for the report! That's exactly the kind of data I've been hoping for, since this seems to be a no-op on our single-connection benchmarks despite reduced wakeups and allocation.

> Also, just out of curiosity: is there any way to bring sendmmsg back alongside the perfect parallelization?

When sending from a single connection at a time, we're almost invariably sending to a single address, so sendmmsg has nothing to offer compared to the segmentation offload we already have. It might be interesting to play with MAX_TRANSMIT_SEGMENTS in quinn-proto/connection/mod.rs to see if that improves your performance further; I think this change may have removed the drawbacks to raising that number, and perhaps we should do away with the additional limit entirely.

Hypothetically we could have one sendmmsg task per core, handling transmits for a proportional subset of connections. This could hypothetically reduce the number of system calls under some workloads, at the cost of greatly increased complexity to coordinate connections. It's unclear if this could actually realize a performance benefit in practice. QUIC seems to benefit far more from GSO than merely reducing system call rate; it seems that the slowest part of sendmsg is the per-datagram work in the kernel.
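
For concreteness, the hypothetical per-core batching design might be shaped roughly like the sketch below; nothing like this exists in this PR, and all names are made up. Connections would hash to one of N sender tasks, and each task would gather queued datagrams so that a batch could be flushed with a single sendmmsg call.

use std::net::SocketAddr;

use tokio::sync::mpsc;

// Hypothetical handle for one outgoing datagram.
struct Transmit {
    destination: SocketAddr,
    contents: Vec<u8>,
}

fn spawn_send_shards(shards: usize) -> Vec<mpsc::Sender<Transmit>> {
    (0..shards)
        .map(|_| {
            let (tx, mut rx) = mpsc::channel::<Transmit>(1024);
            tokio::spawn(async move {
                while let Some(first) = rx.recv().await {
                    // Opportunistically drain whatever else is already queued.
                    let mut batch = vec![first];
                    while batch.len() < 64 {
                        match rx.try_recv() {
                            Ok(transmit) => batch.push(transmit),
                            Err(_) => break,
                        }
                    }
                    // A real implementation would flush `batch` with sendmmsg here.
                    drop(batch);
                }
            });
            tx
        })
        .collect()
}

Each connection would then pick a shard once (say, by connection ID modulo the number of shards) and push all of its transmits there; the coordination cost mentioned above comes from routing every transmit through these extra channels.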

> CPU usage rose from about 400% to 680% with this PR (at a fixed sending rate of 900 MB/s), and the share of time spent in syscalls (previously sendmmsg, now sendmsg) rose from 25% to 45%.

Given the speedup, this is mostly indicating that your cores are spending more time doing useful work, rather than sitting idle waiting for the endpoint task to do something.

> It seems that at 9000 connections (pinned via taskset to 48 cores) there are still some problems.

There will always be some level of activity that can't be served within a given hardware budget, so it's not immediately obvious to me that this is a problem, unless your CPU is obviously being underutilized. To avoid catastrophic failure under load, you can configure Quinn to limit the number of active connections, or even dynamically adjust that limit based on measured load.

There's always some wiggle room for tuning, of course. Try playing with MAX_TRANSMIT_SEGMENTS to get more mileage out of each call, and maybe your UDP socket's SO_SNDBUF and kernel receive buffer sizes in case packets are being lost there.
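
If you do experiment with socket buffer sizes, a sketch of enlarging SO_SNDBUF and SO_RCVBUF before handing the socket to the endpoint could look like the following, assuming the socket2 crate; the 4 MiB values and the port are purely illustrative, and on Linux the effective sizes are still capped by the net.core.wmem_max and net.core.rmem_max sysctls.

use std::net::UdpSocket;

use socket2::{Domain, Protocol, Socket, Type};

fn udp_socket_with_larger_buffers() -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::IPV6, Type::DGRAM, Some(Protocol::UDP))?;
    // Ask the kernel for larger per-socket send and receive buffers.
    socket.set_send_buffer_size(4 * 1024 * 1024)?;
    socket.set_recv_buffer_size(4 * 1024 * 1024)?;
    socket.bind(&"[::]:4433".parse::<std::net::SocketAddr>().unwrap().into())?;
    Ok(socket.into())
}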

Nothing jumps out at me as too sketchy Quinn-wise on the flamegraph. However, it's interesting that <tokio::time::sleep::Sleep as core::future::future::Future>::poll seems to take up almost 20% of CPU time per thread(?), and mostly in trying to acquire a mutex. This is probably contention around Tokio's global timer wheel. I brought this up briefly on the Tokio discord and it sounds like there's significant opportunity for reducing contention there, and they would likely welcome contributions to optimize it.

PureWhiteWu commented 10 months ago

Thank you very much for such a detailed response!

> There will always be some level of activity that can't be served within a given hardware budget, so it's not immediately obvious to me that this is a problem, unless your CPU is obviously being underutilized.

Just a clarification: in the 9000-connection benchmark, the CPU limit was 4800% but only about 1400% was utilized, so I don't think this comes down to CPU usage.

I will try MAX_TRANSMIT_SEGMENTS later.

Really looking forward to this PR being merged!

Ralith commented 10 months ago

> Just a clarification: in the 9000-connection benchmark, the CPU limit was 4800% but only about 1400% was utilized, so I don't think this comes down to CPU usage.

Could be a buffer size issue, and/or receive-side bottlenecking on the endpoint task. I have some ideas for parallel receive tasks using SO_REUSEADDR which could help, though parallelizing intra-endpoint ops will take some care.

Scaling to truly large numbers will ultimately need some sort of load-balancing between multiple endpoints, potentially even on multiple machines, but we've definitely got room for improvement here still.

flub commented 7 months ago

So I finally got around to running some reasonably decent tests on this. I used the setup we rely on to catch performance regressions on our end; it's not perfect, but it's a useful indicator.

These numbers are all in Gbps, but the relative differences matter more than the absolute values.

|         | 0.10 | 0.11 | parallel-transmit |
|---------|------|------|-------------------|
| 1-to-1  | 1.96 | 2.13 |              0.83 |
| 1-to-3  | 5.96 | 2.74 |              2.71 |
| 1-to-5  | 5.40 | 2.28 |              3.18 |
| 1-to-10 | 5.33 | 2.12 |              3.09 |
| 2-to-2  | 4.19 | 3.87 |              1.85 |
| 2-to-4  | 7.32 | 4.46 |              4.07 |
| 2-to-6  | 7.64 | 4.55 |              4.72 |
| 2-to-10 | 7.82 | 4.25 |              5.30 |

The 0.10 column is the current baseline. The 0.11 column is our stack ported to the current quinn main branch, and the final column is the same stack ported to the parallel-transmit branch.

There is already a pretty significant regression in the main branch; that porting was relatively easy, so I'm fairly confident the number is realistic. This is a bit scary: I was hoping the parallel-transmit branch would compensate for it, but unfortunately it doesn't really. That port was rather involved, and there may well be improvements still to be found in it, especially given that others have reported good speedups. I know there are some places where I took shortcuts that won't do any good. But I'm about to go on holiday, so I wanted to post this early.

Now, obviously we do weird things with our stack at https://github.com/n0-computer/iroh/: we present a single AsyncUdpSocket to quinn and send via different sockets depending on the destination and whether we managed to hole-punch, etc. We know that's not optimal. Regardless, so far this is somewhat disappointing to me and will need further investigation on our end.

Ralith commented 7 months ago

Thanks for the report! Consider opening a separate issue for regressions against the 0.11 base, which are out of scope for this PR. A few questions:

Ralith commented 7 months ago

Rebased.

Ralith commented 7 months ago

Rebased.