quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0

Unify connection and endpoint drivers #1219

Closed · Ralith closed this 11 months ago

Ralith commented 3 years ago

Fixes https://github.com/quinn-rs/quinn/issues/914. This is a high-risk change, so we probably don't want it in 0.8, but I'm getting a head start on it all the same.

Ralith commented 3 years ago

This could probably use some more iteration, but is now functionally complete. Next step is careful benchmarking.

djc commented 2 years ago

Is this still waiting for benchmarking?

Ralith commented 2 years ago

Initial results were inconclusive, with increases in ACK frames sent, presumed due to reduced internal buffering. I've therefore been waiting for @Matthias247 to finish his WIP delayed ACK support which should remove that variable.

DemiMarie commented 2 years ago

How much overhead results from the use of atomic operations and mutexes? I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations. Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
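
For reference, the sharding idea described here would look roughly like the sketch below: one UDP socket per core, all bound to the same port with SO_REUSEPORT, each backing its own single-threaded endpoint. This is a hedged sketch using the socket2 crate and is not something this PR does; set_reuse_port is Unix-only and, depending on the socket2 version, may require the crate's "all" feature.

use socket2::{Domain, Protocol, Socket, Type};
use std::net::{SocketAddr, UdpSocket};

// Bind one SO_REUSEPORT socket per shard; the kernel then spreads incoming
// datagrams across them by flow hash.
fn bind_shard(addr: SocketAddr) -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_reuse_port(true)?;
    socket.bind(&addr.into())?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    let addr: SocketAddr = "0.0.0.0:4433".parse().unwrap();
    let shards = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Each of these sockets would back its own single-threaded (!Send) endpoint.
    let sockets: Vec<UdpSocket> = (0..shards).map(|_| bind_shard(addr)).collect::<Result<_, _>>()?;
    println!("bound {} shards on {}", sockets.len(), addr);
    Ok(())
}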

DemiMarie commented 2 years ago

RDMA verbs are thread-safe, but I wonder if their performance suffers less because they do less reference counting.

Ralith commented 2 years ago

How much overhead results from the use of atomic operations and mutexes

Approximately none. CPU time is mostly spent in crypto, syscalls, and memory management, IIRC (regardless of this PR). If you want to profile and experiment yourself, that would be very welcome!

I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations

Very early versions of quinn took this approach. It didn't seem to help.

Users would be expected to shard their program across cores using SO_REUSEPORT or similar.

This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.

DemiMarie commented 2 years ago

Users would be expected to shard their program across cores using SO_REUSEPORT or similar.

This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.

My understanding is that load balancers can only send the packet to the correct device, and something else has to get the packet to the correct thread. This can be done using hardware or software RSS, or a custom eBPF program. One still needs to handle the occasional misrouted packet due to migration, but that is rare enough that it can be a slow path.

Ralith commented 2 years ago

That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.

DemiMarie commented 2 years ago

That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.

That assumes you have unlimited addresses, but I see your point. Also forwarding of misdirected packets needs to be addressed.

Ralith commented 2 years ago

That assumes you have unlimited addresses

It assumes you have more addresses available than CPU cores. Recall that UDP ports alone trivially give you ~2^16 addresses.

Also forwarding of misdirected packets needs to be addressed.

A load balancer routing packets deliberately does not have that problem.

DemiMarie commented 2 years ago

Also forwarding of misdirected packets needs to be addressed.

A load balancer routing packets deliberately does not have that problem.

Depends on whether the load balancer can use connection IDs, or is only able to use the 5-tuple. Lots of switches are only capable of the latter, and hardware load balancers are much more expensive.

Matthias247 commented 2 years ago

I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.

But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might route either just by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.

DemiMarie commented 2 years ago

I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.

But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might route either just by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.

Migration is sufficiently rare that one can explicitly send misrouted packets to the correct server.

Ralith commented 2 years ago

This has been rebased, but the DelayQueue it introduces depends on tokio::time::Sleep, so additional work is required to be runtime-independent.
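
For context, "runtime-independent" here would mean the delay queue holds its timer behind a small abstraction rather than tokio::time::Sleep directly. Below is a hypothetical sketch of such an abstraction with a Tokio backend; the trait and method names are illustrative, not quinn's actual API.

use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::Instant;

/// Hypothetical runtime-agnostic timer; each async runtime provides an impl.
trait AsyncTimer: Send {
    /// Reschedule the timer to fire at `deadline`.
    fn reset(self: Pin<&mut Self>, deadline: Instant);
    /// Resolve once the deadline has passed.
    fn poll_elapsed(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()>;
}

/// Tokio backend: delegate to tokio::time::Sleep.
impl AsyncTimer for tokio::time::Sleep {
    fn reset(self: Pin<&mut Self>, deadline: Instant) {
        tokio::time::Sleep::reset(self, tokio::time::Instant::from_std(deadline));
    }
    fn poll_elapsed(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        std::future::Future::poll(self, cx)
    }
}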

djc commented 2 years ago

Would it be helpful to land some of the smaller refactoring commits already, to clarify this PR?

Ralith commented 2 years ago

Sure, I can split those out tonight.

WilliamVenner commented 2 years ago

Hi, I gave your branch a try while doing some benchmarks of my own. I did not find any significant difference in the transfer speed of a 500 MiB file.

Code I used: https://github.com/WilliamVenner/quictransfer/blob/master/src/main.rs

Ralith commented 2 years ago

Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.

Matthias247 commented 2 years ago

Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.

This is about efficiency, not about performance/throughput - right? There's no loss in that setup? I think the evaluation for efficiency should be made on the benchmarks in the quinn repository. Everything else just adds code on top of it which makes it harder to understand what actually causes the degradation.

Ralith commented 2 years ago

I understood that post to be reporting decreased throughput when CPU-bound: 2412 Mbps without this branch, and 1913.57 Mbps with it. Certainly a minimal standalone repro would be more actionable.

Matthias247 commented 2 years ago

Maybe it's because they use async-std in the benchmark, which is multithreaded, so crypto runs on a different thread than IO. Combined with them picking a cipher suite that performs an order of magnitude worse (ChaCha20) for an unknown reason, the now-combined IO/crypto thread has more work to do.
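
The cipher-suite variable can be taken out of such benchmarks by restricting the client to the AES-GCM TLS 1.3 suites, so ChaCha20 is never negotiated. A hedged sketch against the rustls 0.20/0.21 builder API (an assumption about the version in use); the empty root store is a placeholder, and how the result is handed to quinn's ClientConfig depends on the quinn version.

use std::sync::Arc;

// Build a rustls client config limited to AES-GCM TLS 1.3 suites, so a
// slower ChaCha20 negotiation can't skew CPU-bound benchmarks.
fn aes_only_tls() -> Result<Arc<rustls::ClientConfig>, rustls::Error> {
    let roots = rustls::RootCertStore::empty(); // placeholder: add real roots here
    let config = rustls::ClientConfig::builder()
        .with_cipher_suites(&[
            rustls::cipher_suite::TLS13_AES_128_GCM_SHA256,
            rustls::cipher_suite::TLS13_AES_256_GCM_SHA384,
        ])
        .with_safe_default_kx_groups()
        .with_protocol_versions(&[&rustls::version::TLS13])? // QUIC requires TLS 1.3
        .with_root_certificates(roots)
        .with_no_client_auth();
    Ok(Arc::new(config))
}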

Matthias247 commented 2 years ago

When running this with uploads from multiple clients, I get a panic in the delay queue:

 RUST_BACKTRACE=1 cargo run --release -- -c 24 --download-size 0 --upload-size 100M --stats

    Finished release [optimized + debuginfo] target(s) in 1.55s
     Running `/mnt/c/Users/matth/Code/rust/quinn/target/release/bulk -c 24 --download-size 0 --upload-size 100M --stats`

thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:84:21
stack backtrace:
   0: std::panicking::begin_panic
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
   1: quinn::delay_queue::DelayQueue<T>::poll
   2: quinn::endpoint::EndpointInner::drive_timers
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:499:39
   3: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:307:23

Matthias247 commented 2 years ago

Debug build panics at a different place:

thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:235:13
stack backtrace:
   0: std::panicking::begin_panic
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
   1: <slab::Slab<T> as core::ops::index::IndexMut<usize>>::index_mut
             at /home/matthias/.cargo/registry/src/github.com-1ecc6299db9ec823/slab-0.4.7/src/lib.rs:1192:18
   2: quinn::delay_queue::DelayQueue<T>::list_unlink
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:235:13
   3: quinn::delay_queue::DelayQueue<T>::unlink
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:224:9
   4: quinn::delay_queue::DelayQueue<T>::reset
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9
   5: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:331:29

and

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `Some(Timer(19))`,
 right: `None`: head of list has no predecessor', quinn/src/delay_queue.rs:218:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: quinn::delay_queue::DelayQueue<T>::unlink
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:218:13
   5: quinn::delay_queue::DelayQueue<T>::reset
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9

and

thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
  left: `None`,
 right: `Some(Timer(9))`: successor links to head', quinn/src/delay_queue.rs:79:21
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: quinn::delay_queue::DelayQueue<T>::scan_bottom
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:79:21
   5: quinn::delay_queue::DelayQueue<T>::poll
             at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:61:34
   6: quinn::endpoint::EndpointInner::drive_timers

Matthias247 commented 2 years ago

Due to the panics on all multi-connection tests, I only got benchmark results for a single connection. Looks like an improvement, however!

Main

Throughput: 580 MB/s
RTT: Client 180 µs, Server 170 µs
CWND Server: 1103736476 (extremely ridiculous)

Unify Drivers

Throughput: 610 MB/s
RTT: Client 240 µs, Server 440 µs
CWND Server: 51262370

djc commented 2 years ago

So throughput improves 5%, but latency is 33%-160% worse?

Ralith commented 2 years ago

I wouldn't read too heavily into μs variations in latency. That's probably either noise or inconsequential constant variation.

Good catch on the multi-connection failures, will investigate.

Ralith commented 2 years ago

reset was corrupting the intrusive list. Fixed.

Matthias247 commented 2 years ago

I wouldn't read too heavily into μs variations in latency.

Agreed. It seems to vary between runs within the 100-500 µs range, and that's all fine. The multi-millisecond delays that occur in other tests are more concerning, since they are symptoms of excessive buffering.

I tried again with multiple connections, and it seems to work fine. I did a 24-connection (1 connection per core) upload and download. Results are pretty much equivalent between main and unify-drivers, both of which are around:

24 connections

Throughput: Each stream 13.37 MB/s, total 316 MB/s
RTT: Client 72 ms, Server 60 ms
CWND Server: 50651475

24 connections upload

Throughput: Each stream 10.10 MB/s, total 242 MB/s
RTT: Client 390 µs, Server 3.3 ms
CWND Client: 304000

Based on that, I think merging is fine. As a follow-up, it could be investigated how to avoid the large RTT under heavy server load (I assume it's because the connections all produce packets for the endpoint, which can't keep up with sending them).

Ralith commented 2 years ago

Note that the bulk benchmark uses a single thread for the server. If I use a multithreaded runtime for the server (but retain one thread per client) and use cores / 2 clients (so one core is available for each half of a connection) I see a ~15% speedup on main (presumably from parallel crypto), and a ~15% slowdown on this branch (presumably from inter-thread comms overhead and lock contention).

On the other hand, in a hypothetical configuration with a totally discrete server endpoint per core, we'd almost certainly see a linear speedup, on top of the baseline efficiency and consistency improvements on this branch. I think that's the case that makes sense to optimize for, since it scales better in large-scale use which must distribute load across machines anyway, and for very small-scale use like game servers where most CPU is being spent elsewhere, but it's not a 100% clean win, especially for naive/mid-scale users that won't have load balancing as a matter of course.
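
For concreteness, the "discrete server endpoint per core" layout would look something like the sketch below: each OS thread runs its own current-thread Tokio runtime and its own endpoint on a distinct port, sharing nothing. This assumes the quinn 0.9-era Endpoint::server/accept API; spawn_shards, the port scheme, and the error handling are illustrative only.

// Sketch only: one fully independent endpoint per core, each on its own port.
fn spawn_shards(base_port: u16, cores: u16, server_config: quinn::ServerConfig) {
    for i in 0..cores {
        let config = server_config.clone();
        std::thread::spawn(move || {
            // Single-threaded runtime: nothing is shared across cores.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .unwrap();
            rt.block_on(async move {
                let addr = std::net::SocketAddr::from(([0, 0, 0, 0], base_port + i));
                let endpoint = quinn::Endpoint::server(config, addr).unwrap();
                // A front-end load balancer (or DNS) would spread clients
                // across the per-core addresses.
                while let Some(connecting) = endpoint.accept().await {
                    tokio::spawn(async move {
                        if let Ok(_connection) = connecting.await {
                            // accept streams and serve them here
                        }
                    });
                }
            });
        });
    }
}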

Ralith commented 2 years ago

If you discount delay_queue.rs (which I think is mostly fair, since it's strongly encapsulated), this is net -119 LoC.

Ralith commented 2 years ago

Since it's neat, might be useful in other projects, and much easier to benchmark this way, I've split delay_queue.rs from this out into a separate crate, timer-queue. Should we depend on that rather than having a separate copy? If so, should we move maintenance of it under quinn-rs?

Ralith commented 1 year ago

Rebased; friendly bump.

Ralith commented 1 year ago

Dropped a superfluous refactoring commit.

djc commented 1 year ago

Sorry for the slow response to this -- prioritizing soaking up rustls review bandwidth while it's available.

Ralith commented 1 year ago

prioritizing soaking up rustls review bandwidth while it's available.

Makes sense; that's definitely a precious resource! I don't think this is blocking anything immediate; @rrichardson's interest in horizontal scaling as discussed in #914 can be pursued independently at the proto layer.

mxinden commented 1 year ago

👋 rust-libp2p maintainer here,

Is there any way I could help move this pull request forward?

Ways that come to my mind:

My motivation:

On a general note, thank you for all your work! With the release of QUIC support in rust-libp2p via quinn-proto:

Ralith commented 1 year ago

Thanks for sharing those numbers, that's really cool! I think knowing there's demand is helpful already, but thoughtful review of the core 294e5d7 commit that I think @djc is mainly hung up on certainly wouldn't hurt.

DemiMarie commented 1 year ago

@mxinden @Ralith This does not remove all unbounded channels. What would it take to remove the rest of them?

djc commented 1 year ago

Is there any evidence that this particular usage of unbounded channels is bad? If you are so worried about this, feel free to submit a PR and we'll consider it.

Ralith commented 1 year ago

The remaining "unbounded" channel introduced in this PR is, in fact, bounded by the number of concurrently active connections. It could be replaced with various other sorts of queue to make that more obvious (an intrusive list might be cool), but the behavior would be the same.

mxinden commented 1 year ago

@mxinden @Ralith This does not remove all unbounded channels. What would it take to remove the rest of them?

@DemiMarie are you referring to the unbounded channel used by a connection to wake up the endpoint?

    /// `handle` is sent here when `wake` is called, prompting the endpoint to drive the connection
    dirty: mpsc::UnboundedSender<ConnectionHandle>,

https://github.com/quinn-rs/quinn/pull/1219/files#diff-60e27999f6922becedd55cb952eb6a67db93ed5d136f2e22451d0ce83a73b89bR779-R780

I don't think this poses any issues. As @Ralith says above, they are bounded by the number of connections. I consider the usage a neat pattern of many-to-wake-one, transporting the id along the way.

Is there any evidence that this particular usage of unbounded channels is bad?

In case I am not missing something and indeed we are talking about the dirty unbounded channel, I don't see a downside to the usage.

In case one would want to ban the usage of unbounded channels across the codebase, one could also use a bounded channel. Given that the sender is cloned for each connection, there is a guaranteed slot per sender, thus always capacity to send, no .await needed. (Not advocating for it.)

The channel’s capacity is equal to buffer + num-senders.

https://docs.rs/futures/latest/futures/channel/mpsc/fn.channel.html
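
A minimal sketch of the many-to-wake-one pattern described above (not quinn's actual internals; ConnectionHandle, ConnectionState, and drive_connection are illustrative names): each connection pushes its handle onto a shared channel when it transitions from clean to dirty, and the single endpoint driver drains the channel to learn which connections need driving.

use futures::channel::mpsc;
use futures::StreamExt;

#[derive(Clone, Copy, Debug)]
struct ConnectionHandle(usize);

struct ConnectionState {
    handle: ConnectionHandle,
    dirty: bool,
    wake_endpoint: mpsc::UnboundedSender<ConnectionHandle>,
}

impl ConnectionState {
    /// Called whenever the connection has new work for the endpoint.
    fn mark_dirty(&mut self) {
        if !self.dirty {
            self.dirty = true;
            // At most one entry per live connection is ever queued, so the
            // "unbounded" channel is effectively bounded by the number of
            // concurrently active connections.
            let _ = self.wake_endpoint.unbounded_send(self.handle);
        }
    }
}

async fn endpoint_driver(mut dirty_rx: mpsc::UnboundedReceiver<ConnectionHandle>) {
    while let Some(handle) = dirty_rx.next().await {
        // The real driver would poll the identified connection here and
        // clear its dirty flag once its pending work has been flushed.
        println!("driving connection {:?}", handle);
    }
}

fn main() {
    let (tx, rx) = mpsc::unbounded();
    let mut conn = ConnectionState { handle: ConnectionHandle(0), dirty: false, wake_endpoint: tx };
    conn.mark_dirty();
    drop(conn); // dropping the last sender lets the driver loop terminate
    futures::executor::block_on(endpoint_driver(rx));
}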

thomaseizinger commented 1 year ago

In case one would want to ban the usage of unbounded channels across the codebase

FWIW, clippy can help with this using its disallowed_methods lint:

disallowed-methods = [
  { path = "futures::channel::mpsc::unbounded", reason = "does not enforce backpressure" },
]
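
(For the above to take effect, the table would live in a clippy.toml at the crate or workspace root; Clippy's disallowed_methods lint then flags any call to the listed paths.)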

djc commented 1 year ago

@mxinden thanks for the careful review!

Ralith commented 1 year ago

Tweaked per @mxinden's feedback (thanks!) and rebased.

Arqu commented 1 year ago

Hey 👋

We're using quinn in iroh. I've run some quick benchmarks with a branch propped up to use this PR. I've noticed a significant drop in throughput (40-50%) when serving data from one node to 2 or more clients. One-to-one transfers don't seem significantly impacted yet (most probably because other parts of our code don't saturate the pipe).

From my observations with this PR, the CPU on my test rig seems to peg at around 220% (i.e. 2.2 cores), and additional clients just dilute the throughput between them. The same test against 0.9.3 seems able to use as much CPU as it asks for, scaling ~linearly with clients: I'm seeing some 700-900% in the 1-to-10 test cases.

Didn't dive too deep into it yet, but my initial guess would be that we have either some mutex locking or polling happening that makes some run loop starve itself.

Let me know if there's anything specific you would like me to poke at or test out.

Ralith commented 1 year ago

Interesting, thank you for the report! We should try to understand that, and hopefully mitigate, before merging this. I do think this is an important foundation for an ultimately much more parallel quinn, but undermining our present scalability for a hypothetical future isn't ideal.

Ralith commented 11 months ago

Closing as this is quite stale and understood to be a suboptimal direction due to reduced opportunity for work stealing and parallelism.