Ralith closed this pull request 11 months ago.
This could probably use some more iteration, but is now functionally complete. Next step is careful benchmarking.
Is this still waiting for benchmarking?
Initial results were inconclusive, with increases in ACK frames sent, presumed due to reduced internal buffering. I've therefore been waiting for @Matthias247 to finish his WIP delayed ACK support which should remove that variable.
How much overhead results from the use of atomic operations and mutexes? I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations. Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
RDMA verbs are thread-safe, but I wonder if their performance is hurt less due to less reference counting.
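For concreteness, a minimal sketch of the SO_REUSEPORT sharding idea suggested above, assuming the socket2 crate; bind_shard and drive_shard are illustrative names, and the per-core endpoint loop is left as a placeholder:

use std::net::{SocketAddr, UdpSocket};

use socket2::{Domain, Protocol, Socket, Type};

// Bind one UDP socket per worker thread, all on the same address; SO_REUSEPORT
// (Linux/BSD) lets the kernel spread incoming datagrams across the sockets.
fn bind_shard(addr: SocketAddr) -> std::io::Result<UdpSocket> {
    let socket = Socket::new(Domain::IPV4, Type::DGRAM, Some(Protocol::UDP))?;
    socket.set_reuse_port(true)?; // not available on all platforms
    socket.bind(&addr.into())?;
    Ok(socket.into())
}

fn main() -> std::io::Result<()> {
    let addr: SocketAddr = "0.0.0.0:4433".parse().unwrap();
    let workers: Vec<_> = (0..4usize) // e.g. one shard per core
        .map(|id| -> std::io::Result<_> {
            let socket = bind_shard(addr)?;
            Ok(std::thread::spawn(move || drive_shard(id, socket)))
        })
        .collect::<std::io::Result<_>>()?;
    for worker in workers {
        worker.join().unwrap();
    }
    Ok(())
}

// Placeholder for a single-threaded (!Send) endpoint event loop owning this socket.
fn drive_shard(_id: usize, _socket: UdpSocket) {}

As discussed below, the hard part is routing: the kernel picks a socket by hashing the UDP 4-tuple, so a migrated connection's packets can land on the wrong shard.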
How much overhead results from the use of atomic operations and mutexes
Approximately none. CPU time is mostly spent in crypto, syscalls, and memory management, IIRC (regardless of this PR). If you want to profile and experiment yourself, that would be very welcome!
I wonder if a better approach would be to make everything !Send and !Sync and to use non-atomic operations
Very early versions of quinn took this approach. It didn't seem to help.
Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.
Users would be expected to shard their program across cores using SO_REUSEPORT or similar.
This is appealing at first, but unfortunately routing UDP datagrams to the correct thread is difficult. Further, any genuinely large-scale deployment of QUIC will have a load-balancing front-end that renders such measures redundant, so the set of users who might stand to benefit seems very small.
My understanding is that load balancers can only send the packet to the correct device, and something else has to get the packet to the correct thread. This can be done with hardware or software RSS, or with a custom eBPF program. One still needs to handle the occasional misrouted packet due to migration, but that is rare enough that it can be a slow path.
That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.
That's incorrect, or at best needlessly complicated. A device with multiple independent threads can be seamlessly handled by a load balancer by giving each thread its own address. Reusing the same port only adds complexity.
That assumes you have unlimited addresses, but I see your point. Also forwarding of misdirected packets needs to be addressed.
That assumes you have unlimited addresses
It assumes you have more addresses available than CPU cores. Recall that UDP ports alone trivially give you ~2^16 addresses.
Also forwarding of misdirected packets needs to be addressed.
A load balancer routing packets deliberately does not have that problem.
Also forwarding of misdirected packets needs to be addressed.
A load balancer routing packets deliberately does not have that problem.
Depends on whether the load balancer can use connection IDs, or is only able to use the 5-tuple. Lots of switches are only capable of the latter, and hardware load balancers are much more expensive.
I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.
But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might either route by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.
I don't think this discussion adds a lot to this PR, since it's completely orthogonal and might be better off in its own topic.
But yes, Demi is right that there can be different types of packet load balancers for horizontal scaling. They might either route by 5-tuple (in which case connection migration would not necessarily work, since packets will be misrouted after migration) or by connection ID.
Migration is sufficiently rare that one can explicitly send misrouted packets to the correct server.
This has been rebased, but the DelayQueue it introduces depends on tokio::time::Sleep, so additional work is required to be runtime-independent.
Would it be helpful to land some of the smaller refactoring commits already, to clarify this PR?
Sure, I can split those out tonight.
Hi, I gave your branch a try while doing some benchmarks of my own. I did not find any significant difference in the transfer speed of a 500 MiB file.
Code I used: https://github.com/WilliamVenner/quictransfer/blob/master/src/main.rs
Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.
Rebased and addressed outstanding feedback. Note outstanding report of a performance degradation at https://github.com/libp2p/rust-libp2p/pull/2801#issuecomment-1244567213, to be investigated.
This is about efficiency, not about performance/throughput - right? There's no loss in that setup? I think the evaluation for efficiency should be made on the benchmarks in the quinn repository. Everything else just adds code on top of it which makes it harder to understand what actually causes the degradation.
I understood that post to be reporting decreased throughput when CPU-bound: 2412 Mbps without this branch, and 1913.57 Mbps with it. Certainly a minimal standalone repro would be more actionable.
Maybe it's because they use async-std in the benchmark, which is multithreaded, and then crypto runs in a different thread than IO. In combination with them picking an order-of-magnitude-worse-performing cipher suite (ChaCha20) for an unknown reason, this makes the now-combined IO/crypto thread do more work.
When running this with uploads from multiple clients, I get a panic in the delay queue:
RUST_BACKTRACE=1 cargo run --release -- -c 24 --download-size 0 --upload-size 100M --stats
Finished release [optimized + debuginfo] target(s) in 1.55s
Running `/mnt/c/Users/matth/Code/rust/quinn/target/release/bulk -c 24 --download-size 0 --upload-size 100M --stats`
thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:84:21
stack backtrace:
0: std::panicking::begin_panic
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
1: quinn::delay_queue::DelayQueue<T>::poll
2: quinn::endpoint::EndpointInner::drive_timers
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:499:39
3: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:307:23
Debug build panics at a different place:
thread '<unnamed>' panicked at 'invalid key', quinn/src/delay_queue.rs:235:13
stack backtrace:
0: std::panicking::begin_panic
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:616:12
1: <slab::Slab<T> as core::ops::index::IndexMut<usize>>::index_mut
at /home/matthias/.cargo/registry/src/github.com-1ecc6299db9ec823/slab-0.4.7/src/lib.rs:1192:18
2: quinn::delay_queue::DelayQueue<T>::list_unlink
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:235:13
3: quinn::delay_queue::DelayQueue<T>::unlink
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:224:9
4: quinn::delay_queue::DelayQueue<T>::reset
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9
5: <quinn::endpoint::EndpointDriver as core::future::future::Future>::poll
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/endpoint.rs:331:29
and
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `Some(Timer(19))`,
right: `None`: head of list has no predecessor', quinn/src/delay_queue.rs:218:13
stack backtrace:
0: rust_begin_unwind
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
4: quinn::delay_queue::DelayQueue<T>::unlink
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:218:13
5: quinn::delay_queue::DelayQueue<T>::reset
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:189:9
and
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `None`,
right: `Some(Timer(9))`: successor links to head', quinn/src/delay_queue.rs:79:21
stack backtrace:
0: rust_begin_unwind
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
2: core::panicking::assert_failed_inner
3: core::panicking::assert_failed
at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
4: quinn::delay_queue::DelayQueue<T>::scan_bottom
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:79:21
5: quinn::delay_queue::DelayQueue<T>::poll
at /mnt/c/Users/matth/Code/rust/quinn/quinn/src/delay_queue.rs:61:34
6: quinn::endpoint::EndpointInner::drive_timers
Due to the panics on all multi-connection tests, I only got benchmark results for a single connection. Looks like an improvement however!
Throughput: 580 MB/s RTT: Client 180 us, Server 170 us CWND Server: 1103736476 (extremely ridiculous)
Throughput: 610MB/s RTT: Client 240us, Server 440us CWND Server: 51262370
So throughput improves 5%, but latency is 33%-160% worse?
I wouldn't read too heavily into μs variations in latency. That's probably either noise or inconsequential constant variation.
Good catch on the multi-connection failures, will investigate.
reset was corrupting the intrusive list. Fixed.
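For readers hitting similar issues, the failure mode is generic to slab-backed doubly-linked lists: an unlink must fix up the predecessor, the successor, and possibly the list head, and then clear the node's own links before the node can be relinked. A rough sketch of those invariants with hypothetical types (not the actual delay_queue.rs code):

use slab::Slab;

// Hypothetical slab-backed doubly-linked list illustrating the invariants the
// assertions above check; not quinn's actual delay_queue.rs.
struct List {
    head: Option<usize>,
    nodes: Slab<Node>,
}

struct Node {
    prev: Option<usize>,
    next: Option<usize>,
}

impl List {
    fn unlink(&mut self, key: usize) {
        // Indexing a Slab with a stale key panics with "invalid key", as in the
        // first backtrace above.
        let (prev, next) = (self.nodes[key].prev, self.nodes[key].next);
        match prev {
            // Interior node: the predecessor must now point past us.
            Some(p) => self.nodes[p].next = next,
            // No predecessor: we must be the head, and the head must be updated,
            // or the list is left pointing at an unlinked (soon reused) node.
            None => {
                debug_assert_eq!(self.head, Some(key), "node with no predecessor must be the list head");
                self.head = next;
            }
        }
        if let Some(n) = next {
            self.nodes[n].prev = prev;
        }
        // Clear our own links so a later relink (e.g. a timer reset) starts clean.
        self.nodes[key].prev = None;
        self.nodes[key].next = None;
    }
}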
I wouldn't read too heavily into μs variations in latency.
Agreed. It seems to switch between runs in the 100-500us range, and that's all fine. The multi-millisecond delays that occur in other tests are the more concerning ones, since they are symptoms of excessive buffering.
I tried again now with multiple connections, and it seems to work fine. Did a 24-connection (1 connection per core) upload and download. Results are pretty much equivalent between main and unify-drivers, which are both around:
Throughput: Each Stream 13.37 MB/s. Total: 316 MB/s RTT: Client 72ms, Server 60ms CWND Server: 50651475
Throughput: Each Stream 10.10 MB/s. Total: 242MB/s RTT: Client: 390us, Server 3.3ms CWND Client: 304000
Based on that I think merging is fine. And as a follow-up it could be investigated how to avoid the large RTT under heavy server load (I assume it's due to the connections all producing packets for the endpoint, and that one not being able to keep up with sending).
Note that the bulk benchmark uses a single thread for the server. If I use a multithreaded runtime for the server (but retain one thread per client) and use cores / 2 clients (so one core is available for each half of a connection), I see a ~15% speedup on main (presumably from parallel crypto), and a ~15% slowdown on this branch (presumably from inter-thread comms overhead and lock contention).
On the other hand, in a hypothetical configuration with a totally discrete server endpoint per core, we'd almost certainly see a linear speedup, on top of the baseline efficiency and consistency improvements on this branch. I think that's the case that makes sense to optimize for: it scales better for large-scale use, which must distribute load across machines anyway, and for very small-scale use like game servers, where most CPU is being spent elsewhere. But it's not a 100% clean win, especially for naive/mid-scale users that won't have load balancing as a matter of course.
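To make the "discrete server endpoint per core" configuration concrete, here is a sketch assuming a quinn 0.9-style API (Endpoint::server / Endpoint::accept); make_server_config and handle_connection are placeholders, and each shard listens on its own port:

use std::net::SocketAddr;

fn run_sharded_server() {
    let mut workers = Vec::new();
    for core in 0u16..4 {
        let config = make_server_config();
        workers.push(std::thread::spawn(move || {
            // One single-threaded runtime per core: endpoint and connection state
            // never cross threads, so there is nothing to lock or contend on.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .unwrap();
            rt.block_on(async move {
                // Each shard gets its own port; a front-end load balancer (or DNS)
                // spreads clients across them.
                let addr = SocketAddr::from(([0, 0, 0, 0], 4433 + core));
                let endpoint = quinn::Endpoint::server(config, addr).unwrap();
                while let Some(connecting) = endpoint.accept().await {
                    tokio::spawn(handle_connection(connecting));
                }
            });
        }));
    }
    for worker in workers {
        worker.join().unwrap();
    }
}

// Placeholders: certificate/transport setup and per-connection logic omitted.
fn make_server_config() -> quinn::ServerConfig {
    unimplemented!()
}

async fn handle_connection(_connecting: quinn::Connecting) {}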
If you discount delay_queue.rs (which I think is mostly fair, since it's strongly encapsulated), this is net -119 LoC.
Since it's neat, might be useful in other projects, and much easier to benchmark this way, I've split delay_queue.rs from this out into a separate crate, timer-queue. Should we depend on that rather than having a separate copy? If so, should we move maintenance of it under quinn-rs?
Rebased; friendly bump.
Dropped a superfluous refactoring commit.
Sorry for the slow response to this -- prioritizing soaking up rustls review bandwidth while it's available.
prioritizing soaking up rustls review bandwidth while it's available.
Makes sense; that's definitely a precious resource! I don't think this is blocking anything immediate; @rrichardson's interest in horizontal scaling as discussed in #914 can be pursued independently at the proto layer.
👋 rust-libp2p maintainer here,
Is there any way I could help move this pull request forward?
Ways that come to my mind:
My motivation:
- We use quinn-proto in rust-libp2p.
- We do not use quinn directly, but instead our own wrapper around quinn-proto, due to quinn using unbounded channels between the Endpoint and the Connection. See https://github.com/quinn-rs/quinn/issues/1353.
- We would like to benefit from upstream (quinn) improvements and fixes.
- We would like to use quinn directly.

On a general note, thank you for all your work! With the release of QUIC support in rust-libp2p via quinn-proto:
- We are seeing significant connection establishment improvements in the 95th percentile compared to TCP. See https://kademlia-exporter.max-inden.de/d/Pfr0Fj6Mk/rust-libp2p for more numbers.
- We can leverage stream creation backpressure at the protocol level.
- We see significant improvements to our hole punching success rate.
- ...
Thanks for sharing those numbers, that's really cool! I think knowing there's demand is helpful already, but thoughtful review of the core 294e5d7 commit that I think @djc is mainly hung up on certainly wouldn't hurt.
@mxinden @Ralith This does not remove all unbounded channels. What would it take to remove the rest of them?
Is there any evidence that this particular usage of unbounded channels is bad? If you are so worried about this, feel free to submit a PR and we'll consider it.
The remaining "unbounded" channel introduced in this PR is, in fact, bounded by the number of concurrently active connections. It could be replaced with various other sorts of queue to make that more obvious (an intrusive list might be cool), but the behavior would be the same.
@mxinden @Ralith This does not remove all unbounded channels. What would it take to remove the rest of them?
@DemiMarie are you referring to the unbounded channel used by a connection to wake up the endpoint?
/// `handle` is sent here when `wake` is called, prompting the endpoint to drive the connection
dirty: mpsc::UnboundedSender<ConnectionHandle>,
I don't think this poses any issues. As @Ralith says above, they are bounded by the number of connections. I consider the usage a neat pattern of many-to-wake-one, transporting the id along the way.
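As a standalone illustration of that many-to-wake-one pattern (hypothetical types, not quinn's actual code): each connection holds a clone of one sender and pushes its own handle into it whenever it has work, and the single endpoint task drains the channel and drives exactly those connections.

use futures::channel::mpsc;
use futures::StreamExt;

// Stand-in for quinn's ConnectionHandle: just an index identifying a connection.
#[derive(Clone, Copy, Debug)]
struct ConnectionHandle(usize);

struct Connection {
    handle: ConnectionHandle,
    // Every connection holds a clone of the same sender.
    dirty: mpsc::UnboundedSender<ConnectionHandle>,
}

impl Connection {
    // Called whenever this connection has produced work for the endpoint
    // (e.g. datagrams to transmit): mark ourselves dirty by sending our id.
    fn wake(&self) {
        let _ = self.dirty.unbounded_send(self.handle);
    }
}

// The single endpoint task: wake up once per marked connection and drive it.
// The discussion above notes that in quinn this queue is effectively bounded
// by the number of active connections.
async fn endpoint_driver(mut dirty: mpsc::UnboundedReceiver<ConnectionHandle>) {
    while let Some(handle) = dirty.next().await {
        drive_connection(handle); // placeholder for the real per-connection work
    }
}

fn drive_connection(_handle: ConnectionHandle) {}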
Is there any evidence that this particular usage of unbounded channels is bad?
Assuming I am not missing something and we are indeed talking about the dirty unbounded channel, I don't see a downside to the usage.
In case one would want to ban the usage of unbounded channels across the codebase, one could also use a bounded channel. Given that the sender is cloned for each connection, there is a guaranteed slot per sender, thus always capacity to send, and no .await needed. (Not advocating for it.)
The channel's capacity is equal to buffer + num-senders. (https://docs.rs/futures/latest/futures/channel/mpsc/fn.channel.html)
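A small demonstration of that guaranteed-slot behaviour with futures::channel::mpsc (illustrative only):

use futures::channel::mpsc;

fn main() {
    // Capacity is `buffer + num_senders`: even with buffer = 0, every cloned
    // sender gets one reserved slot, so one try_send per sender cannot fail
    // with "channel full" before the receiver has run at all.
    let (tx, mut rx) = mpsc::channel::<usize>(0);
    let mut senders: Vec<_> = (0..8).map(|_| tx.clone()).collect();
    drop(tx);

    for (id, sender) in senders.iter_mut().enumerate() {
        sender.try_send(id).expect("each sender has a reserved slot");
    }
    // A second try_send from the same sender, before the receiver drains the
    // queue, would report the channel as full.

    assert_eq!(rx.try_next().unwrap(), Some(0));
}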
In case one would want to ban the usage of unbounded channels across the codebase
FWIW, clippy can help with this using its disallowed_methods lint:
disallowed-methods = [
{ path = "futures::channel::mpsc::unbounded", reason = "does not enforce backpressure" },
]
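That configuration goes in a clippy.toml at the workspace root; if the lint isn't already on in a given setup, it can be enabled explicitly, e.g.:

// in the crate root (lib.rs / main.rs), or pass `-W clippy::disallowed_methods` to cargo clippy
#![warn(clippy::disallowed_methods)]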
@mxinden thanks for the careful review!
Tweaked per @mxinden's feedback (thanks!) and rebased.
Hey 👋
We're using quinn in iroh. I've run some quick benchmarks with a branch propped up to use this PR. I've noticed a significant drop in throughput (40-50%) when serving data from one node to 2 or more clients. One to one transfers don't seem significantly impacted yet (most probably because other parts of our code don't saturate the pipe).
From my observations with the PR, the CPU on my test rig seems to peg at around 220%, i.e. 2.2 cores, and more clients just keep diluting the throughput between them. The same test running against 0.9.3 seems to be able to utilize as much CPU as it asks for, scaling ~linearly with clients; I'm seeing some 700-900% in the 1-to-10 client test cases.
Didn't dive too deep into it yet, but my initial guess would be that we have either some mutex locking or polling happening that makes some run loop starve itself.
Let me know if there's anything specific you would like me to poke at or test out.
Interesting, thank you for the report! We should try to understand that, and hopefully mitigate, before merging this. I do think this is an important foundation for an ultimately much more parallel quinn, but undermining our present scalability for a hypothetical future isn't ideal.
Closing as this is quite stale and understood to be a suboptimal direction due to reduced opportunity for work stealing and parallelism.
Fixes https://github.com/quinn-rs/quinn/issues/914. This is a high-risk change, so we probably don't want it in 0.8, but I'm getting a head start on it all the same.