quinn-rs / quinn

Async-friendly QUIC implementation in Rust
Apache License 2.0
3.57k stars 364 forks source link

Endpoint stats interface #1900

Open ryleung-solana opened 1 week ago

ryleung-solana commented 1 week ago

In Anza, we need to get more insight into the resources being used by quinn in servicing quic handshakes. This PR seeks to add an initial interface to the Endpoint struct to allow for getting more detailed stats than just the open connections, and some initial data relevant to the handshakes. On a side note to the Quinn devs, if there is a good way to support getting the current number of open handshakes, we would be interested in know it. Currently, it seems that the Endpoint is not signaled when a Connecting future finishes, and to add such an event would require some refactoring, not to mention handling an extra event, locking, etc.

ryleung-solana commented 1 week ago

@Ralith @djc one thing that we're particularly interested in is related to the resource usage from crypto, especially in accept() because we don't currently reject any incoming connection requests. We merely accept all incoming connections and then immediately drop some Connecting future based on our own logic.

djc commented 1 week ago

@Ralith @djc one thing that we're particularly interested in is related to the resource usage from crypto, especially in accept() because we don't currently reject any incoming connection requests.

Why not? I think we had some Solana people look at that as an option for this exact reason.

ryleung-solana commented 1 week ago

@Ralith @djc one thing that we're particularly interested in is related to the resource usage from crypto, especially in accept() because we don't currently reject any incoming connection requests.

Why not? I think we had some Solana people look at that as an option for this exact reason.

Mostly because we currently only do some basic rate-limiting and nothing more. It's probably a good idea to look at using reject when over the rate limit instead of accepting and immediately dropping :sweat: but the thing is that we currently don't really have a great idea of what the limit should be, since we don't have insight into the resource usage from handshakes, especially crypto (but we think it's significant).

ryleung-solana commented 1 week ago

Would you guys have any ideas of how we might be able to get some better numbers on e.g. the mem usage from crypto in accept() before the future is returned?

Ralith commented 1 week ago

Can you elaborate on "mem usage from crypto"? I don't expect cryptographic operations to consume significant memory, or really anything but O(input). Usually the concern with handshakes is CPU usage, which could in principle be measured with some sort of high resolution timer around accept and, most importantly, the poll calls on the internal connection driver future.

ryleung-solana commented 2 days ago

Can you elaborate on "mem usage from crypto"? I don't expect cryptographic operations to consume significant memory, or really anything but O(input). Usually the concern with handshakes is CPU usage, which could in principle be measured with some sort of high resolution timer around accept and, most importantly, the poll calls on the internal connection driver future.

I think that's (CPU usage) also something to look at as well. We're looking specifically at x509 verification, as it happens before we get the future returned to us when we accept() making it a prime target for rate limiting given some data. It would also allow us to determine once and for all, whether crypto does indeed take a significant amount of memory under attack (e.g. getting spammed with lots of handshake attempts, or some variation of this), which we suspect from a recent security advisory, but don't know for sure.

djc commented 2 days ago

In general I think a quinn_proto::Connection takes a decent amount of memory, but only a small part of that is due to the crypto involved. To the extent that you can, it's probably better to reject connections based on the Incoming data.

Ralith commented 1 day ago

We're looking specifically at x509 verification, as it happens before we get the future returned to us when we accept() making it a prime target for rate limiting given some data.

This is no longer the case as of 0.11. No CPU-intensive work is performed until you accept a connection.