Closed: gretchenfrage closed this 1 month ago
This was very likely to be an issue in practice because 0-RTT packets are likely to be sent, and hence received, immediately after the first Initial, and might therefore routinely be processed by the `Endpoint` before the application has even had a chance to see the `Incoming`.
Indeed, I discovered this not by thinking about the code really hard but through empirical testing. I was a bit embarrassed when I realized that the problem was caused by me; I had initially thought it was pre-existing.
How do we limit the number of these? Buffer size limits will not be an effective defense if the number of buffers is unbounded.
Good catch. And solving this is kind of tricky. But here's an approach I've implemented now that I think works, let me know what you think:
Firstly, we can now move the `MAX_INCOMING_CONNECTIONS` check from quinn to proto. (Also, rename it to just `MAX_INCOMING`, which we sort of forgot to do in the original PR.)
However, that is not sufficient to prevent memory exhaustion via filling many incomings with buffered early data, because these limits multiply out to too high a number. With the default transport config:
(100 max concurrent bidi streams + 100 max concurrent uni streams) × 1.25 MB stream receive window = 250 MB
250 MB × 2^16 max incoming ≈ 16.4 TB
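That multiplication can be sanity-checked with a quick sketch. The constants are the default figures quoted above; the names are illustrative and do not correspond to quinn's actual API:

```rust
// Worst-case bytes buffered if every limit is fully exploited.
// Constants are the defaults quoted in the discussion; names are
// illustrative only, not quinn's API.
const MAX_CONCURRENT_BIDI_STREAMS: u64 = 100;
const MAX_CONCURRENT_UNI_STREAMS: u64 = 100;
const STREAM_RECEIVE_WINDOW: u64 = 1_250_000; // 1.25 MB
const MAX_INCOMING: u64 = 1 << 16; // 2^16

fn worst_case_buffered_bytes() -> u64 {
    // 200 streams × 1.25 MB = 250 MB buffered per incoming
    let per_incoming =
        (MAX_CONCURRENT_BIDI_STREAMS + MAX_CONCURRENT_UNI_STREAMS) * STREAM_RECEIVE_WINDOW;
    // × 2^16 incomings ≈ 16.4 TB
    per_incoming * MAX_INCOMING
}
```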
Which I'm pretty sure is a lot of RAM. So to avoid that, we implement another limit, `MAX_ALL_INCOMING_BUFFERED`, set to 100 MiB, which caps the total number of bytes buffered across all incomings within the endpoint. If either the per-incoming limit or the `MAX_ALL_INCOMING_BUFFERED` limit is exceeded, the packet is dropped.
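As a rough sketch of how the two limits combine when an early datagram arrives for an unaccepted incoming, something like the following check would apply (hypothetical names and values, not quinn's actual code):

```rust
// Illustrative sketch of applying both limits to an early datagram.
// The limit values and names are hypothetical, not quinn's code.
struct BufferState {
    per_incoming_bytes: u64, // bytes buffered for this one incoming
    all_incoming_bytes: u64, // bytes buffered across all incomings
}

const MAX_BUFFER_BYTES_PER_INCOMING: u64 = 250_000_000; // hypothetical per-incoming cap
const MAX_ALL_INCOMING_BUFFERED: u64 = 100 * 1024 * 1024; // 100 MiB

/// Returns true if the datagram may be buffered; false means drop it.
fn try_buffer(state: &mut BufferState, datagram_len: u64) -> bool {
    if state.per_incoming_bytes + datagram_len > MAX_BUFFER_BYTES_PER_INCOMING
        || state.all_incoming_bytes + datagram_len > MAX_ALL_INCOMING_BUFFERED
    {
        return false; // exceeding either limit drops the packet
    }
    state.per_incoming_bytes += datagram_len;
    state.all_incoming_bytes += datagram_len;
    true
}
```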
So to summarize:

- `MAX_INCOMING` = 2^16; if exceeded, the endpoint automatically responds with a refuse packet.
- Per-incoming buffer limit = min(receive_window, (max_concurrent_bidi_streams + max_concurrent_uni_streams) × stream_receive_window); if exceeded, the endpoint drops the packet.
- `MAX_ALL_INCOMING_BUFFERED` = 100 MiB; if exceeded, the endpoint drops the packet.

I'm not sure how to debug the FreeBSD CI failure; I don't see logs for it.
That's weird, normally it logs just fine. GitHub infra flake, perhaps? I've restarted it manually.
Thanks. It passed this time, so I guess it was just a spurious CI failure.
Should we expose the new limits on `ServerConfig`? 100 MiB makes sense for a lot of applications, but someone might want to use Quinn in a memory-constrained environment (e.g. a router).
Good idea. Now that this is all happening in proto, there's no friction to us doing that. Added three new settings to `ServerConfig`: `max_incoming`, `max_buffer_bytes_per_incoming`, and `max_buffer_bytes_all_incoming`.
What if a connection is configured for use with application datagrams (or some other future messaging primitive) only, and stream limits are hence set to zero? Might be simplest to only use a dedicated limit.
I made it so the default `ServerConfig` constructor calculates `max_buffer_bytes_per_incoming` the same way it was previously calculated there, although I did switch it to hard-coding the overhead at 1200 bytes. Let me know if you'd rather I not do it that way, but it seems more straightforward than the trick with the comparison. In any case, if a user configures the stream limits to zero but still wants 0-RTT datagrams in excess of 1200 bytes to work better than they otherwise would, they can manually set `max_buffer_bytes_per_incoming` to some appropriately high number. Let me know if you think this is a pitfall sufficiently likely to be worth documenting.
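For illustration, the default calculation described might look something like the following; the formula and the 1200-byte overhead constant are reconstructed from this discussion and may not match quinn's code exactly:

```rust
// Hypothetical reconstruction of the default per-incoming buffer limit,
// combining the min() formula from earlier in the thread with the
// hard-coded 1200-byte overhead described above.
const OVERHEAD: u64 = 1200;

fn default_max_buffer_bytes_per_incoming(
    receive_window: u64,
    max_concurrent_bidi_streams: u64,
    max_concurrent_uni_streams: u64,
    stream_receive_window: u64,
) -> u64 {
    let stream_capacity =
        (max_concurrent_bidi_streams + max_concurrent_uni_streams) * stream_receive_window;
    stream_capacity.min(receive_window) + OVERHEAD
}
```

Note that with stream limits set to zero this yields just the 1200-byte overhead, which matches the remark above about 0-RTT datagrams in excess of 1200 bytes.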
It's confusing that this clause checks after adding the next datagram whereas the first clause checks before.
Rounding up to the next packet isn't necessarily about allowing for overhead (which might vary dramatically according to sender discretion), just erring on the side of being permissive.
If we use < here, then 0 has a more intuitive effect.
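To make that point concrete, a strict comparison against the limit before buffering gives a configured limit of 0 the intuitive meaning of "never buffer anything" (hypothetical sketch, since the code under review isn't shown here):

```rust
// With `<`, a configured limit of 0 means no datagram is ever buffered;
// with `<=`, a limit of 0 would still admit a zero-length edge case.
fn may_buffer(bytes_buffered: u64, limit: u64) -> bool {
    bytes_buffered < limit
}
```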
I believe these should all have been fixed in this change.
Thanks for the feedback. My last round of changes was a bit sloppier than I would've liked, especially my missing the `+=` bug. I think the feedback should be addressed now, though.
Switched to `checked_add` in a way I think is a lot cleaner.

This all seems like substantial complexity. How do we feel about alternative solutions where we force callers to address one `Incoming` at a time? Do we really need the flexibility of keeping multiple incoming packets in flight at the same time? It feels like a larger change that is sort of flying in here under the radar as a very flexible solution to a problem (the race condition identified in #1820), with the flexibility also causing an increase in fragility.
It's a worthwhile question.
Merely limiting the endpoint to a single `Incoming` at a time would not let us fully avoid buffering early datagrams not associated with a connection handle. It would allow us to indiscriminately put all such early datagrams into a single buffer rather than into per-`Incoming` buffers, but that doesn't seem like much of a simplification.
To avoid having to buffer early datagrams not associated with a connection handle, we would need to be able to make a decision whether to accept/refuse/retry/ignore an incoming connection attempt immediately and synchronously when the endpoint receives the connection-creating packet, before it tries to process any further packets (as the decision affects which subsequently received packets get routed to a connection and which get discarded).
This could be achieved by removing the `Incoming` API as it exists now and instead putting some sort of `incoming_logic: Box<dyn FnMut(&IncomingConnectionInfo) -> IncomingConnectionDecision>` field in the server config, which the endpoint can call synchronously. I don't think we should do that.
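For concreteness, that rejected callback-style alternative might look roughly like this; all types here are hypothetical sketches, not quinn's API:

```rust
use std::net::SocketAddr;

// Hypothetical info handed to the callback for each connection attempt.
struct IncomingConnectionInfo {
    remote_address: SocketAddr,
}

#[derive(Debug, PartialEq)]
enum IncomingConnectionDecision {
    Accept,
    Refuse,
    Retry,
    Ignore,
}

// Sketch of a server config carrying the synchronous decision callback;
// the endpoint would invoke it before processing any further packets.
struct ServerConfigSketch {
    incoming_logic: Box<dyn FnMut(&IncomingConnectionInfo) -> IncomingConnectionDecision>,
}
```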
One example of a situation where allowing multiple `Incoming` to be in motion at the same time would be useful is an IP block list (or allow list) that's stored in a database but has an in-memory Bloom filter to accelerate it:
```rust
while let Some(incoming) = endpoint.accept().await {
    let ip = incoming.remote_address().ip();
    if !block_list_bloom_filter.maybe_contains(ip) {
        task::spawn(async move {
            handle_incoming(incoming.accept().expect("TODO put real error handling here")).await;
        });
    } else {
        task::spawn(async move {
            if !block_list_database.contains(ip).await {
                handle_incoming(incoming.accept().expect("TODO put real error handling here")).await;
            }
        });
    }
}
```
This would be a situation where allowing multiple `Incoming` in motion simultaneously, and even allowing the application to make decisions on them in a different order than they were produced, could improve performance and/or attack-mitigation effectiveness.
> To avoid having to buffer early datagrams not associated with a connection handle, we would need to be able to make a decision whether to accept/refuse/retry/ignore an incoming connection attempt immediately and synchronously when the endpoint receives the connection-creating packet and before it tries to process any further packets (as the decision affects which subsequently received packets get routed to a connection and which get discarded).
I was thinking we might have `accept()` take ownership of the `Endpoint`, storing the `Endpoint` in `Incoming`, and "giving it back" after accept/reject/ignore. This would still allow the caller to handle incoming connections asynchronously, but not in parallel (thus avoiding buffering issues). It's less flexible but simpler in the end.
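A rough sketch of the shape of that ownership-passing API; the types are hypothetical, and it is written synchronously here for brevity where the real thing would be async:

```rust
struct Endpoint {
    connections_handled: u32,
}

// The Incoming would own the Endpoint until a decision is made,
// making it impossible to accept a second connection in parallel.
struct OwnedIncoming {
    endpoint: Endpoint,
}

impl Endpoint {
    // In the real proposal this would be `async fn accept(self) -> Option<OwnedIncoming>`.
    fn accept(self) -> OwnedIncoming {
        OwnedIncoming { endpoint: self }
    }
}

impl OwnedIncoming {
    // Each decision consumes the Incoming and gives the Endpoint back.
    fn accept_connection(mut self) -> Endpoint {
        self.endpoint.connections_handled += 1;
        self.endpoint
    }

    fn ignore(self) -> Endpoint {
        self.endpoint
    }
}
```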
Let's see what @Ralith thinks.
> have `accept()` take ownership of the `Endpoint`
This would complicate the async layer considerably. We would need to move the `proto::Endpoint` out of the `quinn::endpoint::State` each time a connection is incoming, then restore it and wake the endpoint driver after a decision is made. It's also difficult to allow such an `accept` to be used concurrently with any other endpoint access (e.g. `connect`, `close`, assorted getters). Because typical servers will always be waiting on `accept`, surfacing such a limitation in the public API is likely to degrade usability.
More importantly, it wouldn't solve anything: between GRO and recvmmsg, we may receive many datagrams instantaneously. One batch might include both the first datagram for an incoming connection and follow-up Initial or 0-RTT packets for that same connection. These must be either buffered or (as in the status quo) dropped, and it's most convenient to do so at the proto layer, where we at least have the option to correlate them, discard non-QUIC datagrams, and respond directly for stateless cases.
Finally, if we could suspend receipt of datagrams immediately after receiving a connection's first datagram, that would be at odds with future work to parallelize datagram receipt and other endpoint driver work, which is our major remaining milestone for intra-endpoint scaling.
In sum, the current form of this PR is strongly motivated.
Comments tweaked. Tests added. As suggested, the new tests are based on the `zero_rtt_happypath` test and validate the number of dropped packets. Added to `TestEndpoint`/`IncomingConnectionBehavior` the ability to have `Incoming` be pushed to a `waiting_incoming` vec to be dealt with later rather than immediately.
Thanks for all the effort here!
Closes #1820
The fix:
Additional considerations:
To help protect against memory exhaustion attacks, per-`Incoming` buffered data is limited to twice the receive window or 10 KB, whichever is larger. Excess packets are silently dropped.
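Under that rule, the limit works out as follows (illustrative sketch; 10 KB taken as 10,000 bytes):

```rust
// Per-Incoming buffer limit: twice the receive window or 10 KB,
// whichever is larger. Illustrative; not quinn's actual code.
fn per_incoming_buffer_limit(receive_window: u64) -> u64 {
    (2 * receive_window).max(10_000)
}
```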
Is this a severe problem? Here are two reasons I don't think so:
Could this be avoided? Possibly, by adding enough state to the buffering logic to validate whether these packets are validly encrypted for the associated connection. However, that may risk making these operations costly enough that they start to defeat the DDoS-resistance abilities of the `Incoming` API.