Fix race condition between 0-RTT and Incoming

gretchenfrage commented 2 months ago

Closes #1820

The fix:

Endpoint now maintains a slab with an entry for each pending Incoming to buffer received data.
ConnectionIndex now maps initial DCID to that slab key immediately upon construction of Incoming.
If Incoming is accepted, association is overridden with association with ConnectionHandle, and all buffered datagrams are fed to newly constructed Connection.
If Incoming is refused/retried/ignored, or accepting errors, association and slab entry are cleaned up to prevent memory leak.

Additional considerations:

The Incoming::ignore operation can no longer be implemented as just dropping it. To help prevent incorrect API usage, proto::Incoming is modified to log a warning if it is dropped without being passed to Endpoint::accept/refuse/retry/ignore.
To help protect against memory exhaustion attacks, per-Incoming buffered data is limited to twice the receive window or 10 KB, which- ever is larger. Excessive packets silently dropped.
- Does this introduce a new vulnerability to an attack in which an attacker could spam a server with 0-RTT packets with the same connection ID as it observed a client attempting to initiate a 0-RTT connection to the server? I do think so.
Is this a severe problem? Here's two reasons I don't think so:
1. The default receive window is set to max value, so this won't actually kick in unless the user is already hardening against adverse conditions. 2. It is already possible for an on-path attacker to distrupt a connection handshake if 0.5-RTT data is being used, so this probably doesn't actually expand the set of situations in which it's vulnerable to this kind of vulnerability.
Could this be avoided? Possibly by introducing additional state to the buffering state to validate whether these packets are validly encrypted for the associated connection? However, that may risk making these operations costly enough that they start to defeat the DDOS-resistance abilities of the Incoming API.

gretchenfrage commented 2 months ago

This was very likely to be an issue in practice because 0-RTT packets are likely to sent, and hence received, immediately after the first Initial, and might hence routinely be processed by the Endpoint before the application has even had a chance to see the Incoming.

Indeed, I discovered this not through thinking about the code really hard but through empirical testing. I was a bit embarrassed when I realized that the problem was caused by me, I initially thought it was pre-existing.

fd22d144b8b65d9b

How do we limit the number of these? Buffer size limits will not be an effective defense if the number of buffers is unbounded.

Good catch. And solving this is kind of tricky. But here's an approach I've implemented now that I think works, let me know what you think:

Firstly, we can now move the MAX_INCOMING_CONNECTIONS check from quinn to proto. (Also, rename it to just MAX_INCOMING, which we sort of forgot to do in the original PR).

However, that is not sufficient to prevent memory exhaustion via filling many incoming with buffered early data, because these limits multiply to too high of a number. With the default transport config:

(100 max concurrent bidi streams + 100 max concurrent uni streams) × 1.25 MB stream receive window = 250 MB

× 2^16 max incoming = 1.64 TB

Which I'm pretty sure is a lot of RAM. So to avoid that, we implement another MAX_ALL_INCOMING_BUFFERED limit set to 100 MiB which is a limit to the total number of bytes buffered for incomings across all incoming within the connection. If either the per-incoming limit or the MAX_ALL_INCOMING_BUFFERED limit are exceeded, the packet is dropped.

So to summarize:

Upon receipt of an incoming-creating packet, if the number of pending incoming exceeds MAX_INCOMING = 1^16, endpoint automatically responds with refuse packet.
Upon receipt of an incoming-associated packet, if the number of buffered packet bytes for that incoming exceeds the per-incoming limit calculated as min(receive_window, (max_concurrent_bidi_streams + max_concurrent_uni_streams) * stream_receive_window), endpoint drops the packet.
Upon receipt of an incoming-associated packet, if the number of buffered packet bytes for all incoming collectively exceeds MAX_ALL_INCOMING_BUFFERED = 100 MiB, endpoint drops the packet.

gretchenfrage commented 2 months ago

I'm not sure how to debug the freebsd CI failure, I don't see logs for it

Ralith commented 2 months ago

That's weird, normally it logs just fine. GitHub infra flake, perhaps? I've restarted it manually.

gretchenfrage commented 2 months ago

Thanks. It passed this time, so I guess it was just a spurious CI failure.

gretchenfrage commented 2 months ago

Should we expose the new limits on ServerConfig? 100MiB makes sense for a lot of applications, but someone might want to use Quinn in a memory-constrained environment (e.g. a router).

Good idea. Now that this is all happening in proto, there's no friction to us doing that. Added three new settings to ServerConfig: max_incoming, max_buffer_bytes_per_incoming, and max_buffer_bytes_all_incoming.

What if a connection is configured for use with application datagrams (or some other future messaging primitive) only, and stream limits are hence set to zero? Might be simplest to only use a dedicated limit.

I made it so the default ServerConfig constructor calculates max_buffer_bytes_per_incoming as was previously calculated there. Although I did switch it to just hard-code the overhead to 1200 bytes--let me know if you want me to not do it like that, but it seems more straightforward than the trick with comparison. Anyways, if a user configures the stream limits to zero, but still wants 0-RTT datagrams in excess of 1200 bytes to work more-so than they otherwise would, they can manually set max_buffer_bytes_per_incoming to some appropriately high number. Let me know if you think this is a pitfall sufficiently likely to be worth documenting.

It's confusing that this clause checks after adding the next datagram whereas the first clause checks before.

Rounding up to the next packet isn't necessarily about allowing for overhead (which might vary dramatically according to sender discretion), just erring on the side of being permissive.

If we use < here, then 0 has a more intuitive effect.

I believe these should all have been fixed in this change.

gretchenfrage commented 2 months ago

Thanks for the feedback. My last round of changes was a bit sloppier than I would've liked it to be, especially with me missing the += bug. I think the feedback should be addressed now though.

Reworked the limit checking logic to be based on checked_add in a way I think is a lot cleaner.
Largely rewrote the added config docs.
Made other more minor changes.

gretchenfrage commented 2 months ago

This all seems like substantial complexity. How do we feel about alternative solutions where we force callers to address one Incoming at a time? Do we really need the flexibility of keeping multiple incoming packets in flight at the same time? It feels like a larger change that is sort of flying in here under the radar as a very flexible solution to a problem (the race condition identified in #1820), with the flexibility also causing an increase in fragility.

It's a worthwhile question.

Merely limiting there to being only one Incoming at a time would not let us fully avoid buffering early datagrams not associated with a connection handle. It would allow us to indiscriminately put all early datagrams not associated with a connection handle into a single buffer rather than in per-Incoming buffers, but that doesn't seem like much of a simplification.

To avoid having to buffer early datagrams not associated with a connection handle, we would need to be able make a decision whether to accept/refuse/retry/ignore an incoming connection attempt immediately and synchronously when the endpoint receives the connection-creating packet and before it tries to process any further packets (as the decision affects which subsequently received packets get routed to a connection and which get discarded).

This could be achieved by removing the Incoming API as it exists now and instead just putting some sort of incoming_logic: Box<dyn FnMut(&IncomingConnectionInfo) -> IncomingConnectionDecision> field in the server config which the endpoint can call synchronously. I don't think we should do that.

One example of a situation where allowing multiple Incoming to be in motion at the same time would be useful is if there is an IP block list (or allow list) that's stored in a database but has an in-memory Bloom filter to accelerate it:

while let Some(incoming) = endpoint.accept().await {
    let ip = incoming.remote_address().ip();
    if !block_list_bloom_filter.maybe_contains(ip) {
        task::spawn(async move {
            handle_incoming(incoming.accept().expect("TODO put real error handling here")).await;
        });
    } else {
        task::spawn(async move {
            if !block_list_database.contains(ip).await {
                handle_incoming(incoming.accept().expect("TODO put real error handling here")).await;
            }
        });
    }
}

This would be a situation where allowing multiple Incoming in motion simultaneously, and even allowing the application to make decisions on them in a different order than they were produced, could improve performance and/or attack mitigation effectiveness.

djc commented 2 months ago

To avoid having to buffer early datagrams not associated with a connection handle, we would need to be able make a decision whether to accept/refuse/retry/ignore an incoming connection attempt immediately and synchronously when the endpoint receives the connection-creating packet and before it tries to process any further packets (as the decision affects which subsequently received packets get routed to a connection and which get discarded).

I was thinking we might have accept() take ownership of the Endpoint, storing the Endpoint in Incoming, and "giving it back" after accept/reject/ignore. This would still allow the caller to handle incoming connections asynchronously, but not in parallel (thus avoiding buffering issues). It's less flexible but simpler in the end.

Let's see what @Ralith thinks.

Ralith commented 2 months ago

have accept() take ownership of the Endpoint

This would complicate the async layer considerably. We would need to move the proto::Endpoint out of the quinn::endpoint::State each time a connection is incoming, then restore it and wake the endpoint driver after a decision is made. It's also difficult to allow such an accept to be used concurrently with any other endpoint access (e.g. connect, close, assorted getters). Because typical servers will always be waiting on accept, surfacing such a limitation to the public API is likely to degrade usability.

More importantly, it wouldn't solve anything: between GRO and recvmmsg, we may receive many datagrams instantaneously. One batch might include both the first datagram for an incoming connection and follow-up initial or 0-RTT packets for that same connection. These must be either buffered or (as in the status quo) dropped, and it's most convenient to do so at the proto layer, where we at least have the option to correlate them, discard non-QUIC datagrams, and respond directly for stateless cases.

Finally, if we could suspend receipt of datagrams immediately after receiving a connection's first datagram, that would be at odds with future work to parallelize datagram receipt and other endpoint driver work, which is our major remaining milestone for intra-endpoint scaling.

In sum, the current form of this PR is strongly motivated.

gretchenfrage commented 1 month ago

Comments tweaked. Tests added. As suggested, new tests are based on zero_rtt_happypath test and validate the number of dropped packets. Added to TestEndpoint / IncomingConnectionBehavior the ability to have Incoming be pushed to a waiting_incoming vec to be dealt with later rather than dealt with immediately.

djc commented 1 month ago

Thanks for all the effort here!

quinn-rs / quinn

Fix race condition between 0-RTT and Incoming #1821