Closed: davidben closed this 4 months ago
https://github.com/tlswg/tls13-spec/issues/1341 is an obvious implementation error; I don't think we need explicit language for it.
That said, I think that you need to describe what happens when you do receive a second update request.
Well, it's a little tricky. On the receiver side, you probably queue up the KeyUpdate response immediately after you saw the KeyUpdate request. So that means you observe the second KeyUpdate request after you've sent the response. So it's indistinguishable from a very, very fast network and a peer that implemented a policy of "request a KeyUpdate after every record".
It's just that that's a stupid policy and should definitely trip DoS limits. The problem is that the server misbehavior makes otherwise reasonable policies behave even worse than the stupid policy.
Without this behavior, you can make some very useful simplifying assumptions in implementations.
What kind of assumptions? You need to process the messages sequentially anyway; I don't see how doing lookahead makes the code simpler...
One assumption you can make is that sending you a bunch of KeyUpdate requests in a row is unreasonable and can be bucketed under DoS limits.
This is exactly the problem we've been having with the buggy implementation. A TLS implementation that says "I will request that the peer update after 100 MB (to pick an arbitrary number) on the same key" is reasonable given the mechanisms in the spec. If it's reasonable, DoS limits should admit it.
However, if a valid implementation of this policy is to request a KeyUpdate after every record, it is not possible to admit that without also admitting DoS in channels with a high BDP or a highly asymmetric send rate. This puts the TLS stack in a bind.
Instead, we can observe that there is no reason to ever request a KeyUpdate if you already have one in flight. The peer isn't going to see your new one until it sees your previous one. If you haven't gotten a response yet, sending a new one won't help. (The peer might not have anything to write right now, or the other channel may be slow or blocked on flow control.)
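For concreteness, a minimal sketch of what that gating could look like (this is mine, not text from the PR; the names and the 100 MB threshold are illustrative, not from the spec or any particular stack):

```go
// Hypothetical sender-side bookkeeping for the threshold policy above, gated
// on a single "request outstanding" bit.
type updateRequester struct {
	bytesSincePeerUpdate uint64 // peer data read under the peer's current keys
	requestOutstanding   bool   // we sent update_requested and have not seen any KeyUpdate back
}

const rekeyThreshold = 100 << 20 // "ask the peer to rekey after ~100 MB"

// onPeerData is called for each application-data record read from the peer and
// reports whether a KeyUpdate with update_requested should be sent now.
func (u *updateRequester) onPeerData(n int) bool {
	u.bytesSincePeerUpdate += uint64(n)
	if u.bytesSincePeerUpdate < rekeyThreshold || u.requestOutstanding {
		// Either under the threshold, or a request is already in flight; the
		// peer cannot see a second request before it sees the first one.
		return false
	}
	u.requestOutstanding = true
	return true
}

// onPeerKeyUpdate is called when any KeyUpdate is read from the peer. We cannot
// tell whether it is "the" response, so any update from the peer clears the flag.
func (u *updateRequester) onPeerKeyUpdate() {
	u.bytesSincePeerUpdate = 0
	u.requestOutstanding = false
}
```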
Sending that extra KeyUpdate is truly pointless and, as you say, an obvious implementation error... once one points it out. Yet we know for a fact that it was not obvious when the spec didn't mention it, because a major TLS implementation didn't think to handle this! Thus we should mention it in the spec.
PS: Although it is tricky for an implementation to directly observe this (except when it has gone so wrong as to hit DoS limits), it is obvious how to write a test for this behavior in a test stack. Just make the test stack indefinitely defer the KeyUpdate response, simulating a huge BDP, and assert you don't get a new KeyUpdate request despite continuing to trip whatever threshold the TLS implementation picked.
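Roughly, such a test could look like the following sketch. peerStack and its methods are hypothetical stand-ins for whatever driver a test stack exposes, and the 4 GiB of traffic is just an arbitrary amount chosen to blow past any plausible threshold:

```go
package keyupdate_test

import "testing"

// Hypothetical harness: peerStack stands in for the TLS implementation under
// test, driven by a test stack that never delivers a KeyUpdate response
// (simulating a huge BDP).
type peerStack interface {
	FeedAppData(n int)       // the peer reads n more bytes of our application data
	TakeUpdateRequests() int // update_requested KeyUpdates it emitted since the last call
}

func testNoRedundantUpdateRequest(t *testing.T, peer peerStack) {
	requests := 0
	// Push data well past any plausible rekey threshold while indefinitely
	// deferring our KeyUpdate response.
	for i := 0; i < 4096; i++ { // 4096 * 1 MiB = 4 GiB
		peer.FeedAppData(1 << 20)
		requests += peer.TakeUpdateRequests()
	}
	if requests > 1 {
		t.Errorf("got %d update_requested KeyUpdates with the response deferred; want at most 1", requests)
	}
}
```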
I don't think "DoS" is enough of a reason for "MUST" language. At most, I think it should be a "MAY".
> Instead, we can observe that there is no reason to ever request a KeyUpdate if you already have one in flight.
That depends on how you define "in flight". If it's TCP in flight, then maybe; if it's "TLS in flight" (as in, not yet read and interpreted by the TLS implementation), then no: there is no way for the sender to know when it was read. And with modern Internet speeds resulting in hundreds of megabytes of data actually in flight, having multiple KeyUpdate messages in flight is reasonable.
So, either the text should describe the situation much more precisely, or it should spell out the whole issue and make recommendations on how to handle it.
That faster Internet speeds allow for a higher BDP is exactly why there's a risk of problems with multiple update_requested KeyUpdates in flight. There is no reason for a TLS stack to request a KeyUpdate from the peer if it has already sent one and the peer hasn't reacted. It is not reasonable to have multiple update_requested KeyUpdates in flight.
I noticed you wrote "having multiple KeyUpdate messages in flight" without the qualifier. Perhaps that is what you are missing here. There are two kinds of KeyUpdates in TLS 1.3. If you're simply updating your own stream (update_not_requested), then yes, having several in flight is reasonable. That is not what this PR is talking about. When you are specifically requesting that the peer update its key, it does not make sense to send one when the previous one hasn't been acted on.
If your stack is simply sending a KeyUpdate every few MB, this PR does not impact you. Those will be update_not_requested. It's the update_requested ones that are relevant: when the stack is requesting that the peer update, often based on counting the peer's bytes.
Yes, I am aware we're talking about update_requested ones.
I'm talking about a situation where the client can anticipate that a key update will be necessary for the server's sending side. If I can have 200 MiB in flight, and I want a key update from the server every 100 MiB, then I need to have up to two update_requested key updates in flight.
So maybe specify it explicitly like that? "There must be no more than n+1 KeyUpdates in flight with no response, where n is the size of the in-flight data divided by the intended rekey interval"?
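Read literally (my arithmetic, not text from the thread), that proposed bound would be:

```go
// Sketch of the bound proposed above; purely illustrative.
func maxUnansweredRequests(inFlightBytes, rekeyInterval uint64) uint64 {
	n := inFlightBytes / rekeyInterval
	return n + 1
}

// With the example numbers (200 MiB in flight, a 100 MiB rekey interval),
// n = 2, so up to 3 unanswered update_requested KeyUpdates would be allowed.
```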
> If I can have 200 MiB in flight, and I want a key update from the server every 100 MiB, then I need to have up to two update_requested key updates in flight.
You want a key update from the server for every 100 MiB that you send? How can you know anything about the amount of traffic that the server is going to reply with (in any kind of generic case)?
Regardless, the server's KeyUpdate policy is ... the server's. Not the client's. Can you present a scenario where there is a strong requirement for the server to KeyUpdate for every 100 MiB of data the client sends, that cannot be handled by configuring policy on the server?
200 MiB and 100 MiB are just examples; consider them to be 4 GiB and 2 GiB respectively if that makes asking for a KeyUpdate at that frequency seem more reasonable.
> How can you know anything about the amount of traffic that the server is going to reply with (in any kind of generic case)?
I won't. But the point of the base TLS specification is to handle all the use cases, not an application-specific use case. And we don't need to search far to find examples where the client may know very well that the file it's downloading over HTTPS is a few terabytes large.
> Regardless, the server's KeyUpdate policy is ... the server's. Not the client's.
If that were the case, then we wouldn't have update_requested.
At the point that your BDP exceeds the limits on your AEAD, you really need the sender to handle this; as the other side, you cannot know what the sender will do in the future anyway. Really, you need to get a new AEAD. While TLS gets to play on easy mode by running over TCP, this same design in all its sibling protocols is stricter. Too many unacked key updates and you can't even reconstruct the epoch: https://www.rfc-editor.org/rfc/rfc9147.html#section-8 https://www.rfc-editor.org/rfc/rfc9001#section-6
As for why we have update_requested at all, I was actually the instigator for why KeyUpdate looks the way it does. 😛 This scenario was honestly always pretty fuzzy. The history here is that the original KeyUpdate did not have two modes at all. Instead it was a symmetric scheme where the two sides were forced to catch up to the same number of updates at all points. In fact, the channels' traffic secrets weren't separate ladders as they are now. There was one traffic secret for the two sides, and we KDFed "client key", "server key", "client iv", and "server iv" secrets from it, so it had to be lockstep. In that model, it doesn't make sense to wonder whether you're updating for yourself or the other side because it's symmetric anyway.
I observed that this design did not accommodate asymmetric sending rates at all and resulted in a DoS problem. In the extreme case, protocols over TCP may have one direction sending data while the other direction isn't running at all. (Indeed, it's quite common. Consider HTTP/1.1. An HTTP/1.1 client does not read while it is sending a request. An HTTP/1.1 server does not read while it is sending a response.) To fix this, we split the two ladders, at which point sides could just unilaterally update themselves. And, honestly, we could and probably should have left it at that.
However, unilateral-only updates lose the kinda speculative "I know more about this AEAD than you" case. There was some discomfort losing that, so this "update_requested, but the responses may be coalesced" design was born. But this case was always fuzzy and it was always understood that, in the limit, you really have to rely on the sender. Just a goofy belt-and-suspenders thing.
Indeed, the extreme example where sides write without reading is precisely where you need this limit. If you don't have it, the side requesting KeyUpdates will not only DoS the peer, it will also DoS itself. If I'm sending you a blast of data (consider an HTTP or SQL request that results in GBs of data), but not reading because it's not my turn to read, you might try to request KeyUpdates. If you try to be clever and clock them, you'll add more and more data to the pipe, and at some point the KeyUpdates themselves will exceed the transport buffers because I'm not reading at all. At that point, when I'm finally done sending, if you haven't managed to DoS yourself first, the torrent of update_requested messages finally releases and now you DoS me instead, because I'm suddenly draining a ton of data and churning the KDF.
This isn't theoretical. We saw this problem in the wild with a major client. We never finished patching the original KeyUpdate design for asymmetric rates. This PR is the last necessary step to do so. Limiting to send_rate / bdp does not make sense because, if one side isn't reading, the BDP does not figure into it.
Sure, if the sending side is not reading, then the client has no way of influencing its behaviour, so sending KeyUpdates asynchronously is pointless.
I'm not talking about this situation though. I'm considering a server that does the reading and writing asynchronously.
What it sounds like to me is that we basically need to define packet limits for the different AEADs and require the sender to send a KeyUpdate before they are reached, unprompted by the other side. Once we have this kind of requirement, we could either drop update_requested completely, or indeed make it so that only one unanswered one can be in flight (and document that this is specifically to prevent deadlocks with synchronous implementations).
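As a rough sketch of that sender-driven alternative (mine, purely illustrative; the record limit is a placeholder, not a number derived from any AEAD analysis):

```go
// Sender-driven rekeying, unprompted by the peer; illustrative only.
type senderRekey struct {
	recordsUnderKey uint64
}

// recordsPerKeyLimit is a placeholder for whatever per-key record limit the
// negotiated AEAD would get in such a scheme.
const recordsPerKeyLimit = 1 << 24

// beforeProtect is called before protecting each outgoing record and reports
// whether the sender must first send KeyUpdate (update_not_requested) and roll
// its own send keys.
func (s *senderRekey) beforeProtect() bool {
	if s.recordsUnderKey >= recordsPerKeyLimit {
		s.recordsUnderKey = 0 // counting restarts under the new key
		return true
	}
	s.recordsUnderKey++
	return false
}
```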
@tomato42 the kind of change you are proposing is out of scope for 8446-bis, which is largely about clarification, not redesign. If you want to propose this, you should write a new draft.
@davidben I think what you are proposing is potentially still in scope, but it would need WG consensus, as this document has already been WGLCed. Can you send an email to the list to alert the chairs?
@ekr to be clear, I'm trying to build an alternative solution; I'm not insisting on it. For the text itself, I'd be OK with a "SHOULD NOT" in place of "MUST NOT".
"SHOULD NOT" wouldn't allow for enforcement, so I'm opposed to that.
I'm not sure about a MUST NOT here. This would require additional state tracking that currently is not required (for example with an application API to trigger peer updates). I'm also not sure this is possible to enforce without risking interop problems, since you have no way of knowing whether a received key update is actually in response to the update_requested that you sent.
The spec already allows for responding to multiple update_requested with a single KeyUpdate, so I don't see how this is a DoS problem for the peer.
This would require one bit of state ("a KeyUpdate is outstanding") - that's not really that challenging.
Responding to multiple key update requests with a single coalesced update does not help in the situation described, since once you send the response, you have no idea (in the misbehaving-client case) how backlogged the client is on input before it will see your response. In a DoS situation, you are still faced with attempting to keep enough state to differentiate this bad behavior from an attacker sending constant update requests. While this change requires the sender to keep one bit of state, it also means that the recipient only needs to keep one bit of state to prevent a DoS (instead of an arbitrary limit chosen to try to allow legitimate clients while not getting DoSed), and it makes the protocol simpler all around.
Hence I support MUST NOT.
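For what it's worth, a sketch of the recipient-side bookkeeping this implies (mine; names are illustrative). The choice in the comment between aborting and merely counting against DoS limits is exactly the point under discussion:

```go
// Hypothetical responder-side bookkeeping. One bit records whether a KeyUpdate
// response is still owed; multiple incoming requests are coalesced into a
// single response, as the spec already permits.
type updateResponder struct {
	responseOwed       bool
	suspiciousRequests int // requests that arrived before we even sent the previous response
}

// onUpdateRequested is called when KeyUpdate with update_requested is read.
func (r *updateResponder) onUpdateRequested() {
	if r.responseOwed {
		// Under the proposed MUST NOT, the peer should not ask again before
		// reading our response, which we have not sent yet. Whether to abort
		// or merely count this against DoS limits is the question debated
		// above; here we just count it.
		r.suspiciousRequests++
		return
	}
	r.responseOwed = true
}

// pendingResponse reports whether a single coalesced KeyUpdate
// (update_not_requested) should be sent before the next application record,
// and clears the bit on the assumption that the caller sends it.
func (r *updateResponder) pendingResponse() bool {
	owed := r.responseOwed
	r.responseOwed = false
	return owed
}
```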
@ekr this PR is ready to merge; see mail.
Otherwise we run into the issue described in issue #1341.
Fixes #1341