Closed cjpatton closed 4 years ago
Even when they are the same entity, I don't think synchronizing the DNS response with the rollback is meaningful. As you note, the OS has a cache. The client may have a cache in front of that. The recursive resolver may have a cache. There are probably other caches in various DNS middleware. There is also no guarantee that the same server instance will serve the client's DNS query and the client's TLS connection, which means that during the course of a rollout or rollback, there will be mismatches.
Turtles caches all the way down!
There is also no guarantee that the same server instance will serve the client's DNS query and the client's TLS connection, which means that during the course of a rollout or rollback, there will be mismatches.
Er, by "mismatch" here, I mean that the client will see DNS and TLS configs from different generations. If a very careful server operator carefully controls changes based on TTLs and deployment times, they may be able to arrange for all observed cross-generation configs to be compatible. (And indeed they should arrange for TLS servers to know about ECH keys before advertising them, etc.)
But this is the careful, happy case, not a failure recovery case.
That's a simplifying assumption and doesn't always hold. Even within an enterprise, it's not uncommon for the DNS folks to be a separate group from those running the webservers.
Very true, and I don't think we should assume this is the case. I think (3) is the best option for many deployments, but there are use cases for which (2) is much better, assuming the server can manage the DNS/TLS synchronization complexity. It might be worth exploring a hybrid approach: in its ECHConfig, the server might indicate what it confirms: acceptance a la (3), or rejection a la (2).
The main concern I have with (2) is that the client needs to evict its cache before making the DNS request, and I'm not sure how platform-dependent this behavior is.
The client can't evict the recursive resolver's cache. I'm not sure if even the OSes provide APIs to clear caches. (I don't see an obvious flag to pass into getaddrinfo.)
We also should not have two different spellings of the same thing in the protocol. It is complex enough as it is.
So the only way to safely roll back for option (2) is to wait until the DNS record expires. Does anyone have a sense of the degree to which clients respect the record's TTL? Google measured clock skew among Chrome clients years ago and found it was pretty dismal. Is the state of affairs any better today?
In any case, counting on clients to get DNS right appears to be risky. If we go with (2), then it seems the best option on the table so far is to use trial decryption to distinguish between ECH acceptance and (the unlikely case of) ECH rollback.
Let me make one more pitch for (something like) option (2). As @grittygrease pointed out, we have largely ignored a potentially important "don't stick out" consideration. The goal of (3) is to make connections from a real ECH client to an ECH server look like connections from a dummy ECH client (i.e., one that sends a GREASEd extension) to an ECH server. A property that (0,2) have that (1,3) don't is that connections from a real ECH client to an ECH server look like connections from a dummy ECH client to a non-ECH server. In other words, options (1,3) don't provide covertext for non-ECH servers, whereas (0,2) do. (ECH rejection sticks out for (2), but the happy path doesn't.) Do we regard this as a risk to deployment?
In any case, counting on clients to get DNS right appears to be risky. If we go with (2), then it seems the best option on the table so far is to use trial decryption to distinguish between ECH acceptance and (the unlikely case of) ECH rollback.
If we do that, we haven't addressed this issue. If clients still need to implement trial decryption for one case, however unlikely, we're still paying for it and there's no point in building a separate thing. How common a codepath is affects performance considerations, but not complexity considerations. The issue with trial decryption is complexity, not performance. (Trial decryption also breaks some in-place decryption strategies, so there can be a performance concern too, but it's just one record so I'm not concerned about that.)
A property that (0,2) have that (1,3) don't is that connections from a real ECH client to an ECH server look like connections from a dummy ECH client to a non-ECH server. In other words, options (1,3) don't provide covertext for non-ECH servers, whereas (0,2) do. (ECH rejection sticks out for (2), but the happy path doesn't.) Do we regard this as a risk to deployment?
Right, I think this is the ServerHello.random vs. new extension question for (3). Sticking the indicator in ServerHello.random makes the full cross product of {ECH-client, GREASE-client} x {ECH-server, non-ECH-server} look the same, provided the server supports TLS 1.3. This is nice, but it's a weird one-off trick we can't do again. Sticking the indicator in a new extension also makes the same cross product look the same, provided the server supports TLS 1.3 and has been updated to send this extension. It can send this extension independent of ECH support, but it's not a thing anyone does today because the extension doesn't exist, so the deployment curve will be different.
In contrast, (2) is missing coverage. It makes the following tuples look the same: (ECH-client, ECH-server), (ECH-client, non-ECH-server), (GREASE-client, non-ECH-server). It misses (GREASE-client, ECH-server). In particular, clients may be ECH-capable (and thus know to send GREASE extensions) but not configured with a DoH resolver and unable to get HTTPS records over Do53 (either due to cleartext problems or ossification).
The issue with trial decryption is complexity, not performance.
Agreed, I'm just reiterating that we haven't solved the problem with (2) if we can't solve the problem with client-side DNS.
@davidben, do you expect clients to send GREASE on all connections or only connections for which DoH is available? If you expect clients to send a dummy ECH in situations where the ECHConfig is potentially unavailable, do you expect the server to send ECHConfig back in the handshake and the client to restart the handshake? That seems like a pretty big performance hit.
do you expect clients to send GREASE on all connections or only connections for which DoH is available?
I think they should send it for all connections. That was a big part of the motivation.
If you expect clients to send a dummy ECH in situations where the ECHConfig is potentially unavailable, do you expect the server to send ECHConfig back in the handshake and the client to restart the handshake? That seems like a pretty big performance hit.
No, clients don't process retry configs on GREASE connections.
Offering a GREASE extension is not considered offering an encrypted ClientHello for purposes of requirements in {{client-behavior}}.
Possibly the spec should be clearer here. The intent is that this is a different mode altogether. (Probably the business around sessions remembering whether ECH was negotiated can be dropped too now that we encrypted the whole ClientHello. That was originally added to work around some goofiness between the public and private names. Edit: filed https://github.com/tlswg/draft-ietf-tls-esni/issues/285)
That was an intentional limitation in at least the first iteration of the retry flow. Picking up a retry config without a DNS lookup is odd for several reasons. As you note, there is a performance penalty to the retry. More importantly, the client has already leaked the name at that point. It'd really only be useful for subsequent connections and the text intentionally only applies the retry to one connection attempt. Trying to solve it for subsequent connections would be interesting, but there are several nuisances to resolve:
Given all that mess, I omitted it from the PR when proposing this mechanism and figured we'd think about these issues later if the WG wanted to pursue a non-DNS flow.
The DNS expiration complaint seems like overthinking a bit.
Step 1: Update Registry to remove DS
Step 2: Wait until DNS caches expire
Step 3: Remove zone keys (KSK, ZSK, RRSIG, etc.)
This is done pretty frequently, and the servers take the risk of the site having an outage if the client has record synchronization issues.
In fact, RRSIG records have explicit expiration times, which makes them less flimsy with respect to expiration. If we follow the lead of RRSIG and add an expiration time to ECHConfig, then we're only relying on clock synchronization during rollover rather than DNS cache expiration.
How about: 1) add a time box to the ECHConfig record 2) recommend only sending GREASE in the same situations as 10.2. describes: when you expect to reliably get the ECHConfig record if it exists (i.e. DNSSEC or DoH)
In fact, RRSIG records have explicit expiration times, which makes them less flimsy with respect to expiration.
I think this is not quite right. When all RRSIGs in the zone are expired, the status is 'Bogus', not 'Insecure'. In other words, DNSSEC fails hard when the validation expires, and relies on caches to respect TTL. This is a security feature to prevent an attacker from resurrecting expired data. This arguably supports your overall argument, but not your proposed mitigation.
From this discussion, it sounds like trial decryption (Option 0) is only modestly inconvenient for TLS/TCP. If so, that makes me think that we should focus on a simple, separate Option 3 extension only for QUIC, and keep TLS/TCP at Option 0.
Hi folks, in order to help drive the discussion, I've created PRs for the options currently being discussed.
From this discussion, it sounds like trial decryption (Option 0) is only modestly inconvenient for TLS/TCP. If so, that makes me think that we should focus on a simple, separate Option 3 extension only for QUIC, and keep TLS/TCP at Option 0.
Based on the discussion on #283, most people seem to not favor supporting this behavior.
Hi all, a quick update for those who haven't been following the proposals:
It seems that consensus is coalescing around #287 because it minimizes deployment complexity and sticks out less than #283. The open issue for this change is security.
@chris-wood and I reached out to a variety of people who have worked on security proofs of TLS 1.3 to see how this change might impact their analysis. While this change is significant enough to require generating fresh proofs, no one expects it to lead to an attack if the confirmation string is sufficiently short. The current proposal uses the last 8 bytes of the SH.random, which leaves 24 bytes of entropy to ensure uniqueness of the session id. I added discussion of this point to the PR... it would be helpful to get more eyes on this.
2 and 3 talk to the same PR?
(edited to remove the email cruft)
Oops, fixed!
To follow up on the comment I made at the mic and then decided didn't work.
Assuming we accept PR#292, and decide the CHInner.Random is secret then can we just say that the ESNI accepted signal is to have the low order bytes of SH.Random be derived from CHInner.Random (copied might work, but hashed would make me feel better). I haven't done any real analysis of this, but it seems like it would not permit an attacker who does not know CHInner.Random to determine whether ECH was accepted.
I think we should consider a construction like Expand(Extract(ServerHello.random[0:24], CHInner.random), "ech-tag", 8), i.e. to make the tag dependent on the rest of ServerHello.random. This would at least partly address @huitema's concern about replays.
I don't see the replay concern as important, since all it does is reveal if ECH was used. There are easier ways for an attacker to learn this information. In fact, it doesn't need to interfere with the connection at all: all it needs to do is learn the ECH configuration.
I think it's best to keep the mechanism as simple as possible. In particular, I'd like to do everything we can to not increase the requirements for the backend server in Split Mode.
Of course, if there is an attack that violates the intended security goal of ECH (confidentiality of CH extensions), then we should take that seriously. But I don't think this change (i.e., #287) increases this risk compared to the status quo.
These attacks aren't part of our core threat model, but it seems like we have an opportunity to defeat some or all of them at low cost, so I think we should consider doing so.
There are easier ways for an attacker to learn this information. In fact, it doesn't need to interfere with the connection at all: all it needs to do is learn the ECH configuration.
This is true in the main deployment models we're discussing, but I can also imagine use cases where the ECHConfig is not available to the attacker.
I think we should consider a construction like Expand(Extract(ServerHello.random[0:24], CHInner.random), "ech-tag", 8), i.e. to make the tag dependent on the rest of ServerHello.random. This would at least partly address @huitema's concern about replays.
That doesn't work. The attacker just needs to replay the entire server random. If you want protection you need to mix in the server's key share.
@bemasc
These attacks aren't part of our core threat model, but it seems like we have an opportunity to defeat some or all of them at low cost, so I think we should consider doing so.
If we're going to go down this road, then I think we need to take a step back and think about our "don't stick out" threat model in more detail. Currently our requirement is that a passive observer, who doesn't know the configuration, is unable to distinguish real ECH usage from the "cover traffic" provided by clients who "GREASE" the ECH extension. The attackers mentioned so far are active and may know the config. So let's start here: do we anticipate an attacker this powerful? So far we've mostly been talking about "don't stick out" in terms of dumb middleboxes that we don't want ossifying on our extension. The current threat model captures this pretty well, I think. If we want to go for something stronger, then we clearly need to re-think the design of #287 (or decide we shouldn't do it).
Something to keep in mind is that indistinguishability of the "real" protocol from some "cover" protocol is a property that TLS was never designed to have. It seems to me that the task of endowing TLS with some sort of steganographic security property goes way beyond this one extension. It's an interesting and valuable goal, but one that should be addressed in a more general way.
For ECH, I think we should focus our efforts on coming up with a design that we feel we can deploy today, and iterate and re-deploy as needed.
I agree with @cjpatton that there is value in simplicity. A really stealthy ESNI would be a different design than ECH.
+1 -- ECH is not about censorship circumvention, or being stealthy.
@chris-wood you should maybe expand a bit on that. If you are not trying to defeat some form of censorship, then why are you hiding the SNI in the first place?
The problem I'm focusing on is an "attack" so trivial it could almost happen by accident. If a ClientHello is issued twice, verbatim, and elicits two independent ServerHellos, an observer can see whether the last 8 bytes of .Random are the same in both responses. This might happen sporadically to DTLS or QUIC in some configurations, even without an active attacker.
The formula I proposed above avoids this repetition. If we're going to use a hash, as EKR suggested, this calculation seems like a pretty natural way to do it.
I definitely don't want to slow down progress, and I'm not proposing that we substantially expand our threat model. I do think closing trivial attacks has some value even if more advanced ones still exist. For example, the other attacks may be more difficult or less deniable.
@chris-wood you should maybe expand a bit on that. If you are not trying to defeat some form of censorship, then why are you hiding the SNI in the first place?
Censorship is, for example, active blocking of a connection based on the name, whereas ECH hides SNI (and other things) from those that just passively snoop and try to learn about clients.
EDITED TO FIX PROPOSAL 3.
@bemasc
The problem I'm focusing on is an "attack" so trivial it could almost happen by accident.
I agree that it would be worth mitigating this attack, as long as the mechanism isn't too complicated. Let's consider the "replay protection" properties of the current proposals. Suppose the attacker wants to learn if a client offered ECH, so it replays the ClientHelloOuter to the server. Here are the proposals (please chime in if I got this wrong!)
1. accept_confirmation = getrandom(8)
2. accept_confirmation = Hash(ClientHelloInner.random)
3. accept_confirmation = Hash(ServerHello.random[0:24] + ClientHelloInner.random)
where Hash is something like Expand(Extract( . , some_salt), some_info, 8). (Though since the ikm is a random string, I think it would suffice to just call Extract( . , some_info, 8).)
Neither 1 nor 2 mitigates the attack, but 3 does. All options "stick out" the same if the ClientHelloInner is known, e.g., if the adversary is on-path from the client-facing server to the backend server.
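To make the replay argument concrete, here is a minimal Python sketch of proposals 2 and 3 as listed above. It assumes SHA-256 and uses placeholder salt and info strings (the thread leaves some_salt and some_info unspecified), so the helper names and constants are illustrative only, not part of any proposal.

```python
import hashlib
import hmac
import os

HASH = hashlib.sha256  # assumption: SHA-256; the thread does not fix a hash here


def extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): PRK = HMAC(salt, IKM)."""
    return hmac.new(salt or bytes(HASH().digest_size), ikm, HASH).digest()


def expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869), truncated to `length` bytes."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


def tag_hash(data: bytes) -> bytes:
    # Placeholder salt (empty) and info string, chosen only for illustration.
    return expand(extract(b"", data), b"ech-tag", 8)


def proposal_2(ch_inner_random: bytes) -> bytes:
    # Tag depends only on ClientHelloInner.random.
    return tag_hash(ch_inner_random)


def proposal_3(sh_random_prefix: bytes, ch_inner_random: bytes) -> bytes:
    # Tag also depends on the first 24 bytes of ServerHello.random.
    return tag_hash(sh_random_prefix + ch_inner_random)


# The same ClientHello reaches the server twice (a replay or retransmission);
# each ServerHello gets a fresh 24-byte random prefix.
ch_inner_random = os.urandom(32)
prefix_a, prefix_b = os.urandom(24), os.urandom(24)

repeats_under_2 = proposal_2(ch_inner_random) == proposal_2(ch_inner_random)
repeats_under_3 = (proposal_3(prefix_a, ch_inner_random)
                   == proposal_3(prefix_b, ch_inner_random))
print("proposal 2 tag repeats:", repeats_under_2)
print("proposal 3 tag repeats:", repeats_under_3)
```

Under proposal 2 the tag is a function of the ClientHello alone, so two responses to the same ClientHello carry the same last 8 bytes; under proposal 3 the fresh prefix changes the tag except with negligible probability.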
Incidentally, proposals 2 and 3 are an improvement over 1 since we don't have to send an extension in the ClientHelloInner. On the downside, the backend server needs to know how to instantiate Hash, i.e., it needs to know the HPKE cipher suite. We could get around this by using the hash from the TLS cipher suite.
@cjpatton In this shorthand, my proposal is more like Hash(ServerHello.random[0:24], ClientHelloInner.random). This avoids the leak that you identified.
Ah, you're right. My apologies! Fixing above.
I'd be fine with 2 or 3, though we should use the TLS cipher suite instead of the HPKE cipher suite so that the backend server doesn't need to know the latter.
@bemasc proposes accept_confirmation = Hash(ServerHello.random[0:24] + ClientHelloInner.random)
My proposal would be: accept_confirmation = Hash(ServerHello.KeyShare + ClientHelloInner.random)
The rationale is that merely hashing the remainder of the server random is insufficient. The attacker could just do the attack I delineated in issue #287 by copying the whole ServerHello.random[0:32] instead of just copying ServerHello.random[24:32]. But if you mix the server key share in the hash, then the attacker cannot do that without also copying a key share for which the private key is unknown.
EDITED AFTER DISCUSSION WITH @chris-wood
Roger that. Here's what we have on the table:
1. accept_confirmation = getrandom(8)
2. accept_confirmation = PRF(ClientHelloInner.random, "")
3. accept_confirmation = PRF(ClientHelloInner.random, ServerHello.random[0:24])
4. accept_confirmation = PRF(ClientHelloInner.random, ServerHello.KeyShare)
Let's instantiate PRF( . , . ) with Expand( . , . , 8), where Expand is for the TLS cipher suite (and not HPKE). Proposals 1 and 2 are for the status-quo threat model, i.e., the "don't stick out" distinguisher is passive; proposal 3 provides additional "don't stick out" protection in case the CH is replayed; and proposal 4 improves on 3 by providing some protection against manipulation of the SH.
My preference is proposal 2, since it simplifies the extension. I would be fine with 3 or 4, though I'm not convinced that either fully addresses the stronger threat model.
Hmm, on second thought I'm not so sure how much simpler 2 is than 1. The ClientHelloInner would still have to carry some sort of indication of ECH acceptance so that the backend server knows to confirm. But an empty "encrypted_client_hello" extension (or maybe a new code point?) would do just fine.
Something weird about 4 is that the backend server has to wait to finish the ServerHello.random until it generates a key share. This might add a bit of complexity, though it depends on the code base.
@cjpatton Yes, incorporating the key share is more complex. But let's look at what we are doing, replacing trial decryption by a hint. Trial decryption generates complexity, especially in the QUIC mapping, but the result is unambiguous and hard to fool. The client knows for sure whether the key was generated from the inner CH or the outer CH, and it is very hard for third parties to partially fool the client. The hint introduces another failure mode, i.e. wrong hint value, and I believe it can be exploited. For protection, the code has to be almost as hard to fool as trial decryption. That's what I am trying to achieve by incorporating the server key share in the mix.
There are of course implementation issues. The server has to know what key share it will use before generating Server.Random. That may or may not be easy to do, depending on implementation. The KeyShareEntry value does not depend on the Server.Random value, so this is definitely possible. But the code path depends on the implementation, and it may be more difficult for some stacks than for others.
Hi all, I added a commit to #287 that implements @bemasc's suggestion. Specifically, accept_confirmation (i.e., the last 8 bytes of ServerHello.random) is computed as
HKDF-Expand-Label(
    HKDF-Extract(0, ClientHello.random),
    "tls13-ech-accept-confirm",
    ServerHello.random[0:24],
    8
)
where HKDF-Extract and HKDF-Expand-Label are as defined in RFC8446. Doing Extract-then-Expand ensures that we don't run into any issues with the length of the ClientHello.random not matching the Hash.length in the TLS stack.
Please have a look to make sure it's spelled correctly.
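For anyone checking the spelling, here is a rough Python sketch of the computation described above, following the HKDF-Extract and HKDF-Expand-Label definitions in RFC 8446 and RFC 5869. It assumes a SHA-256 cipher suite; the function and parameter names are made up for illustration and do not come from the PR.

```python
import hashlib
import hmac
import struct

HASH = hashlib.sha256  # assumption: the negotiated suite uses SHA-256


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869): PRK = HMAC(salt, IKM)."""
    return hmac.new(salt, ikm, HASH).digest()


def hkdf_expand(prk: bytes, info: bytes, length: int) -> bytes:
    """HKDF-Expand (RFC 5869), truncated to `length` bytes."""
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), HASH).digest()
        okm += block
        counter += 1
    return okm[:length]


def hkdf_expand_label(secret: bytes, label: bytes, context: bytes,
                      length: int) -> bytes:
    """HKDF-Expand-Label (RFC 8446, Section 7.1)."""
    full_label = b"tls13 " + label  # the prefix is added by Expand-Label itself
    hkdf_label = (struct.pack("!H", length)
                  + bytes([len(full_label)]) + full_label
                  + bytes([len(context)]) + context)
    return hkdf_expand(secret, hkdf_label, length)


def accept_confirmation(client_hello_random: bytes,
                        server_hello_random_prefix: bytes,
                        label: bytes = b"tls13-ech-accept-confirm") -> bytes:
    # "0" as the salt means Hash.length zero bytes, as in RFC 8446.
    zero_salt = bytes(HASH().digest_size)
    prk = hkdf_extract(zero_salt, client_hello_random)
    return hkdf_expand_label(prk, label, server_hello_random_prefix, 8)


tag = accept_confirmation(bytes(range(32)), bytes(range(24)))
print(tag.hex())
```

The default label is the one written in the comment above. Because HKDF-Expand-Label prepends "tls13 " on its own, that label is effectively prefixed twice.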
HKDF-Expand-Label adds a "tls13 " prefix to the label, so I think you can shorten the label.
I agree, we need HKDF-Extract() for Hash.length > 32 (e.g. SHA-512). Given the need for HKDF-Extract(), it would seem more natural to me to put ServerHello.random[0:24] in the extraction salt, and use HKDF-Expand instead of HKDF-Expand-Label.
@bemasc
HKDF-Expand-Label adds a "tls13 " prefix to the label, so I think you can shorten the label.
Good call! Fixing. This reminds me that we need to do a pass of the spec to ensure all the constants have the same structure.
Given the need for HKDF-Extract(), it would seem more natural to me to put ServerHello.random[0:24] in the extraction salt, ...
I disagree. In any case, the salt being Hash.length bytes long avoids indifferentiability issues [1].
... and use HKDF-Expand instead of HKDF-Expand-Label.
What does this buy us?
I'm not familiar with that paper, but Section 4.3 seems to say that HKDF is suitably indifferentiable without any such restriction on the salt length.
Using HKDF-Expand instead of HKDF-Expand-Label would seem to make use of fewer, better-analyzed constructions, but I'm not aware of a practical difference, so if HKDF-Expand-Label is more convenient to implement for some reason then that seems like enough justification.
I'm not familiar with that paper, but Section 4.3 seems to say that HKDF is suitably indifferentiable without any such restriction on the salt length.
There are many "safe" salt lengths. I'm not sure 24 is "safe", but I know Hash.length is.
Using HKDF-Expand instead of HKDF-Expand-Label would seem to make use of fewer, better-analyzed constructions, but I'm not aware of a practical difference, so if HKDF-Expand-Label is more convenient to implement for some reason then that seems like enough justification.
I don't think one is any harder than the other. The only difference between them is that HKDF-Expand-Label exposes an additional context parameter, which I think aligns a bit better with what we're doing here.
If you'd like to keep pushing for these changes, then please follow up by making a comment on the PR.
I appreciate the safety concerns, but you are going to extract an 8-byte hint from the hash. That's a serious step down from 32 or even 16 bytes, and with such a short length I would be really surprised if two different hash constructs resulted in any security difference!
Hahaha, yeah. We need the 8 bytes to be pseudorandom, and I think the current design is defensible from a provable security perspective. We may be able to do a bit better. What do you think of this, @bemasc?
accept_confirmation = HKDF-Extract(ServerHello.random[0:24] + 0^{Hash.len-24}, ClientHello.random)[0:8]
This is valid as long as Hash.len >= 24, which I believe is guaranteed by RFC8446.
That's fine with me, although I'm not sure why you need to pad the salt. (HKDF-Extract will pad it for you.)
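On the padding question, a quick Python check may help. It assumes SHA-256 and relies on the HMAC definition in RFC 2104, where a key shorter than the hash block size is zero-padded on the right; since HKDF-Extract is HMAC keyed with the salt, a 24-byte salt and the same salt padded with zeros to Hash.len should give the same output. The variable names are illustrative.

```python
import hashlib
import hmac


def hkdf_extract(salt: bytes, ikm: bytes) -> bytes:
    """HKDF-Extract (RFC 5869) with SHA-256: PRK = HMAC(salt, IKM)."""
    return hmac.new(salt, ikm, hashlib.sha256).digest()


hash_len = hashlib.sha256().digest_size      # 32 bytes for SHA-256
salt_24 = bytes(range(1, 25))                # stand-in for ServerHello.random[0:24]
salt_padded = salt_24 + bytes(hash_len - 24) # salt + 0^{Hash.len-24}
ikm = bytes(range(32))                       # stand-in for ClientHello.random

same_output = hkdf_extract(salt_24, ikm) == hkdf_extract(salt_padded, ikm)
print("padded and unpadded salts agree:", same_output)
```

The padding happens inside HMAC rather than in HKDF-Extract as such, which may be why it is not obvious from RFC 5869 alone.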
(HKDF-Extract will pad it for you.)
Roger that.
Are you happy with this @huitema?
(HKDF-Extract will pad it for you.)
Hmm, looking at RFC5869, it's not clear to me that the salt is padded by this function. I think I prefer the following:
accept_confirmation = HKDF-Extract(0, ClientHelloInner.random + ServerHello.random[0:24])[0:8]
Updated #287 with this change.
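Here is a small Python sketch of how the two sides might use this formula, assuming SHA-256 and treating "0" as Hash.length zero bytes. The function names are hypothetical and the handshake is reduced to just the two random values.

```python
import hashlib
import hmac
import os


def accept_confirmation(ch_inner_random: bytes, sh_random_prefix: bytes) -> bytes:
    # HKDF-Extract(0, ClientHelloInner.random + ServerHello.random[0:24])[0:8]
    zero_salt = bytes(hashlib.sha256().digest_size)
    prk = hmac.new(zero_salt, ch_inner_random + sh_random_prefix,
                   hashlib.sha256).digest()
    return prk[:8]


def make_server_hello_random(ch_inner_random: bytes) -> bytes:
    """Server that accepted ECH: 24 fresh random bytes followed by the tag."""
    prefix = os.urandom(24)
    return prefix + accept_confirmation(ch_inner_random, prefix)


def client_sees_acceptance(ch_inner_random: bytes, sh_random: bytes) -> bool:
    """Client recomputes the tag from the first 24 bytes and compares."""
    expected = accept_confirmation(ch_inner_random, sh_random[:24])
    return hmac.compare_digest(expected, sh_random[24:32])


inner_random = os.urandom(32)
sh_random = make_server_hello_random(inner_random)
print("accepted:", client_sees_acceptance(inner_random, sh_random))
```

A server that did not accept ECH would send 32 uniformly random bytes, so the client's comparison fails except with probability about 2^-64.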
I think that's a fine implementation of the suggestion made by @bemasc . I am waiting for the resolution of the "don't stick out" issue on the TLS mailing list.
The decision in today's interim meeting is to merge #287 as-is and reconsider the "don't stick out" threat model later on. In particular, we won't be adopting Karthik's suggestion from the mailing list for this PR. @ekr also pointed out that it could be done as an ECH extension.
Closing now that #287 landed.
In the current spec, the server provides no indication of whether the inner or outer ClientHello (CH) was used. This means the client must do trial decryption to make this determination, which creates complexity and potentially raises security concerns. As such, it would be useful to explore possible alternatives. In order to drive the discussion, I'll provide a few simple alternatives below, which we can refine as folks provide feedback. (The current spec, draft-07, is listed as option (0) for comparison.)
Besides implementation complexity, one of our design considerations is ensuring that middleboxes don't ossify on ECH. As such, indication of ECH usage should "stick out" (see draft-ietf-tls-sni-encryption, Sec 3.4) as little as possible.
For our purposes, "do not stick out" means a middlebox that observes connections between the client and the client-facing server can't distinguish between real ECH and "dummy" ECH (i.e., a "GREASEd" extension, as described in Section 7.4). We assume the middlebox doesn't know the ECH configuration or the public-facing name. (Note that this rules out adversaries such as the GFW, which can actively probe to discover this information.)
Option (0): Do not indicate usage
Protocol flow:
ech (i.e., encrypted_client_hello), it uses the inner CH; and if the server rejects or does not support ECH, then it uses the outer CH. It proceeds with the handshake as normal, except that in case of rejection, it sends an ech extension in its EE with the updated ech configuration.
ech configuration if applicable.
Pros
Cons
Spec changes: None.
Option (1): Publicly indicate acceptance
Protocol flow:
ech, it uses the inner CH; and if the server rejects or does not support ech, then it uses the outer CH. If the server accepts, then it adds an empty ech extension to its SH; if the server rejects, then it adds an ech extension to its EE with the updated ech configuration; and if the server doesn't support ech, then it proceeds as normal.
ech extension, then the client proceeds as normal, assuming the inner CH was used; otherwise, the client proceeds as if the outer CH was used, updating its ech configuration if applicable.
Pros
Cons
Spec changes: Semantics of the ech extension change; changes are needed to accommodate "Split Mode".
Option (2): Publicly indicate rejection
Protocol flow:
ech, it uses the inner CH; and if the server rejects or does not support ech, then it uses the outer CH. If the server accepts or does not support ech, then it proceeds as usual; and if the server rejects, then it adds an ech extension to its SH with the updated ech configuration.
ech extension, then the client proceeds as if the outer CH was used and updates its ech configuration; otherwise, the client proceeds as if the inner CH was used. Decryption failure indicates either that the server does not support ech (i.e., outer CH was used) or the connection is under attack.
Pros
Cons
ech to a server that has turned off support for the extension, then the connection will fail hard, as the client assumes lack of signal means that ech was accepted. (We could ameliorate this problem, at the cost of added complexity on the client side implementation.)
Spec changes: Semantics of the ech extension change; ech configuration update is sent in the clear. (We could avoid this by sending the new configuration in a new extension in the EE.)
Option (3): Privately indicate acceptance
It may be worth considering an alternative to Option (1) that doesn't stick out as much. Namely, it's possible to make ech acceptance in the SH indistinguishable from ech rejection.