Closed cjpatton closed 4 years ago
Thanks for writing this up! We've also been pondering this issue in the context of QUIC. In particular, the status quo (option (0)) requires the handshake and record layers to coordinate to implement the trial decryption. This would need to happen separately in TLS, DTLS, and QUIC.
This is especially a nuisance for QUIC as it will typically cross the TLS library’s public API. TLS must configure two different keys for QUIC, then QUIC must do some trial decryption, and it must report back to TLS which keys were used. Public APIs and changes in them require coordination across groups, so ideally such interfaces stay as simple as possible. As in #264, I think an option that stays entirely in the handshake would apply to QUIC better.
That (option (1) or something like it) would be my preference as well.
I would prefer Option 3, or something like it, if trial decryption is not acceptable. Martin Thomson's proposal from March (placing a signal in ServerHello.random) seems like the logical approach to me. There were some concerns about an active attack that can distinguish acceptance from rejection, but that's still much better than a cleartext signal.
To clarify, I think any of (1-3) would avoid the QUIC issues. The deployment concerns of (2) worry me, and the extent to which (1) sticks out makes me a little uncomfortable. I'm interested in exploring option (3).
Here's a suggestion for Option (3), worked out by me and some folks at Cloudflare. (Others might have had a similar idea, I don't mean to take credit!) The indication is a pseudorandom value output by the HPKE state. (If `ech` is not accepted, then a random value is used instead.)
Let `context` denote the HPKE state shared by the client and server when `ech` is accepted. Its output can be treated as pseudorandom, e.g., `context.Export("tls13-ech-hrr-key", 16)` in Section 7.1 is treated as being indistinguishable from random.
Protocol flow:
- […] `ech`, it uses the inner CH; and if the server rejects or does not support `ech`, then it uses the outer CH. If the server accepts, then it adds an `ech` extension to its SH with `context.Export("tls13-ech-accept", 16)` as the value; if the server rejects, then it adds an `ech` extension to its SH with a random, 16-byte string and an "ech_retry" extension to its EE with the updated `ech` configuration; if the server does not support `ech`, then it proceeds as normal, but MAY mimic `ech` rejection.
- […] `ech` extension with payload `context.Export("tls13-ech-accept", 16)`, then it proceeds as if the inner CH was used; otherwise it proceeds as if the outer CH was used, updating its `ech` configuration if applicable.
Pros
Cons
- […] `ech` is used as described in Section 7.4.)
Spec changes: Semantics of `ech` extension changes; adds `ech_retry` (i.e., `encrypted_client_hello_retry`) extension.
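As a rough illustration of the flow above, here is a minimal Python sketch of how the server could pick the 16-byte `ech` payload and how the client could check it. `ToyContext` is a hypothetical stand-in for the shared HPKE state: it uses a truncated HMAC as a toy exporter and is not real HPKE. The function names are likewise assumptions made for this example, not anything defined by the draft.

```python
import hashlib
import hmac
import os
from typing import Optional

TAG_LEN = 16
ACCEPT_LABEL = b"tls13-ech-accept"  # label taken from the proposal above


class ToyContext:
    """Hypothetical stand-in for the shared HPKE context (NOT real HPKE)."""

    def __init__(self, exporter_secret: bytes) -> None:
        self._secret = exporter_secret

    def export(self, label: bytes, length: int) -> bytes:
        # Toy exporter: truncated HMAC-SHA256 over the label.
        return hmac.new(self._secret, label, hashlib.sha256).digest()[:length]


def server_ech_payload(context: Optional[ToyContext]) -> bytes:
    """Value the server places in the `ech` extension of its SH."""
    if context is not None:
        # ECH accepted: pseudorandom value derived from the shared state.
        return context.export(ACCEPT_LABEL, TAG_LEN)
    # ECH rejected (or a server mimicking rejection): fresh random bytes.
    return os.urandom(TAG_LEN)


def client_inner_ch_used(context: ToyContext, payload: bytes) -> bool:
    """True if the SH payload matches the value the client expects on acceptance."""
    expected = context.export(ACCEPT_LABEL, TAG_LEN)
    return hmac.compare_digest(expected, payload)
```

In both cases the payload is 16 bytes with no visible structure, so an observer without the HPKE state sees the same thing for acceptance and rejection.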
(FWIW, I think the ServerHello.Random trick is due to @davidben or @dvorak42)
Weighing in: my vote goes to something like (3) as a means of not sticking out.
Before I even finished reading the opening post, I was thinking (3), so that seems like an obvious win. What this does is signal support for ECH at a server, but it doesn't indicate anything more. It's an expensive way to signal a single bit, but that's not terrible.
Note that this can be used to indicate support in-principle without any config, because servers (or stacks) that want to join the crowd can always produce this random string. That would increase the number of servers that appear to support ECH, at very little cost.
I'm pleasantly surprised by how much consensus there is for Option (3). (BTW, kudos to the HPKE authors for designing a really nice API!) Does anyone have comments on the suggestion for Option (3) above?
Signalling that the server supports ECH in cleartext seems like a significant loss to me, and I think we can avoid it. For example, if the first ~8 bytes of ServerHello.random were replaced by a MAC of the rest of the ServerHello, keyed from the HPKE context, that would be a tamperproof signal that the inner ClientHello was used.
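To make the idea concrete, here is a minimal Python sketch of replacing the first 8 bytes of ServerHello.random with a truncated MAC. It assumes the MAC key has already been derived from the HPKE context (how it is derived is not specified in this thread), and it treats `rest_of_server_hello` as an opaque, hypothetical serialization of the remaining ServerHello fields. The function names are made up for the example.

```python
import hashlib
import hmac
import os

MAC_LEN = 8      # "the first ~8 bytes" of ServerHello.random
RANDOM_LEN = 32  # length of ServerHello.random in TLS 1.3


def _tag(key: bytes, random_tail: bytes, rest_of_server_hello: bytes) -> bytes:
    # Truncated HMAC-SHA256 over everything except the 8 bytes being replaced.
    mac = hmac.new(key, random_tail + rest_of_server_hello, hashlib.sha256)
    return mac.digest()[:MAC_LEN]


def seal_server_random(key: bytes, rest_of_server_hello: bytes) -> bytes:
    """Server side: build a 32-byte random whose first 8 bytes are the MAC."""
    random_tail = os.urandom(RANDOM_LEN - MAC_LEN)
    return _tag(key, random_tail, rest_of_server_hello) + random_tail


def inner_ch_confirmed(key: bytes, server_random: bytes,
                       rest_of_server_hello: bytes) -> bool:
    """Client side: recompute the MAC and compare in constant time."""
    random_tail = server_random[MAC_LEN:]
    expected = _tag(key, random_tail, rest_of_server_hello)
    return hmac.compare_digest(server_random[:MAC_LEN], expected)
```

Because the MAC covers the rest of the ServerHello, a change to those fields or a different key makes the check fail, which is the tamper-evidence property the comment refers to. Whether this is actually safe is a separate question that a sketch like this does not answer.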
> Signalling that the server supports ECH in cleartext seems like a significant loss to me, and I think we can avoid it. For example, if the first ~8 bytes of ServerHello.random were replaced by a MAC of the rest of the ServerHello, keyed from the HPKE context, that would be a tamperproof signal that the inner ClientHello was used.
This would require security analysis to be sure it's actually safe.
But beyond that, what type of adversary are you considering?
Of course, if every stack produced the 16 byte extension in ServerHello, how is that materially different than 8 bytes embedded in ServerHello.random?
Overloading the SH.random this way is likely to violate assumptions made in existing security analyses for TLS 1.3. @bemasc's suggestion might turn out to be OK, but it would be safer to stick this in our own extension. That point notwithstanding, I'm worried about the broader precedent this could set, since overloading the semantics of CH.random and SH.random got us into trouble in earlier versions of TLS.
I don't know that this is necessarily the case, but unless we need to, avoiding more use of those bits is desirable. We could run out, and with 32 bytes, that's saying something.
I think what's clear is that the SH.random trick requires analysis, whereas the SH extension variant does not. Right?
I think they both need analysis, but the SH.random trick is much more invasive and likely to break things. In addition, if more "users" of the SH.random come along, then we would need to vet the interaction of our extension with theirs. (As @martinthomson points out.)
+1 to option 3, and I tend to think that use of a SH extension is not a big concern.
It's correct that the extension might indicate the use of ECH. But from the perspective of a middlebox, that would always be the case when the client sends an ECH extension. Also, there would be other ways to determine if a large-scale server supports ECH (note: ECH is about hiding a tree in the wood, so it's about the cost of finding such woods). Therefore, in practice there's marginal benefit in making the server support signal indistinguishable.
From @cjpatton
> I think they both need analysis, but the SH.random trick is much more invasive and likely to break things. In addition, if more "users" of the SH.random come along, then we would need to vet the interaction of our extension with theirs. (As @martinthomson points out.)
I agree that analysis is needed, but I think using 8 bytes of SH.random is not a weird hack. It's at least a well-understood hack, because it's nicely parallel to the downgrade sentinel, and they never coexist.
From @kazuho
> It's correct that the extension might indicate the use of ECH. But from the perspective of a middlebox, that would always be the case when the client sends an ECH extension.
TLS 1.2 middleboxes frequently take action based on the certificate, so ServerHello extensions seem likely to be used as well. If there's a ServerHello extension, I expect that some firewall vendor will offer a checkbox labeled "Block Encrypted ClientHello" based on this extension, in the "Security" section, and some admins will turn it on without understanding what it does. If the local network is normally used to access a small set of services, and none of them support ECH yet, then this will appear to work fine, perhaps for years. Then, if one of those services tries to enable ECH, they'll get angry phone calls from customers who can no longer access the service. For them, ECH will be ossified.
For QUIC, I agree that a visible extension is OK, since it's not too late to get it into 1.0. For TLS, I worry that it is too late.
@bemasc, given that servers can respond with rejection even without actual ECH support, would your concern be alleviated if some servers started rejecting the GREASE ECH extensions in the near future?
Hi thread, since there's largely consensus here, I'm going to start working on a PR for Option (3). I'll post it here when it's ready. Thanks for your input!
@MikeBishop Sure, the sooner many servers become "ECH-aware", the better. However, I expect that conservative institutional services will be very slow to update. In the extreme case, if a network is only used to access one service, then the broader ecosystem has limited direct impact.
I don't mean to claim that this ossification is inevitable, but I'd prefer to reduce the risk if we can find a reasonable alternative.
@cjpatton Regarding your second-extension proposal above, it seems to me that there are two options here. In the one you wrote ("ech_retry"), the second extension (empty from the client) is paired with the retry configs. Syntactically, this makes the retry functionality optional: a client that doesn't support retry could omit the "ech_retry" extension.
An alternative would be to pair the retries with the "ech" extension (as in the current draft), and pair the context tag response with a new empty extension. I would call it "context_tag". Syntactically, this makes the context tag functionality optional: clients that support trial decryption could omit it.
I think the "context_tag" arrangement is preferable. It's not explicitly connected to ECH, which I think makes it more likely to be implemented and less likely to be blocked. In principle, it could be used for any future situation where the cryptographic context is ambiguous. It would also give clients the option to use trial decryption in TLS/TCP, and avoid it for QUIC. (Support would be mandatory for ECH and QUIC servers, but optional for clients.)
@bemasc, this sounds like an alternative to Option (3) that makes the explicit indication of acceptance optional. Just so we're on the same page, we think this is what you mean (@chris-wood and I really like this idea, if this is in fact what you intend):
Protocol flow:
- […] `ech`, it uses the inner CH; and if the server rejects or does not support `ech`, then it uses the outer CH.
- […] `ech`:
  - […] `ech_context_tag` extension to its SH with `context.Export("tls13-ech-accept", 16)` as the value.
  - […] `ech_context_tag` extension to its SH with a random, 16-byte string.
  - […] `ech` extension to its EE with the updated `ech` configuration.
- […] `ech`, then it proceeds as normal, but MAY mimic `ech` rejection.
- […] `ech_context_tag` extension with payload `context.Export("tls13-ech-accept", 16)`, then it proceeds as if the inner CH was used; otherwise it proceeds as if the outer CH was used, updating its `ech` configuration if applicable.
I think that's about what I mean, but I'm not clear on what you're saying the client would send. Here's what I was thinking:
- […] `ech` extension is not changed at all from the current draft.
- […] `context_tag`. In the ClientHello, it is empty. The response (in the ServerHello) contains 16 random bytes by default. If the server is using an ECH context, the value is `context.Export(<constant>, 16)` instead.
- […] `ech` and `context_tag`. The client should include `context_tag` in all their ClientHellos (inner and outer, ECH and GREASE) or none, for a given protocol (TCP or QUIC).
- […] `ech` must also support `context_tag`. All other servers should support `context_tag` (which is trivial).
- […] `context_tag`. (Possibly not something we can specify in this draft.)
This seems fine, except that it doesn't make sense to offer the `context_tag` extension without the `ech` extension, since the `context_tag` response is derived from the HPKE state.
I'm suggesting that the client "requests" a context tag in its `ech` extension. In particular, there's a flag in the extension that is "true" if it requests a tag and "false" otherwise. Doing this in a separate extension is fine, but you wouldn't want to offer that extension without also offering `ech`.
> I'm suggesting that the client "requests" a context tag in its ech extension.
I'm not sure this is allowed:
> Implementations MUST NOT send extension responses if the remote endpoint did not send the corresponding extension requests... Upon receiving such an extension, an endpoint MUST abort the handshake with an "unsupported_extension" alert.
I also think this formulation is clearer from the perspective of a server that does not implement ECH.
BTW, here's a variation that's even simpler, and might work better with split mode:
- […] `context_tag` is empty, the server responds with 16 random bytes.
- […] `context_tag` in the outer ClientHello, and one containing 16 random bytes in the inner ClientHello.
> I also think this formulation is clearer from the perspective of a server that does not implement ECH.
Yup, I agree! This will be what the PR does.
PR is underway, I just need to revise the client and server behavior. Should be done tomorrow!
Another possibility is that in the ECH message the client sends a FLAGS extension with a bunch of bits set, and the server responds with a FLAGS extension that has one of those bits set. The plaintext CH could have a superset of the ECH flags.
@richsalz Right, that would be the flags-encoded version of option (1). It does, however, stick out.
I thought my "plaintext CH could have a superset of the ECH flags" handled the sticking out part.
Oh I see. Sorry, I misunderstood. Though that seems to also stick out: you can tell by just checking the flags for the superset, etc., rule or however we decide to encode it.
I'm not pushing on this very hard, but if you grease flag-bits in the CH and pick one of those bits in the ECH. Maybe that doesn't work, so I'm willing to let this drop.
My model of the implementation here is that the server is made up of two components in a stack:
- ECH-unwrapping portion of the server
- Backend service
These two components communicate over the network, or in the degenerate case, are in the same machine.
My understanding is that Option (1) (a) requires there to be signaling between the front-end and the back-end, and (b) requires changes to the backend service to support ECH and the signaling.
This additional complexity seems hard to justify and could hinder many deployment scenarios (especially ones we haven't thought of yet). What is the proposed mechanism for signaling whether an inner client hello was used or an outer client hello? This eliminates the ability to use existing TLS stacks without modification, and requires supporting a signaling layer on top. Option (1) seems fraught.
It seems like Option (3) does not require signaling (signaling would come in-handshake from the new extension), but it does require backend changes to implement the new extension. This wouldn't work in situations where a third party doesn't have any motivation to implement ECH-related extensions, a tough loss, but not fatal. The problem here is that the combination of the presence of the client extension (signaling client support for ECH) and the server extensions (signaling server support for ECH) is going to make ECH-enabled traffic stick out from non-ECH-enabled traffic, even if dummy ECHs are sent 100% of the time from supporting clients. The ossification risk is huge.
Option (2) is my strong preference because it doesn't require backend changes and doesn't stick out unless ECH fails to be decrypted (which is a degenerate case).
It seems to me that we are trading off between three things:
As I understand the situation we have (Y is good in each column)
| | Can't detect ECH-in-use | Split-Mode | Easy Client Impl |
|---|---|---|---|
| Current | Y | Y | N |
| Option 1 | N | N | Y |
| Option 2 | Y* | Y | ? |
| Option 3 | Y | N | Y |
Unless I misunderstand, Ben's proposals are just different spellings of (3).
I don't think anyone likes 1 so that leaves us with Current, 2, and 3.
I'm not that concerned about the "implementation" issues of Current, but I am quite concerned about the QUIC/DTLS issues, so I think we should strive to avoid that. That leaves us with 2 and 3.
As I understand it, the complexity issue is that if the server has turned off support for ECH entirely, then the connection will hard fail. However, with the current state of the spec I believe that this is effectively the state anyway. Here's the relevant text:
> Note that authenticating a connection for the public name does not authenticate it for the origin. The TLS implementation MUST NOT report such connections as successful to the application. It additionally MUST ignore all session tickets and session IDs presented by the server. These connections are only used to trigger retries, as described in {{handle-server-response}}. This may be implemented, for instance, by reporting a failed connection with a dedicated error code.
IOW, we don't try to recover for cases where the client-facing server has just forgotten about ECH but rather where it has forgotten the ECH key. So, hard fail is actually not a problem here. I'm similarly not worried about signaling that the server is ECH-capable (especially in this case where it's actually not). So, I think my ? under "Easy Client Impl" turns out to be a Y.
With that said, I do find Option 3 aesthetically cleaner, though I'd struggle a bit to explain why. There is an obvious appeal to not having any public signaling in terms of being able to analyze it, though I'm not sure how strong an argument that is. I'm also not that worried about the sticking out piece, as I would expect that we can rapidly make a lot of servers accept and pretend to respond to grease-ECH, which we would want to do in any case.
I'd like to hear from some other people about why they prefer (3) to (2) and whether there is some split mode-compatible variant of (3).
Option (2) means ECH becomes much riskier to deploy. In options (0), (1), or (3), a service can advertise ECH in the DNS and not largely panic about inconsistencies between hard-to-predict DNS cache behavior, and hard-to-predict future or present rollout decisions on individual servers.
The only hard commitment is that the servers advertised in DNS are able to speak on behalf of the public name. As long as that's true, the client has an authenticated signal to recover from any DNS / server mismatches. In particular, if there is a problem and the service needs to roll back ECH (or if the service was in the process of rolling out ECH and missed a few spots initially), the existing TLS server behavior will be correctly interpreted by the client as an authenticated rollback, and the client can recover. This is true for every TLS feature I can think of to date: it is always safe to roll back. The exception that proves the rule is 0-RTT, which got a note in the spec (RFC 8446, Appendix D.3) to describe client behavior to restore safety.
Option (2) breaks this invariant.
@davidben. Sorry, I had misread the specification and I agree with you. To recap for those following at home, the client's algorithm is presently:
1. Trial decrypt as if ECH accepted. If success, then proceed.
2. Trial decrypt as if ECH rejected. If failure, then abort.
3. If retry_keys is present, then restart with ECHConfig == retry_keys
4. If retry_keys is absent, then retry without ECH
It's point 4 that is relevant here. I had read the spec as requiring a hard failure, but it in fact recovers. That pushes me more towards (3).
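The four steps can be written down as a small decision function. This is only a sketch: the two booleans stand in for the outcome of the actual trial decryptions, and the `Outcome` names and the `client_decide` signature are hypothetical, chosen for this example.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PROCEED_WITH_INNER_CH = 1
    ABORT = 2
    RETRY_WITH_RETRY_KEYS = 3
    RETRY_WITHOUT_ECH = 4


def client_decide(decrypts_as_accepted: bool,
                  decrypts_as_rejected: bool,
                  retry_keys: Optional[bytes]) -> Outcome:
    # Step 1: trial decrypt as if ECH was accepted.
    if decrypts_as_accepted:
        return Outcome.PROCEED_WITH_INNER_CH
    # Step 2: trial decrypt as if ECH was rejected; failure is fatal.
    if not decrypts_as_rejected:
        return Outcome.ABORT
    # Step 3: the server supplied retry_keys, so restart with them.
    if retry_keys is not None:
        return Outcome.RETRY_WITH_RETRY_KEYS
    # Step 4: no retry_keys, so retry without ECH (the recovery path).
    return Outcome.RETRY_WITHOUT_ECH
```

The last branch is the one that matters for rollback: a server that has dropped ECH entirely still leads to a retry rather than a hard failure.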
> I'd like to hear ... whether there is some split mode-compatible variant of (3).
I believe I described one above. I'll rephrase, in case it was unclear:
- […] `context_tag` is an extension that appears in ClientHello and ServerHello.
- […] `context_tag` with empty contents, the server replies with a `context_tag` containing 16 random bytes.
- […] `context_tag` echoes the contents of the ClientHello's `context_tag`.
- […] `context_tag` in the outer ClientHello, and a `context_tag` containing 16 random bytes in the inner ClientHello.
@bemasc this is only partially split-mode compatible in that it requires changing every origin server. It just doesn't require coordination between the servers.
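For reference, the echo-style `context_tag` variant discussed above can be sketched in a few lines of Python. The sketch assumes the outer ClientHello carries an empty `context_tag` (my reading of the list) and that the server does no validation beyond the empty check. All function names are hypothetical.

```python
import hmac
import os
from typing import Tuple

TAG_LEN = 16


def client_context_tags() -> Tuple[bytes, bytes]:
    """Return (outer_tag, inner_tag) for one connection attempt."""
    outer_tag = b""                  # assumed: empty in the outer ClientHello
    inner_tag = os.urandom(TAG_LEN)  # 16 random bytes in the inner ClientHello
    return outer_tag, inner_tag


def server_context_tag(client_tag: bytes) -> bytes:
    """Server reply for whichever ClientHello it actually processed."""
    if not client_tag:
        # Empty request: reply with 16 fresh random bytes.
        return os.urandom(TAG_LEN)
    # Otherwise echo the client's contents unchanged.
    return client_tag


def inner_ch_was_used(inner_tag: bytes, server_tag: bytes) -> bool:
    """Client check: an echo of the inner tag means the inner CH was used."""
    return hmac.compare_digest(inner_tag, server_tag)
```

Since the inner tag travels only inside the encrypted ClientHello, an observer sees 16 opaque bytes in the ServerHello either way. The origin server needs no HPKE state, only the echo rule, which is why every origin server has to implement the extension but none has to coordinate with the ECH-terminating server.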
True! For any extension implementing Option 3, I think we want those changes anyway. If a ServerHello extension is only implemented by ECH terminating servers, then its presence distinguishes real ECH from GREASE. If it's widely implemented, then that signal is diminished.
Alternatively, we could say that the new extension is only for QUIC, and make TLS/TCP stick to trial decryption.
Sure. I'm comparing it to (2).
Hi all, see PR #283 for our proposal for Option (3), with @bemasc's improvements incorporated. The main points:
I don't understand why this is better than (3). Won't every client just send ech_confirm, in which case this is isomorphic to 3?
Option (0) sticks out less than option (3), which is why a client might opt to not send "ech_confirm".
Hmm ... I think there should be a way to resolve the deployment issues between (2) and (3). Will post on Monday.
(3') This seems strictly worse than either (0) or (3). We'll have an odd mix of people doing one or the other and every server will have to do and test both.
@ekr:
> (3') This seems strictly worse than either (0) or (3). We'll have an odd mix of people doing one or the other and every server will have to do and test both.
You're suggesting that confirmation SHOULD NOT be optional, correct? I'm Ok with this, but it seems to me that it's not especially complicated to implement this correctly on the server side. The hard bit is the client, since it has to do trial decryption.
@davidben:
> Option (2) means ECH becomes much riskier to deploy.
I'd like to drill down on the problem of rolling back ECH. The essence of the problem is that an ECH server that advertised a configuration in the past must support ECH for as long as that configuration is valid. What are some "bad" events that may lead to this contract being violated?
@cjpatton
I agree that the first issue is fine by option (2). (Though probably the mitigation would be to roll out a new key. One hopes that pipeline is already built out and regularly exercised by routine key rotation.)
I'm worried about the second one, but I think the characterization is too simple. TLS implementations are part of a complex system, both within the service and on the internet as a whole. Complex systems break unpredictably. Maybe it's a TLS bug. Maybe ECH inadvertently broke some assumption in some other part of the stack. Maybe some large client had a bug that only triggered due to some quirk of the server. Maybe some printer happened to be using the code point and now breaks. Maybe it had nothing to do with TLS at all, but some other concurrent server change broke and the entire release needs to be rolled back.
Anything which interferes with the default response (rollback to a known good configuration) is expensive and risky. This risk needs to be communicated across a long game of telephone from...
This is not practical, especially if we want ECH to be widely adopted.
To the alternatives you list, when things go wrong, the priority is to get the service working again. Leaving it broken until the ECH config expires is thus not great. Moreover, expiry itself is a property of a complex system (DNS), so it may not be clear when it actually expires. Mandating a client retry on decryption failure is more plausible (compatible with rollback), but it relies on caching properties of the DNS, which is where much of the deployment mismatch risk comes from in the first place.
> Mandating a client retry on decryption failure is more plausible (compatible with rollback), but it relies on caching properties of the DNS, which is where much of the deployment mismatch risk comes from in the first place.
Can you elaborate on this issue a bit more? Suppose the DNS and ECH provider are the same entity, and suppose that entity can synchronize the DNS response with the rollback. I guess one potential pitfall is that the client could use a DNS response cached by its operating system? Any others?
> Suppose the DNS and ECH provider are the same entity,
That's a simplifying assumption and doesn't always hold. Even within an enterprise, it's not uncommon for the DNS folks to be a separate group from those running the webservers.
In the current spec, the server provides no indication of whether the inner or outer ClientHello (CH) was used. This means the client must do trial decryption to make this determination, which creates complexity and potentially raises security concerns. As such, it would be useful to explore possible alternatives. In order to drive the discussion, I'll provide a few simple alternatives below, which we can refine as folks provide feedback. (The current spec, draft-07, is listed as option (0) for comparison.)
Besides implementation complexity, one of our design considerations is ensuring that middleboxes don't ossify on ECH. As such, indication of ECH usage should "stick out" (see draft-ietf-tls-sni-encryption, Sec 3.4) as little as possible.
For our purposes, "do not stick out" means a middlebox that observes connections between the client and the client-facing server can't distinguish between real ECH and "dummy" ECH (i.e., a "GREASEd" extension, as described in Section 7.4). We assume the middlebox doesn't know the ECH configuration or the public-facing name. (Note that this rules out adversaries such as the GFW, which can actively probe to discover this information.)
Option (0): Do not indicate usage
Protocol flow:
- […] `ech` (i.e., `encrypted_client_hello`), it uses the inner CH; and if the server rejects or does not support ECH, then it uses the outer CH. It proceeds with the handshake as normal, except that in case of rejection, it sends an `ech` extension in its EE with the updated `ech` configuration.
- […] `ech` configuration if applicable.
Pros
Cons
Spec changes: None.
Option (1): Publicly indicate acceptance
Protocol flow:
- […] `ech`, it uses the inner CH; and if the server rejects or does not support `ech`, then it uses the outer CH. If the server accepts, then it adds an empty `ech` extension to its SH; if the server rejects, then it adds an `ech` extension to its EE with the updated `ech` configuration; and if the server doesn't support `ech`, then it proceeds as normal.
- […] `ech` extension, then the client proceeds as normal, assuming the inner CH was used; otherwise, the client proceeds as if the outer CH was used, updating its `ech` configuration if applicable.
Pros
Cons
Spec changes: Semantics of the `ech` extension changes; changes are needed to accommodate "Split Mode".
Option (2): Publicly indicate rejection
Protocol flow:
- […] `ech`, it uses the inner CH; and if the server rejects or does not support `ech`, then it uses the outer CH. If the server accepts or does not support `ech`, then it proceeds as usual; and if the server rejects, then it adds an `ech` extension to its SH with the updated `ech` configuration.
- […] `ech` extension, then the client proceeds as if the outer CH was used and updates its `ech` configuration; otherwise, the client proceeds as if the inner CH was used. Decryption failure indicates either that the server does not support `ech` (i.e., outer CH was used) or the connection is under attack.
Pros
Cons
- […] `ech` to a server that has turned off support for the extension, then the connection will fail hard, as the client assumes lack of signal means that `ech` was accepted. (We could ameliorate this problem, at the cost of added complexity on the client side implementation.)
Spec changes: Semantics of the `ech` extension changes; `ech` configuration update is sent in the clear. (We could avoid this by sending the new configuration in a new extension in the EE.)
Option (3): Privately indicate acceptance
It may be worth considering an alternative to Option (1) that doesn't stick out as much. Namely, it's possible to make `ech` acceptance in the SH indistinguishable from `ech` rejection.