Usage indication: alternatives to trial decryption

cjpatton commented 4 years ago

In the current spec, the server provides no indication of whether the inner or outer ClientHello (CH) was used. This means the client must do trial decryption to make this determination, which creates complexity and potentially raises security concerns. As such, it would be useful to explore possible alternatives. In order to drive the discussion, I'll provide a few simple alternatives below, which we can refine as folks provide feedback. (The current spec, draft-07, is listed as option (0) for comparison.)

Besides implementation complexity, one of our design considerations is ensuring that middleboxes don't ossify on ECH. As such, indication of ECH usage should "stick out" (see draft-ietf-tls-sni-encryption, Sec 3.4) as little as possible.

For our purposes, "do not stick out" means a middlebox that observes connections between the client and the client-facing server can't distinguish between real ECH and "dummy" ECH (i.e., a "GREASEd" extension, as described in Section 7.4). We assume the middlebox doesn't know the ECH configuration or the public-facing name. (Note that this rules out adversaries such as the GFW, which can actively probe to discover this information.)

Option (0): Do not indicate usage

Protocol flow:

On input of the client's outer CH. If the server accepts ech (i.e., encrypted_client_hello), it uses the inner CH; and if the server rejects or does not support ECH, then it uses the outer CH. It proceeds with the handshake as normal, except that in case of rejection, it sends an ech extension in its EE with the updated ech configuration.
On input of the server's SH, EE, …, Finished. The client determines whether the inner CH or outer CH was used by computing the decryption key for each scenario and attempting to decrypt EE. It then proceeds with the handshake as usual, updating its ech configuration if applicable.
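A minimal sketch of the client-side trial decryption described above. The names `toy_seal`, `toy_open`, and `trial_decrypt` are hypothetical, and the HMAC-tagged "record" is only a stand-in for real AEAD record protection (it gives no confidentiality); the point is the control flow of trying one candidate key and then the other.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

TAG_LEN = 32  # HMAC-SHA256 output length


def toy_seal(key: bytes, plaintext: bytes) -> bytes:
    """Toy stand-in for AEAD protection: plaintext followed by an HMAC tag."""
    return plaintext + hmac.new(key, plaintext, hashlib.sha256).digest()


def toy_open(key: bytes, record: bytes) -> Optional[bytes]:
    """Return the plaintext if the tag verifies under `key`, else None."""
    if len(record) < TAG_LEN:
        return None
    plaintext, tag = record[:-TAG_LEN], record[-TAG_LEN:]
    expected = hmac.new(key, plaintext, hashlib.sha256).digest()
    return plaintext if hmac.compare_digest(tag, expected) else None


def trial_decrypt(inner_key: bytes, outer_key: bytes, record: bytes) -> Tuple[str, bytes]:
    """Decide which ClientHello the server used by trying both candidate keys."""
    plaintext = toy_open(inner_key, record)
    if plaintext is not None:
        return "inner", plaintext
    plaintext = toy_open(outer_key, record)
    if plaintext is not None:
        return "outer", plaintext
    raise ValueError("EE does not decrypt under either candidate key")


# Example: the server rejected ECH, so EE is protected under the outer-CH key.
inner_key, outer_key = os.urandom(32), os.urandom(32)
ee = toy_seal(outer_key, b"EncryptedExtensions")
which, _ = trial_decrypt(inner_key, outer_key, ee)
print(which)
```

In a real stack the two candidate keys come from two separate key schedules, which is why this logic ends up spanning the handshake and record layers.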

Pros

Sticks out the least.
Is the least complex for servers to implement (same for Option (2)).

Cons

Is the most complex for clients to implement.

Spec changes: None.

Option (1): Publicly indicate acceptance

Protocol flow:

On input of the client's outer CH. If the server accepts ech, it uses the inner CH; and if the server rejects or does not support ech, then it uses the outer CH. If the server accepts, then it adds an empty ech extension to its SH; if the server rejects, then it adds an ech extension to its EE with the updated ech configuration; and if the server doesn't support ech, then it proceeds as normal.
On input of the server's SH, EE, …, Finished. If the SH has the ech extension, then the client proceeds as normal, assuming the inner CH was used; otherwise, the client proceeds as if the outer CH was used, updating its ech configuration if applicable.

Pros

Is the least complex for clients to implement.

Cons

Breaks Split Mode: the backend server must indicate acceptance in its SH but does not know whether the client-facing server accepted or not. (We could ameliorate this problem by adding an indication of acceptance to the inner CH.)
Sticks out the most. (See Option (3).)

Spec changes: Semantics of the ech extension changes; changes are needed to accommodate "Split Mode".

Option (2): Publicly indicate rejection

Protocol flow:

On input of the client's outer CH. If the server accepts ech, it uses the inner CH; and if the server rejects or does not support ech, then it uses the outer CH. If the server accepts or does not support ech, then it proceeds as usual; and if the server rejects, then it adds an ech extension to its SH with the updated ech configuration.
On input of the server's SH, EE, …, Finished. If the SH has the ech extension, then the client proceeds as if the outer CH was used and updates its ech configuration; otherwise, the client proceeds as if the inner CH was used. Decryption failure indicates either that the server does not support ech (i.e., outer CH was used) or the connection is under attack.
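A rough sketch of the client's decision under Option (2), assuming the SH extensions are available as a dictionary. The function names and the `ECH` key are hypothetical. It also illustrates the case where a server without ECH support sends no signal and the client wrongly assumes acceptance.

```python
from typing import Dict, Optional, Tuple

ECH = "encrypted_client_hello"  # used only as a dictionary key in this sketch


def option2_client(sh_extensions: Dict[str, bytes]) -> Tuple[str, Optional[bytes]]:
    """Return (ClientHello the client assumes was used, retry config or None)."""
    if ECH in sh_extensions:
        # Explicit rejection: fall back to the outer CH and adopt the new config.
        return "outer", sh_extensions[ECH]
    # No signal: the client assumes the inner CH was used.
    return "inner", None


def outcome(server_used: str, sh_extensions: Dict[str, bytes]) -> str:
    """Compare the client's assumption with what the server actually did."""
    assumed, _ = option2_client(sh_extensions)
    return "ok" if assumed == server_used else "decryption failure"


print(outcome("inner", {}))                    # server accepted ECH -> ok
print(outcome("outer", {ECH: b"new-config"}))  # server rejected ECH -> ok
print(outcome("outer", {}))                    # server lacks ECH support -> decryption failure
```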

Pros

Is the least complex for the server to implement (same as Option (0)).

Cons

Sticks out, but only on rejection.
Complicates deployment: if the client offers ech to a server that has turned off support for the extension, then the connection will fail hard, as the client assumes lack of signal means that ech was accepted. (We could ameliorate this problem, at the cost of added complexity on the client side implementation.)

Spec changes: Semantics of the ech extension changes; ech configuration update is sent in the clear. (We could avoid this by sending the new configuration in a new extension in the EE.)

Option (3): Privately indicate acceptance

It may be worth considering an alternative to Option (1) that doesn't stick out as much. Namely, it's possible to make ech acceptance in the SH indistinguishable from ech rejection.

davidben commented 4 years ago

Thanks for writing this up! We've also been pondering this issue in the context of QUIC. In particular, the status quo (option (0)) requires the handshake and record layer to coordinate to implement the trial decryption. This would need to happen separately in TLS, DTLS, and QUIC.

This is especially a nuisance for QUIC as it will typically cross the TLS library’s public API. TLS must configure two different keys for QUIC, then QUIC must do some trial decryption, and it must report back to TLS which keys were used. Public APIs and changes in them require coordination across groups, so ideally such interfaces stay as simple as possible. As in #264, I think an option that stays entirely in the handshake would apply to QUIC better.

cjpatton commented 4 years ago

That (option (1) or something like it) would be my preference as well.

bemasc commented 4 years ago

I would prefer Option 3, or something like it, if trial decryption is not acceptable. Martin Thomson's proposal from March (placing a signal in ServerHello.random) seems like the logical approach to me. There were some concerns about an active attack that can distinguish acceptance from rejection, but that's still much better than a cleartext signal.

davidben commented 4 years ago

To clarify, I think any of (1-3) would avoid the QUIC issues. The deployment concerns of (2) worry me, and the extent to which (1) sticks out makes me a little uncomfortable. I'm interested in exploring option (3).

cjpatton commented 4 years ago

Here's a suggestion for Option (3), worked out by me and some folks at Cloudflare. (Others might have had a similar idea, I don't mean to take credit!) The indication is a pseudorandom value output by the HPKE state. (If ech is not accepted, then a random value is used instead.)

Suggestion for Option (3)

Let context denote the HPKE state shared by the client and server when ech is accepted. Its output can be treated as pseudorandom, e.g., context.Export("tls13-ech-hrr-key", 16) in Section 7.1 is treated as being indistinguishable from random.

Protocol flow:

On input of the client's outer CH. If the server accepts ech, it uses the inner CH; and if the server rejects or does not support ech, then it uses the outer CH. If the server accepts, then it adds an ech extension to its SH with context.Export("tls13-ech-accept", 16) as the value; if the server rejects, then it adds an ech extension to its SH with a random, 16-byte string and an "ech_retry" extension to its EE with the updated ech configuration; if the server does not support ech, then it proceeds as normal, but MAY mimic ech rejection.
On input of the server's SH, EE, …, Finished. If the SH has the ech extension with payload context.Export("tls13-ech-accept", 16), then it proceeds as if the inner CH was used; otherwise it proceeds as if the outer CH was used, updating its ech configuration if applicable.
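The flow above could be sketched roughly as follows. `toy_export` is a hypothetical stand-in for context.Export built from HMAC-SHA256 (it is not the HPKE exporter), and `exporter_secret` is an assumed name for whatever secret the HPKE context holds; only the accept/reject logic is meant to be illustrative.

```python
import hashlib
import hmac
import os
from typing import Optional

TAG_LEN = 16
LABEL = b"tls13-ech-accept"


def toy_export(exporter_secret: bytes, label: bytes, length: int) -> bytes:
    """Toy stand-in for context.Export(label, length); not the real HPKE KDF."""
    return hmac.new(exporter_secret, label, hashlib.sha256).digest()[:length]


def server_ech_value(exporter_secret: Optional[bytes]) -> bytes:
    """Value of the SH ech extension; None means ECH was rejected (or mimicked)."""
    if exporter_secret is not None:
        return toy_export(exporter_secret, LABEL, TAG_LEN)
    return os.urandom(TAG_LEN)


def client_inner_ch_used(exporter_secret: bytes, sh_ech_value: bytes) -> bool:
    """The client recomputes the export and compares in constant time."""
    expected = toy_export(exporter_secret, LABEL, TAG_LEN)
    return hmac.compare_digest(sh_ech_value, expected)


secret = os.urandom(32)
print(client_inner_ch_used(secret, server_ech_value(secret)))  # accepted
print(client_inner_ch_used(secret, server_ech_value(None)))    # rejected
```

Because both branches produce 16 bytes that look random to anyone without the HPKE context, an observer cannot tell the two cases apart from the extension value alone.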

Pros

Similar to Option (1)

Cons

Similar to Option (1), but doesn't stick out as much. (It sticks out as much as the CH when dummy ech is used as described in Section 7.4.)

Spec changes: Semantics of ech extension changes; adds ech_retry (i.e., encrypted_client_hello_retry) extension.

chris-wood commented 4 years ago

(FWIW, I think the ServerHello.Random trick is due to @davidben or @dvorak42)

chris-wood commented 4 years ago

Weighing in: my vote goes to something like (3) as a means of not sticking out.

martinthomson commented 4 years ago

Before I even finished reading the opening post, I was thinking (3), so that seems like an obvious win. What this does is signals support for ECH at a server, but doesn't indicate anything more. It's an expensive way to signal a single bit, but that's not terrible.

Note that this can be used to indicate support in-principle without any config, because servers (or stacks) that want to join the crowd can always produce this random string. That would increase the number of servers that appear to support ECH, at very little cost.

cjpatton commented 4 years ago

I'm pleasantly surprised by how much consensus there is for Option (3). (BTW, kudos to the HPKE authors for designing a really nice API!) Does anyone have comments on the suggestion for Option (3) above?

bemasc commented 4 years ago

Signalling that the server supports ECH in cleartext seems like a significant loss to me, and I think we can avoid it. For example, if the first ~8 bytes of ServerHello.random were replaced by a MAC of the rest of the ServerHello, keyed from the HPKE context, that would be a tamperproof signal that the inner ClientHello was used.
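A minimal sketch of that idea, assuming HMAC-SHA256 truncated to 8 bytes as the MAC and assuming "the rest of the ServerHello" means the remaining 24 bytes of the random followed by the other ServerHello fields. The function names and `mac_key` (a key derived from the HPKE context) are hypothetical, and none of this has been analyzed.

```python
import hashlib
import hmac
import os
from typing import Optional

RANDOM_LEN = 32  # length of ServerHello.random
MAC_LEN = 8      # the "~8 bytes" that carry the signal


def _tag(mac_key: bytes, random_tail: bytes, rest_of_sh: bytes) -> bytes:
    """Truncated MAC over everything except the 8 bytes it replaces."""
    msg = random_tail + rest_of_sh
    return hmac.new(mac_key, msg, hashlib.sha256).digest()[:MAC_LEN]


def make_server_random(mac_key: Optional[bytes], rest_of_sh: bytes) -> bytes:
    """Build ServerHello.random; mac_key is None when the outer CH is used."""
    fresh = os.urandom(RANDOM_LEN)
    if mac_key is None:
        return fresh
    tail = fresh[MAC_LEN:]
    return _tag(mac_key, tail, rest_of_sh) + tail


def inner_ch_was_used(mac_key: bytes, sh_random: bytes, rest_of_sh: bytes) -> bool:
    """Client check: recompute the MAC and compare in constant time."""
    expected = _tag(mac_key, sh_random[MAC_LEN:], rest_of_sh)
    return hmac.compare_digest(sh_random[:MAC_LEN], expected)
```

With an 8-byte tag, a random ServerHello.random would be mistaken for acceptance with probability about 2^-64.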

chris-wood commented 4 years ago

Signalling that the server supports ECH in cleartext seems like a significant loss to me, and I think we can avoid it. For example, if the first ~8 bytes of ServerHello.random were replaced by a MAC of the rest of the ServerHello, keyed from the HPKE context, that would be a tamperproof signal that the inner ClientHello was used.

This would require security analysis to be sure it's actually safe.

But beyond that, what type of adversary are you considering?

martinthomson commented 4 years ago

Of course, if every stack produced the 16 byte extension in ServerHello, how is that materially different than 8 bytes embedded in ServerHello.random?

cjpatton commented 4 years ago

Overloading the SH.random this way is likely to violate assumptions made in existing security analyses for TLS 1.3. @bemasc's suggestion might turn out to be OK, but it would be safer to stick this in our own extension. That point notwithstanding, I'm worried about the broader precedent this could set, since overloading the semantics of CH.random and SH.random got us into trouble in earlier versions of TLS.

martinthomson commented 4 years ago

I don't know that this is necessarily the case, but unless we need to, avoiding more use of those bits is desirable. We could run out, and with 32 bytes, that's saying something.

chris-wood commented 4 years ago

I think what's clear is that the SH.random trick requires analysis, whereas the SH extension variant does not. Right?

cjpatton commented 4 years ago

I think they both need analysis, but the SH.random trick is much more invasive and likely to break things. In addition, if more "users" of the SH.random come along, then we would need to vet the interaction of our extension with theirs. (As @martinthomson points out.)

kazuho commented 4 years ago

+1 to option 3, and I tend to think that use of a SH extension is not a big concern.

It's correct that the extension might indicate the use of ECH. But from the perspective of a middlebox, that would always be the case when the client sends an ECH extension. Also, there would be other ways to determine if a large-scale server supports ECH (note: ECH is about hiding a tree in the wood, so it's about the cost of finding such woods). Therefore, in practice there's marginal benefit in making the server support signal indistinguishable.

bemasc commented 4 years ago

From @cjpatton

I think they both need analysis, but the SH.random trick is much more invasive and likely to break things. In addition, if more "users" of the SH.random come along, then we would need to vet the interaction of our extension with theirs. (As @martinthomson points out.)

I agree that analysis is needed, but I think using 8 bytes of SH.random is not a weird hack. It's at least a well-understood hack, because it's nicely parallel to the downgrade sentinel, and they never coexist.

From @kazuho

It's correct that the extension might indicate the use of ECH. But from the perspective of a middlebox, that would always be the case when the client sends an ECH extension.

TLS 1.2 middleboxes frequently take action based on the certificate, so ServerHello extensions seem likely to be used as well. If there's a ServerHello extension, I expect that some firewall vendor will offer a checkbox labeled "Block Encrypted ClientHello" based on this extension, in the "Security" section, and some admins will turn it on without understanding what it does. If the local network is normally used to access a small set of services, and none of them support ECH yet, then this will appear to work fine, perhaps for years. Then, if one of those services tries to enable ECH, they'll get angry phone calls from customers who can no longer access the service. For them, ECH will be ossified.

For QUIC, I agree that a visible extension is OK, since it's not too late to get it into 1.0. For TLS, I worry that it is too late.

MikeBishop commented 4 years ago

@bemasc, given that servers can respond with rejection even without actual ECH support, would your concern be alleviated if some servers started rejecting the GREASE ECH extensions in the near future?

cjpatton commented 4 years ago

Hi thread, since there's largely consensus here, I'm going to start working on a PR for Option (3). I'll post it here when it's ready. Thanks for your input!

bemasc commented 4 years ago

@MikeBishop Sure, the sooner many servers become "ECH-aware", the better. However, I expect that conservative institutional services will be very slow to update. In the extreme case, if a network is only used to access one service, then the broader ecosystem has limited direct impact.

I don't mean to claim that this ossification is inevitable, but I'd prefer to reduce the risk if we can find a reasonable alternative.

bemasc commented 4 years ago

@cjpatton Regarding your second-extension proposal above, it seems to me that there are two options here. In the one you wrote ("ech_retry"), the second extension (empty from the client) is paired with the retry configs. Syntactically, this makes the retry functionality optional: a client that doesn't support retry could omit the "ech_retry" extension.

An alternative would be to pair the retries with the "ech" extension (as in the current draft), and pair the context tag response with a new empty extension. I would call it "context_tag". Syntactically, this makes the context tag functionality optional: clients that support trial decryption could omit it.

I think the "context_tag" arrangement is preferable. It's not explicitly connected to ECH, which I think makes it more likely to be implemented and less likely to be blocked. In principle, it could be used for any future situation where the cryptographic context is ambiguous. It would also give clients the option to use trial decryption in TLS/TCP, and avoid it for QUIC. (Support would be mandatory for ECH and QUIC servers, but optional for clients.)

cjpatton commented 4 years ago

@bemasc, this sounds like an alternative to Option (3) that makes the explicit indication of acceptance optional. Just so we're on the same page, we think this is what you mean (@chris-wood and I really like this idea, if this is in fact what you intend):

Protocol flow:

On input of the client's outer CH. If the server accepts ech, it uses the inner CH; and if the server rejects or does not support ech, then it uses the outer CH.
- If the client requested a context tag in its ech:
  - If the server accepts, then it adds an ech_context_tag extension to its SH with context.Export("tls13-ech-accept", 16) as the value.
  - If the server rejects, then it adds an ech_context_tag extension to its SH with a random, 16-byte string.
- If the server rejects, then it adds an ech extension to its EE with the updated ech configuration.
- If the server does not support ech, then it proceeds as normal, but MAY mimic ech rejection.
On input of the server's SH, EE, …, Finished. If the client didn't request a context tag, then it proceeds as in Option (0). If the client requested a context tag and the SH has the ech_context_tag extension with payload context.Export("tls13-ech-accept", 16), then it proceeds as if the inner CH was used; otherwise it proceeds as if the outer CH was used, updating its ech configuration if applicable.

bemasc commented 4 years ago

I think that's about what I mean, but I'm not clear on what you're saying the client would send. Here's what I was thinking:

The ech extension is not changed at all from the current draft.
We define a new extension: context_tag. In the ClientHello, it is empty. The response (in the ServerHello) contains 16 random bytes by default. If the server is using an ECH context, the value is context.Export(<constant>, 16) instead.
A ClientHello may contain either or both of ech and context_tag. The client should include context_tag in all their ClientHellos (inner and outer, ECH and GREASE) or none, for a given protocol (TCP or QUIC).
Servers that implement ech must also support context_tag. All other servers should support context_tag (which is trivial).
All QUIC servers must implement context_tag. (Possibly not something we can specify in this draft.)

cjpatton commented 4 years ago

This seems fine, except that it doesn't make sense to offer the context_tag extension without the ech extension, since the context_tag response is derived from the HPKE state.

I'm suggesting that the client "requests" a context tag in its ech extension. In particular, there's a flag in the extension that is "true" if it requests a tag and "false" otherwise. Doing this in a separate extension is fine, but you wouldn't want to offer that extension without also offering ech.

bemasc commented 4 years ago

I'm suggesting that the client "requests" a context tag in its ech extension.

I'm not sure this is allowed:

Implementations MUST NOT send extension responses if the remote endpoint did not send the corresponding extension requests... Upon receiving such an extension, an endpoint MUST abort the handshake with an "unsupported_extension" alert.

I also think this formulation is clearer from the perspective of a server that does not implement ECH.

BTW, here's a variation that's even simpler, and might work better with split mode:

If the ClientHello's context_tag is empty, the server responds with 16 random bytes.
Otherwise, the server echoes the contents.
The client includes an empty context_tag in the outer ClientHello, and one containing 16 random bytes in the inner ClientHello.

cjpatton commented 4 years ago

I also think this formulation is clearer from the perspective of a server that does not implement ECH.

Yup, I agree! This will be what the PR does.

cjpatton commented 4 years ago

PR is underway, I just need to revise the client and server behavior. Should be done tomorrow!

richsalz commented 4 years ago

Another possibility is that in the ECH message the client sends a FLAGS extension with a bunch of bits set, and the server responds with a FLAGS extension that has one of those bits set. The plaintext CH could have a superset of the ECH flags.

davidben commented 4 years ago

@richsalz Right, that would be the flags-encoded version of option (1). It does, however, stick out.

richsalz commented 4 years ago

I thought my "plaintext CH could have a superset of the ECH flags" handled the sticking out part.

davidben commented 4 years ago

Oh I see. Sorry, I misunderstood. Though that seems to also stick out: you can tell by just checking the flags for the superset, etc., rule or however we decide to encode it.

richsalz commented 4 years ago

I'm not pushing on this very hard, but what if you grease flag bits in the CH and pick one of those bits in the ECH? Maybe that doesn't work, so I'm willing to let this drop.

grittygrease commented 4 years ago

My model of the implementation here is that the server is made up of two components in a stack:

ECH-unwrapping portion of the server

can broadly be thought of as the proxy frontend, think large proxy provider pointed to by the site's DNS
has access to the ECH key and the fallback key only
forwards the inner client hello to the backend service if encryption works
finishes the handshake with updated ECH config with the updated key if ECH doesn't decrypt
can be deployed on a global edge network close to eyeballs (even places where TLS termination isn't safe)

Backend service

associated with the site certificate owner, think individual dedicated host behind proxy provider
has access to private key for the certificate
standard TLS 1.3 implementation, answers inner CHs or non-ECH CHs
can be deployed in a hardened/trusted datacenter (no need to put on global edge where content isn't decrypted)
could even be something like an AWS LB or other service that has no incentive to implement anything ECH-related

These two components communicate over the network, or in the degenerate case, are in the same machine.

My understanding is that Option (1) a) requires there to be signaling between the front-end and the back-end, and b) requires changes to the backend service to support ECH and the signaling

This additional complexity seems hard to justify and could hinder many deployment scenarios (especially ones we haven't thought of yet). What is the proposed mechanism for signaling whether an inner client hello was used or an outer client hello? This eliminates the ability to use existing TLS stacks without modification, and on top of that requires supporting a signaling layer. Option (1) seems fraught.

It seems like Option (3) does not require signaling (signaling would come in-handshake from the new extension), but it does require backend changes to implement the new extension. This wouldn't work in situations where a third party doesn't have any motivation to implement ECH-related extensions, a tough loss, but not fatal. The problem here is that the combination of the presence of the client extension (signaling client support for ECH) and the server extensions (signaling server support for ECH) is going to make ECH-enabled traffic stick out from non-ECH-enabled traffic, even if dummy ECHs are sent 100% of the time from supporting clients. The ossification risk is huge.

Option (2) is my strong preference because it doesn't require backend changes and doesn't stick out unless ECH fails to be decrypted (which is a degenerate case).

ekr commented 4 years ago

It seems to me that we are trading off between three things:

Ability/difficulty of an adversary to determine that ECH is in use (as opposed to that ECH is possible with a server)
Compatibility with split mode.
Ease of client implementation

As I understand the situation we have (Y is good in each column)

                Can't detect ECH-in-use        Split-Mode       Easy Client Impl
Current           Y                              Y                N
Option 1          N                              N                Y
Option 2          Y*                             Y                ?
Option 3          Y                              N                Y

* Except on rejection, for what that's worth.

Unless I misunderstand, Ben's proposals are just different spellings of (3).

I don't think anyone likes 1 so that leaves us with Current, 2, and 3.

I'm not that concerned about the "implementation" issues of Current, but I am quite concerned about the QUIC/DTLS issues, so I think we should strive to avoid that. That leaves us with 2 and 3.

As I understand it, the complexity issue is that if the server has turned off support for ECH entirely, then the connection will hard fail. However, with the current state of the spec I believe that this is effectively the state anyway. Here's the relevant text:

Note that authenticating a connection for the public name does not authenticate it for the origin. The TLS implementation MUST NOT report such connections as successful to the application. It additionally MUST ignore all session tickets and session IDs presented by the server. These connections are only used to trigger retries, as described in {{handle-server-response}}. This may be implemented, for instance, by reporting a failed connection with a dedicated error code.

IOW, we don't try to recover for cases where the client-facing server has just forgotten about ECH but rather where it has forgotten the ECH key. So, hard fail is actually not a problem here. I'm similarly not worried about signaling that the server is ECH-capable (especially in this case where it's actually not). So, I think my ? under "Easy Client Impl" turns out to be a Y.

With that said, I do find Option 3 aesthetically cleaner, though I'd struggle a bit to explain why, and there is an obvious appeal to not having any public signaling in terms of being able to analyze it, though I'm not sure how strong an argument that is. I'm also not that worried about the sticking out piece, as I would expect that we can rapidly make a lot of servers accept and pretend to respond to grease-ECH -- which we would want to do in any case.

I'd like to hear from some other people about why they prefer (3) to (2) and whether there is some split mode-compatible variant of (3).

davidben commented 4 years ago

Option (2) means ECH becomes much riskier to deploy. In options (0), (1), or (3), a service can advertise ECH in the DNS and not largely panic about inconsistencies between hard-to-predict DNS cache behavior, and hard-to-predict future or present rollout decisions on individual servers.

The only hard commitment is that the servers advertised in DNS are able to speak on behalf of the public name. As long as that's true, the client has an authenticated signal to recover from any DNS / server mismatches. In particular, if there is a problem and the service needs to roll back ECH (or if the service was in the process of rolling out ECH and missed a few spots initially), the existing TLS server behavior will be correctly interpreted by the client as an authenticated rollback, and the client can recover. This is true for every TLS feature I can think of to date: it is always safe to roll back. The exception that proves the rule is 0-RTT, which got a note in the spec (RFC8446, appendix D.3.) to describe client behavior to restore safety.

Option (2) breaks this invariant.

ekr commented 4 years ago

@davidben. Sorry, I had misread the specification and I agree with you. To recap for those following at home, the client's algorithm is presently:

1. Trial decrypt as if ECH accepted. If success, then proceed.
2. Trial decrypt as if ECH rejected. If failure, then abort.
3. If retry_keys is present, then restart with ECHConfig == retry_keys
4. If retry_keys is absent, then retry without ECH

It's point 4 that is relevant here. I had read the spec as requiring a hard failure, but it in fact recovers. That pushes me more towards (3).
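The four steps recapped above can be sketched as a small decision function. The function name, the boolean inputs (results of the two trial decryptions), and the returned strings are illustrative assumptions, not spec text.

```python
from typing import Optional


def handle_server_response(
    decrypts_as_accepted: bool,
    decrypts_as_rejected: bool,
    retry_keys: Optional[bytes],
) -> str:
    """Mirror steps 1-4 of the client's algorithm."""
    # 1. Trial decrypt as if ECH accepted. If success, then proceed.
    if decrypts_as_accepted:
        return "proceed"
    # 2. Trial decrypt as if ECH rejected. If failure, then abort.
    if not decrypts_as_rejected:
        return "abort"
    # 3. If retry_keys is present, restart with ECHConfig == retry_keys.
    if retry_keys is not None:
        return "retry with retry_keys"
    # 4. If retry_keys is absent, retry without ECH.
    return "retry without ECH"


print(handle_server_response(False, True, None))  # retry without ECH
```

Step 4 is the branch that lets a client recover when a server has rolled back ECH support entirely.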

bemasc commented 4 years ago

I'd like to hear ... whether there is some split mode-compatible variant of (3).

I believe I described one above. I'll rephrase, in case it was unclear:

context_tag is an extension that appears in ClientHello and ServerHello.
If the ClientHello contains a context_tag with empty contents, the server replies with a context_tag containing 16 random bytes.
Otherwise, the ServerHello's context_tag echoes the contents of the ClientHello's context_tag.
The client includes an empty context_tag in the outer ClientHello, and a context_tag containing 16 random bytes in the inner ClientHello.
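A small sketch of this echo variant. The function names are hypothetical, and the client-side check (comparing the ServerHello's tag with the tag it placed in the inner ClientHello) is my reading of how the client would use the response rather than something spelled out above.

```python
import hmac
import os
from typing import Tuple

TAG_LEN = 16


def make_client_tags() -> Tuple[bytes, bytes]:
    """Return (outer CH context_tag, inner CH context_tag)."""
    return b"", os.urandom(TAG_LEN)


def server_context_tag(ch_context_tag: bytes) -> bytes:
    """Any server: random bytes for an empty tag, otherwise echo the contents."""
    if len(ch_context_tag) == 0:
        return os.urandom(TAG_LEN)
    return ch_context_tag


def client_inner_ch_used(inner_tag: bytes, sh_tag: bytes) -> bool:
    """The echo only matches if the server answered the inner ClientHello."""
    return hmac.compare_digest(inner_tag, sh_tag)


outer_tag, inner_tag = make_client_tags()
# Split mode, accepted: the backend sees only the inner CH and echoes its tag.
print(client_inner_ch_used(inner_tag, server_context_tag(inner_tag)))
# Rejected: the server answers the outer CH, whose tag is empty.
print(client_inner_ch_used(inner_tag, server_context_tag(outer_tag)))
```

The backend needs no HPKE state and no coordination with the client-facing server; it only has to implement the echo rule.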

ekr commented 4 years ago

@bemasc this is only partially split-mode compatible in that it requires changing every origin server. It just doesn't require coordination between the servers.

bemasc commented 4 years ago

True! For any extension implementing Option 3, I think we want those changes anyway. If a ServerHello extension is only implemented by ECH terminating servers, then its presence distinguishes real ECH from GREASE. If it's widely implemented, then that signal is diminished.

Alternatively, we could say that the new extension is only for QUIC, and make TLS/TCP stick to trial decryption.

ekr commented 4 years ago

Sure. I'm comparing it to (2).

cjpatton commented 4 years ago

Hi all, see PR #283 for our proposal for Option (3), with @bemasc's improvements incorporated. The main points:

The client may request confirmation of ECH acceptance (Option (3)), but the default behavior is Option (0).
Acceptance is indistinguishable from rejection, which is the primary motivation for choosing Option (3) over Option (1).
Split Mode works for Option (3), but the backend server needs to support ECH. (All it has to do is echo an extension sent by the client-facing server.) There's no change to the backend if confirmation isn't requested.

ekr commented 4 years ago

I don't understand why this is better than (3). Won't every client just send ech_confirm, in which case this is isomorphic to 3?

cjpatton commented 4 years ago

Option (0) sticks out less than option (3), which is why a client might opt to not send "ech_confirm".

cjpatton commented 4 years ago

Hmm ... I think there should be a way to resolve the deployment issues between (2) and (3). Will post on Monday.

ekr commented 4 years ago

(3') This seems strictly worse than either (0) or (3). We'll have an odd mix of people doing one or the other and every server will have to do and test both.

cjpatton commented 4 years ago

@ekr:

(3') This seems strictly worse than either (0) or (3). We'll have an odd mix of people doing one or the other and every server will have to do and test both.

You're suggesting that confirmation SHOULD NOT be optional, correct? I'm OK with this, but it seems to me that it's not especially complicated to implement this correctly on the server side. The hard bit is the client, since it has to do trial decryption.

@davidben:

Option (2) means ECH becomes much riskier to deploy.

I'd like to drill down on the problem of rolling back ECH. The essence of the problem is that an ECH server that advertised a configuration in the past must support ECH for as long as that configuration is valid. What are some "bad" events that may lead to this contract being violated?

The ECH secret key has been compromised, so the service needs to be shut off until a new key is rolled. In the meantime, the service can explicitly reject ECH without providing a new configuration. The client would take this as a signal of ECH being disabled by the server.
A bug is found somewhere in the TLS stack, and the quick fix is to revert to a point before the ECH code was committed. This is definitely a risk, but I wonder if there are ways to mitigate it. For example, if the configuration is valid for the next hour, say, then we must delay the rollback until the hour has lapsed. Alternatively, we can roll-back right away, and when the client aborts because of decryption failure, it might make a DNS query to see if ECH support has been turned off.

davidben commented 4 years ago

@cjpatton

I agree that the first issue is fine by option (2). (Though probably the mitigation would be to roll out a new key. One hopes that pipeline is already built out and regularly exercised by routine key rotation.)

I'm worried about the second one, but I think the characterization is too simple. TLS implementations are part of a complex system, both within the service and on the internet as a whole. Complex systems break unpredictably. Maybe it's a TLS bug. Maybe ECH inadvertently broke some assumption in some other part of the stack. Maybe some large client had a bug that only triggered here due to some quirk of the server. Maybe some printer happened to be using the code point and now breaks. Maybe it had nothing to do with TLS at all, but some other concurrent server change broke and the entire release needs to be rolled back.

Anything which interferes with the default response (rollback to a known good configuration) is expensive and risky. This risk needs to be communicated across a long game of telephone from...

the people who wrote the ECH spec, to...
the people who implemented it in the TLS library, to...
the people who integrated it into some server software, to...
the people who perhaps packaged the server software into some OS release, to...
the people who perhaps shipped the OS release in some server appliance, to...
the people who manage the rollout of the change in some deployment, to...
the people who noticed a failure in a faraway system and are trying to mitigate it

This is not practical, especially if we want ECH to be widely adopted.

To the alternatives you list, when things go wrong, the priority is to get the service working again. Leaving it broken until the ECH config expires is thus not great. Moreover, expiry itself is a property of a complex system (DNS), so it may not be clear when it actually expires. Mandating a client retry on decryption failure is more plausible (compatible with rollback), but it relies on caching properties of the DNS, which is where much of the deployment mismatch risk comes from in the first place.

cjpatton commented 4 years ago

Mandating a client retry on decryption failure is more plausible (compatible with rollback), but it relies on caching properties of the DNS, which is where much of the deployment mismatch risk comes from in the first place.

Can you elaborate on this issue a bit more? Suppose the DNS and ECH provider are the same entity, and suppose that entity can synchronize the DNS response with the rollback. I guess one potential pitfall is that the client could use a DNS response cached by its operating system? Any others?

richsalz commented 4 years ago

Suppose the DNS and ECH provider are the same entity,

That's a simplifying assumption and doesn't always hold. Even within an enterprise, it's not uncommon for the DNS folks to be a separate group from those running the webservers.

tlswg / draft-ietf-tls-esni