Memory for ECH rejection #604

Open ekr opened 4 months ago

ekr commented 4 months ago

In https://mailarchive.ietf.org/arch/msg/tls/bvvWbtxJAiMfilfy32EvdaCszQ4/, Elardus Erasmus argues that the guidance to limit retries for servers which have rejected ECH is too broad, and that we should give different guidance for key/config changes than for simply disabling ECH. He also suggests a "holdoff period" after ECH has been securely disabled. I do think this gets at something real, but I'm not sure about the proposed resolution.

ISTM that there are two issues:

Limiting Retries

If the server securely disables ECH, then the client should expect a subsequent connection to succeed and there is no way for the server to ask for a retry. So I don't think we need a limit here.

However, if the server supplies a new key, it's possible to get into some kind of loop, for instance, if there are two servers with different keys and the client bounces between them. Rather than suggesting an unspecified limit, I think it would be better simply to forbid a connection initiated from a retry_config from itself causing a new connection. That means that if servers are so misconfigured that they offer a retry_config which then doesn't work, the connection will just fail, but that should be rare, and in any case you're in deep trouble at that point.
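
A rough sketch of that rule in Go (the connResult type and dialWithECH helper are hypothetical stand-ins, not any real TLS library's API):

```go
package ech

import "errors"

// connResult is an illustrative handshake outcome.
type connResult struct {
	ok          bool
	retryConfig []byte // set if the server rejected ECH and supplied a retry_config
}

// dialWithECH stands in for a real handshake attempt with the given ECH config.
func dialWithECH(host string, echConfig []byte) connResult {
	// ... real handshake elided ...
	return connResult{}
}

// connect attempts at most one retry: a connection initiated from a
// retry_config never itself causes a new connection.
func connect(host string, dnsConfig []byte) (connResult, error) {
	res := dialWithECH(host, dnsConfig)
	if res.ok || res.retryConfig == nil {
		return res, nil
	}
	retry := dialWithECH(host, res.retryConfig)
	if !retry.ok {
		// The server's own retry_config doesn't work: just fail.
		return retry, errors.New("ech: retry with server-supplied config failed")
	}
	return retry, nil
}
```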

Remembering Retry

Unless I've misread the text, it implies but doesn't say that the retry_configs and secure disablement are for this connection only, and that the client should not use them even for a connection initiated immediately after the one where it retries (as with multiple HTTP/1.1 connections). As Erasmus indicates, this increases load on the server as well as latency, and has to be balanced against the observation that the retry_configs are a tracking vector (as noted in the text). He suggests a "holdoff" where you don't retry ECH if it was securely disabled, but no change in cases where a valid retry_config was provided.

I wonder if we should be a bit more expansive about this, given that we have other TLS-related tracking vectors that clients already need to mitigate (e.g., resumption) and that there are already mechanisms to handle them. With that in mind, how about: the client remembers retry_configs and secure disablement like any other TLS state, applying the same lifetime and partitioning rules it already uses for resumption?
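
A sketch of what that could look like, assuming the client keys the cache the same way it already partitions resumption state (the key scheme, lifetime, and all names here are illustrative, not from the draft):

```go
package ech

import (
	"sync"
	"time"
)

type cachedRetry struct {
	config  []byte
	expires time.Time
}

// RetryCache stores retry_configs under the same partitioning scheme the
// client uses for session tickets, so it adds no new tracking surface.
type RetryCache struct {
	mu sync.Mutex
	m  map[string]cachedRetry
}

func NewRetryCache() *RetryCache {
	return &RetryCache{m: make(map[string]cachedRetry)}
}

// Put records a retry_config under a partition key (e.g., the top-level
// site, matching the client's resumption-state partitioning).
func (c *RetryCache) Put(partitionKey string, config []byte, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[partitionKey] = cachedRetry{config: config, expires: time.Now().Add(ttl)}
}

// Get returns a live retry_config for the partition, if any. Expired
// entries are dropped just like other time-limited TLS state.
func (c *RetryCache) Get(partitionKey string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.m[partitionKey]
	if !ok {
		return nil, false
	}
	if time.Now().After(e.expires) {
		delete(c.m, partitionKey)
		return nil, false
	}
	return e.config, true
}
```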

Am I missing something important?

davidben commented 4 months ago

There was a third issue, which is how the retry mechanism interacts with HTTPS-RR/SVCB's multi-CDN business. That is, if I connected previously and got retry configs, how do I know whether those retry configs apply to the next connection? For all I know, it may go through a completely different set of HTTPS-RR/SVCB records.

At the time I proposed the retry mechanism, SVCB's design was a little less settled, and while, as you say, the tracking concerns are actually pretty easy to resolve by saying "do the same thing you'd do with all other state", I didn't want to open both of those cans of worms at the same time as working through the retry mechanism itself. So my initial proposal just said it was one-time use, and I figured we could work through all that when/if someone was enthusiastic about remembering them, but the enthusiasm never quite materialized.

martinthomson commented 4 months ago

I think that the multi-CDN scenario is adequately addressed by having the retry only apply to the immediate case. It does get harder if you consider what conditions might allow you to reuse that configuration. The things that seem likely to invalidate that choice are name resolution producing different answers and changes in network conditions.

The sensible thing to do is not to specify anything concrete here, but to explain what might happen that would cause a cached value to become bad, and let clients make up their own minds about how long to keep them.

However, if a client uses a retry configuration and it turns out to be bad, I would expect that the server would want to be able to correct it with another retry configuration. That conflicts with a simpler rule you might write about not accepting retry configurations if the client is using one already.

So I see two options here:

  1. Keep things simple and only allow one retry. If you receive a retry, use it, then discard it. If you get a retry when you are using a retry, give up.
  2. Make things more complex. A retry configuration can be used for up to some amount of time. Clients need to limit the time over which they use it according to their own rules for partitioning state. They might also consider dropping the value if name resolution cannot produce the same result[^1], or if they detect any change in network conditions. However, if a retry configuration is used again after the first use (or maybe after some minimal period) and the server supplies a new retry configuration, the client needs to use the new one.

That last bit is fiddly. It's something that a server deployment might like to be able to manage with an explicit, TLS-level TTL for a retry configuration. Also, the whole bit where the first retry is special is going to be very annoying to implement.
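
To make the fiddly part concrete, here is a sketch of option 2's bookkeeping; the retryState fields, the lifetime parameter, and the helper name are all assumptions for illustration:

```go
package ech

import "time"

// retryState tracks a stored retry configuration under option 2.
type retryState struct {
	config   []byte
	expires  time.Time // client-chosen lifetime, or a future TLS-level TTL
	usedOnce bool      // set after the first connection that offers the config
}

// onRetryRejected handles the server rejecting a connection that offered
// the stored retry configuration and possibly supplying a corrected one.
func onRetryRejected(s *retryState, newConfig []byte, now time.Time, lifetime time.Duration) {
	if !s.usedOnce {
		// The first retry is special: if it fails, give up rather than
		// chain retries (the part that is annoying to implement).
		*s = retryState{}
		return
	}
	// After the first use, accept the server's corrected retry configuration.
	*s = retryState{config: newConfig, expires: now.Add(lifetime)}
}
```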

[^1]: This makes certain forms of RR selection design very likely to produce this outcome. That's OK, because it fails safe, though it complicates the choice for those who might prefer that sort of deployment, because it now comes with additional costs in the form of the extra handshakes that Erasmus is suggesting we try to minimize.

davidben commented 4 months ago

> However, if a client uses a retry configuration and it turns out to be bad, I would expect that the server would want to be able to correct it with another retry configuration.

One complication here: if you got a retry config from CDN A and accidentally used it for CDN B, you're unlikely to have the same public name, and CDN B will not actually be able to give you a retry config.

martinthomson commented 4 months ago

I think that this falls under the "if DNS gives you different answers, don't use the old retry configuration" clause.

Presumably, you would have resolved the name and concluded that the retry configuration was OK to reuse. But CDN B's DNS answer should not contain the same SVCB records at all, because the ECH configuration it carries would include CDN B's public name rather than the one in your cached retry configuration. The use of different public names also rules out anycast as a means of getting into this state.
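
A sketch of that clause, with a hypothetical record type standing in for a parsed HTTPS-RR/SVCB answer: reuse the cached retry configuration only if a fresh resolution still yields a matching public name.

```go
package ech

// httpsRecord is a stand-in for a parsed HTTPS-RR/SVCB record; only the
// public_name from its ech parameter matters here.
type httpsRecord struct {
	echPublicName string
}

// retryConfigStillApplies implements the "if DNS gives you different
// answers, don't use the old retry configuration" clause.
func retryConfigStillApplies(cachedPublicName string, fresh []httpsRecord) bool {
	for _, rr := range fresh {
		if rr.echPublicName == cachedPublicName {
			return true
		}
	}
	// e.g., CDN A's retry config against CDN B's records: fail safe and
	// fall back to the DNS-provided ECH configuration.
	return false
}
```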