letsencrypt / boulder

An ACME-based certificate authority, written in Go.
Mozilla Public License 2.0
5.21k stars 607 forks source link

MPIC: CAA rechecks should be multi-perspective #7808

Open jsha opened 5 days ago

jsha commented 5 days ago

Right now, VA.PerformValidation checks both the challenge and CAA, for both local and remote perspectives. However, if issuance happens more than 7 hours after validation, the RA rechecks CAA by calling VA.IsCAAValid. This function only checks the local perspective.

We want two things:

  1. Both initial CAA checks and CAA rechecks should check remote perspectives.
  2. Initial CAA checks and CAA rechecks should use very similar code paths.

One way to achieve (1):

Another way to achieve (1) and (2) together:

(If we go this route, we should share as much code as possible between PerformValidation and PerformValidationWithoutCAA)

Another approach that achieves (1) and (2) together:

This approach will increase latency for HTTP requests to /acme/finalize/. Currently that endpoint only incurs the overhead of checking DNS for a fraction of requests (those that reuse old authzs). If we take this approach it would incur that overhead for all requests. But it would bring our code more closely in alignment with how the BRs are written. CAA checking is specified as an issuance-time requirement, not a validation-time requirement. The mismatch between our implementation and the BRs led to an incident in 2020.

jsha commented 5 days ago

Hah, I missed that we in fact implemented "Add remote perspective checking to VA.IsCAAValid" in #7061 / #7221.

So I think that just leaves us with wanting to achieve:

Initial CAA checks and CAA rechecks should use very similar code paths.

I do still think it will be simplest to move all CAA checking to issuance time.

aarongable commented 5 days ago

I do still think it will be simplest to move all CAA checking to issuance time.

I'd prefer to do the opposite -- move all CAA checking to validation time, and shorted authorization lifetimes to be 7 hours and thereby remove the need for rechecking.

jsha commented 5 days ago

They aren't mutually exclusive. We could do one, then switch to the other. An important difference is that "All CAA checking happens at issuance time" is not as user-impacting as "maximum authz reuse is 7 hours," which means we can do it sooner. Given that we're adding complexity for MPIC, I think there's a need in the short term for an offsetting decrease in complexity and risk.

aarongable commented 5 days ago

I worry that the user impact of CAA-during-finalize will be surprisingly large, since CAA can be slow and finalize is (unfortunately) synchronous. I suspect that the number of finalize calls actually doing rechecking today is such a vanishingly small fraction that any failures or timeouts due to it are lost in the noise -- and that any such failures would suddenly represent a large fraction of issuance if every finalize did CAA checks. But hopefully a dig into the data will lay these fears to rest.

aarongable commented 5 days ago

Regardless, I think this exact discussion is why we decided to move forward with the middle option described in the top post here: create new "ValidateChallenge" and "CheckCAA" RPCs which perform only their named function (no validate-and-caa), and then let the RA decide when to call each of them. Then delete the old PerformValidation and IsCAAValid methods. This refactoring simplifies the VA code and allows us to do either caa-at-finalize or short-lived-authzs without having to continue to worry about the VA code: such changes would only be at the RA level.