smallstep / certificates

🛡️ A private certificate authority (X.509 & SSH) & ACME server for secure automated certificate management, so you can use TLS everywhere & SSO for SSH.
https://smallstep.com/certificates
Apache License 2.0
6.36k stars 417 forks source link

Implement RFC8555 (ACME spec) § 8.2 #168

Open dcow opened 4 years ago

dcow commented 4 years ago

Section 8.2 of the ACME spec details exactly how client and server retry should be handled during a challenge validation. We should implement this part of the spec. Namely, retry state needs to be included in the challenge resource, client requests should be throttled, and we should set appropriate retry-after headers on the challenge resource response and include error information about the result of each attempted validation.

More details: https://github.com/smallstep/certificates/pull/162#issuecomment-579611292

wesgraham commented 4 years ago

Hey all :) Quick question on the desired approach: I'm about to modify both validate() calls (on http01 and dns01 challenges) in acme/challenge.go to account for retries. The simple approach I'm thinking of would be to make a blocking call to time.sleep() in between attempts. I recognize this could also be made non-blocking (using timer tasks/channels) - however it would likely require additional changes to the module. Any thoughts here?

dcow commented 4 years ago

We need to be able to cancel future retries if a client request to validate comes in while we'd otherwise be sleeping. As far as I can tell, that pushes things toward the timer approach so we can cancel pending retries and start over when new requests come in. If that's overly complicated, I guess sleeping routines could check whether they've been canceled when they wake up or something, but after 3 minutes of thought, I think the timer approach is still preferable. cc @maraino or @dopey for a second opinion.

wesgraham commented 4 years ago

Makes sense - in that case I'll utilize the timer approach. Will also modify ValidateChallenge() in acme/authority.go to actively read retry messages generated by the timer task. Will submit a PR by end of week.

dcow commented 4 years ago

I'm picking this up in #242.

tomasfreund commented 2 years ago

Hi, any updates on this? Some of our ACME challenges occasionally get stuck in invalid state because of (probably) transient DNS errors or propagation delays. If this was implemented, it would probably fix a lot of our problems. Thanks

dopey commented 2 years ago

Hey @tomasfreund we ended up tabling this without merging. Then the implementer moved on to a different project, and we've pretty much let it sit since then because there hasn't been much interest from the community.

To my recollection, my main issue with this PR (in the state that we left it) was that it was "complicated". It required state to be saved between executions of step-ca. The way it was being stored, and then refreshed on start up, wound up being leaked into the ca.json, which I wasn't happy with. Basically, the PR needs more work before we could merge it.

I'll bring this up at our weekly triage meeting, but my guess is that we don't have bandwidth to prioritize this issue at this time, given that the workaround is probably just to retry the request.

dcow commented 2 years ago

If you're okay with this only being guaranteed to work in non "HA" contexts (single instance of step-ca) then you should be able to take out the ordinal (implication being that any updates to the retry state would need to be transactional and handled by the backing store and the transaction would need to span the length of the validation attempt) or just leave it in and pin it to 0 in the code and hide it from the config json for the time being.

Beyond that, I believe the remaining work is just one more pass to address outstanding comments on the PR (there are some minor but blocking logic updates needed) and then a pass on the tests to be more comprehensive. I believe I updated the existing tests which cover the changes made in the PR and they generally pass. However, the PR also involves converting some cases where the server would have returned a 4xx into a 2xx + error. I discovered there was no existing coverage for those specific scenarios and recall we wanted to add some coverage as well as run a smoke test using existing clients to make sure they properly handle the updated behavior.

That's about all I remember after a quick skim of #242.

dopey commented 2 years ago

Agree w/ everything @dcow said above.

I think we weren't sure at the time, and still aren't, if we wanted folks trying to use an HA setup to need extra configuration. We're no further along in answering that question.

dopey commented 2 years ago

Hey @tomasfreund we had the opportunity to discuss this issue a bit this morning and came away with a few questions / comments.

1) There are some ACME clients that will wait to finalize (or request verification) until they've first verified that the DNS probe is successful. Which ACME client are you using, and is it possible that switching clients would be an easy workaround?

2) We're trying to gauge how many users are affected by similar issues. For those affected, please drop a thumbs up on the issue (or this comment) or leave a comment. I know it's silly, but it helps us with prioritization. Thanks!

tomasfreund commented 2 years ago

Hi, we are using cert-manager which tries to verify the DNS but our dns-01 challenges still get stuck sometimes. We get There was a problem with a DNS query during identifier validation in the cert-manager challenge and "error":{"type":"urn:ietf:params:acme:error:dns","detail":"There was a problem with a DNS query during identifier validation"} on the CA side. We have found that creating a new challenge or restarting cert-manager helps most of the time. I suspect that this might have something to do with DNS propagation - even if cert-manager checks the existence of the DNS record with a lookup, it does not guarantee that is has already propagated to the DNS server the CA is using. If there is no retry there, I can understand why it would fail. This is a really big issue for us because having to monitor and manually retry challenges defeats the purpose of automatic certificate management.