gregtwallace / certwarden

Cert Warden is a centralized ACME Client. It provides an API for certificate consumers to fetch their individual keys and certs with API keys.
https://www.certwarden.com/

Simultaneous Duplicate Challenge Resource Names Cause Worker Failure #23

Closed ntimo closed 1 year ago

ntimo commented 1 year ago

Hey, when I try to renew my certificates I get this error in the logs, and they are stuck in the pending state:

8/7/2023, 4:47:15 AM, error, orders/worker.go:130, dns-01 (acme.sh) can't add resource (_acme-challenge.domain.com), already exists and content does not match

gregtwallace commented 1 year ago

Strange, this is one of those errors I'd never actually expect to see.

Which DNS provider are you using? Can you search your log around the time of the first order and see if you have a log entry "dns-01 (acme.sh) could not remove resource"?

Are you by any chance using multiple ACME Servers around the same time? For example, renewing both Production and Staging certs at the same time for the same domain? I think this might cause the issue in that they'd both try to add the same domain with different values at the same time.

ntimo commented 1 year ago

No, I only use one ACME account on prod, but I have multiple domains that need to be renewed on it. My DNS provider is nsupdate. I also have a few invalid orders in the list for the cert, since the renewal failed because of a bad file permission on the nsupdate key. After fixing that, I am greeted with the above error when trying to renew a cert.

ntimo commented 1 year ago

Yes I also see the log line you mention.

gregtwallace commented 1 year ago

It is hard for me to know the exact issue without your setup. However, I reworked the only thing that I could logic out as a possible issue. Can you build the master branch and test if the issue is fixed?

If not, what OS are you on? I might be able to build it for you.

gregtwallace commented 1 year ago

Can you send me your debug log please? The only other thought I have is that delete is being called before add but I can’t figure out how that would be happening either.

ntimo commented 1 year ago

Sure, I have sent you an email.

gregtwallace commented 1 year ago

This issue is triggered when LeGo attempts to provision more than one of the same resource name at the same time. This can be triggered by:

  1. Placing one or multiple orders containing both a domain and its matching wildcard (e.g. test.com and *.test.com). Per RFC 8555, the wildcard resource name is the same as the non-wildcard resource name, but the needed resource values differ: these are two different Authorizations and thus two different Challenges, and each Challenge has a random token value,
  2. Placing orders to more than one ACME Provider (e.g. Let's Encrypt and Let's Encrypt Staging) at the same time for the same domain (or matching wildcard). Each Provider needs the same resource name, but again the resource values differ because they come from different Challenges and therefore different tokens, or
  3. Any other scenario that would need different values to be associated with the same resource name (there might not be any other scenarios, I'm not sure).
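
To make the first two cases concrete, here is a minimal Go sketch of how a dns-01 record name and value are derived under RFC 8555 (section 8.4). It is for illustration only and is not code from this project; the function names, tokens, and account thumbprint are all hypothetical placeholders.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// dns01RecordName returns the TXT record name for a dns-01 challenge.
// A leading "*." is dropped, so a wildcard and its base domain collide.
func dns01RecordName(identifier string) string {
	return "_acme-challenge." + strings.TrimPrefix(identifier, "*.")
}

// dns01RecordValue returns the TXT record value: the unpadded base64url
// encoding of SHA-256 over the key authorization (token "." thumbprint).
func dns01RecordValue(token, thumbprint string) string {
	sum := sha256.Sum256([]byte(token + "." + thumbprint))
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

func main() {
	// Hypothetical values; a real thumbprint comes from the account key
	// and real tokens are issued by the ACME server per Challenge.
	const thumbprint = "example-account-key-thumbprint"
	challenges := []struct{ identifier, token string }{
		{"test.com", "token-for-base-domain"},
		{"*.test.com", "token-for-wildcard"},
	}

	for _, c := range challenges {
		fmt.Printf("%-11s name=%s value=%s\n",
			c.identifier,
			dns01RecordName(c.identifier),
			dns01RecordValue(c.token, thumbprint))
	}
}
```

Both identifiers produce the name `_acme-challenge.test.com`, while the two values differ, which matches the "already exists and content does not match" error in the log.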

The fix for this is to have a master tracker in the challenges package that will block requests for a resource name if that name is already in use. The block will be removed once the previous request has completed and has been deprovisioned.

The previous commit on this issue fixes the problem but is sloppy. The next commit will clean this up and make it more efficient.