go-acme / lego

Let's Encrypt/ACME client and library written in Go
https://go-acme.github.io/lego/
MIT License

Provide an option to verify propagation against a custom list of DNS-servers #2276

Open Jonher937 opened 4 days ago

Jonher937 commented 4 days ago

Welcome

How do you use lego?

Binary and Traefik.

Detailed Description

Hi, we've hit an issue using the DNS-01 challenge with EAB and CNAME delegation, with EJBCA as the CA. This deployment scenario will probably become more common as ACME is adopted in enterprise environments.

I have included a diagram to try and explain our interpretation of the current flow, and where we see an issue.

[Diagram: lego]

  1. We issue a renewal request, all good so far
  2. We create the TXT record with the challenge
  3. LEGO uses our --dns.resolvers, which point to the same DNS-servers that the PKI will query for the challenge. LEGO uses these servers only to find the authoritative servers, as specified here. It finds the authoritative servers and can successfully query them to verify that they serve the TXT record with the correct value.
  4. LEGO gives the all clear to the PKI
  5. The PKI queries DNS against the company-wide DNS-service. This company DNS-service has network access to the authoritative DNS-servers over tcp/udp 53. Queries are forwarded and cached.

So now to the problem:

Essentially, we'd want something like this (behind a flag such as --dns.propagation-servers, of type []string) to specify additional servers to append to the checks in the checkAuthoritativeNss function.

https://github.com/Jonher937/lego/blob/a152249a1a02146604936099d3bf1a9d13999280/challenge/dns01/precheck.go#L76-L88

The full commit can be found here and serves as an example. I have not found a good way to pass a flag this deep into the process, so this issue is meant as a place to discuss whether and how this could be implemented.

In the end, our issue comes down to the DNS topology/implementation and slow (60+ seconds) propagation to the company DNS-service, which the PKI verifies against.

We have also tried the newly added --dns.propagation-wait option and have successfully obtained certificates when we set it high enough for the propagation to happen, but this might randomly fail if the zone update has not yet made its way to the company DNS-service.

ldez commented 4 days ago

Hello,

you can already do that:

Jonher937 commented 4 days ago

We have tried the disable-cp option: By setting this flag to true, disables the need to await propagation of the TXT record to all authoritative name servers.

Our problem is the reverse: the authoritative name servers reply correctly and do so straight away, but the servers the PKI uses for validation are slow to serve the record.

What we need to do is either:

The --dns.resolvers option is documented as only being used to:

Set the resolvers to use for performing (recursive) CNAME resolving and apex domain determination. For DNS-01 challenge verification, the authoritative DNS server is queried directly. source for the quote

I tried using the options you suggested, but lego does not wait for any propagation at all, unless of course I use --dns.propagation-wait in addition to those values. We need to verify that the propagation has occurred on the non-authoritative DNS-service before telling the PKI it's ready to do the DNS validation. If we don't, the PKI will check with the DNS-service, which does not know about this TXT value yet, and issuing the certificate will fail.

Jonher937 commented 3 days ago

Today we looked into why cert-manager has had more success and it looks like cert-manager has this option: dns01-recursive-nameservers-only documented here

--dns01-recursive-nameservers-only Forces cert-manager to only use the recursive nameservers for verification. Enabling this option could cause the DNS01 self check to take longer due to caching performed by the recursive nameservers.

danragnar commented 1 day ago

Makes sense to me to actually verify the TXT record on the provided recursive nameserver to minimize errors and load towards the PKI. Feels like a neater solution than the recently implemented --dns.propagation-wait flag.

I guess it's not needed in a regular setup (even with split-horizon DNS) if the PKI queries the authoritative nameserver, but otherwise there's a high risk that the PKI would fail the challenge because the zone has not yet refreshed on the recursor when the ACME client claims it's ready for challenge verification.

ldez commented 1 day ago

The --dns.resolvers option is documented as only being used to:

Set the resolvers to use for performing (recursive) CNAME resolving and apex domain determination. For DNS-01 challenge verification, the authoritative DNS server is queried directly. source for the quote

This option is not only used for zone detection and CNAME resolving, it's also used during propagation checks.

The NSs from --dns.resolvers are used first, before the authoritative NSs, but there is a difference from the authoritative NSs: lego goes through the resolvers and continues the process as soon as at least one resolver returns a successful answer.

This is why I said that --dns.resolvers + --dns.disable-cp will do the same thing as your proposal. But as I also said, lego will not check all the --dns.resolvers during the propagation check.

ldez commented 20 hours ago

I thought about this issue, and I found 2 solutions:

The second option is better because it allows checking all the recursive NSs and all the authoritative NSs.

With this option lego can have several interesting combinations:

This can be changed in the future (inside a major version) to dns.disable-propagation-ans and dns.disable-propagation-rns and by default checking all the recursive NSs and all the authoritative NSs. IMHO, the migration path will be easier with this approach.

ldez commented 19 hours ago

I opened PR #2284