oseiberts11 opened 1 year ago
I found issues #1710 and #1597, which from their titles seemed related. However, they both seem to have something of an opposite problem: the propagation check never passes ("time limit exceeded: last error: NS xxxx. did not return the expected TXT record").
Hello,
I think the problem is not related to the propagation check; it's an error from Let's Encrypt. Let's Encrypt doesn't use the nameservers defined in the resolvers.
I understand that Let's Encrypt doesn't use the same resolvers as the ones I supply.
What I suspect is the following scenario: the propagation check passes too soon, before our DNS server actually has the TXT record available. It always takes a bit of time to process this. Then, when Let's Encrypt looks, the record is not there yet and the challenge fails.
This theory is strengthened by the fact that even using a completely broken name server doesn't result in a failing propagation check.
the propagation check is passing too soon, before our DNS server actually has the TXT record available
The propagation check verifies that another (recursive) resolver can retrieve the TXT record from your DNS server. It can only succeed once your DNS server has made the TXT record available.
There is, however, a catch: when your DNS provider has an internal update propagation delay (say their server fleet does rolling updates, and some of their servers already have your TXT record while others do not yet), you can easily run into a situation where 4.4.4.4 queries the set of DNS servers with the update, and Let's Encrypt queries the other set.
Internally, Lego will sleep for DESIGNATE_POLLING_INTERVAL seconds (10 by default) before probing the first propagation check. You could increase this to, say, DESIGNATE_POLLING_INTERVAL=60 to wait a minute between the DNS update and the first query.
An additional note: the propagation check is retried until it passes or the overall timeout is reached (DESIGNATE_PROPAGATION_TIMEOUT=600 by default).
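For illustration, a minimal sketch of that wait-then-retry behaviour, assuming a made-up waitForPropagation helper (this is not lego's actual code):

package main

import (
    "fmt"
    "time"
)

// waitForPropagation sleeps one interval before the first probe (the
// DESIGNATE_POLLING_INTERVAL behaviour), then keeps probing until check
// succeeds or the overall timeout (DESIGNATE_PROPAGATION_TIMEOUT) elapses.
func waitForPropagation(check func() (bool, error), interval, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        time.Sleep(interval)
        ok, err := check()
        if ok {
            return nil
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("time limit exceeded: last error: %v", err)
        }
    }
}

func main() {
    check := func() (bool, error) {
        return false, fmt.Errorf("NS did not return the expected TXT record")
    }
    fmt.Println(waitForPropagation(check, 2*time.Second, 10*time.Second))
}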
It could perhaps be something like that. I am trying to approach the matter from different angles (possibly there was some network congestion at Let's Encrypt, for example), but so far nothing jumps out as a conclusive root cause.
Would it be possible, for debugging purposes, to keep the TXT record and not delete it? What we see when observing a run of lego is that the record gets created and then deleted a bit later, but due to the latencies of the various commands, it is difficult to determine exactly how this relates to the checks by Let's Encrypt.
And I will definitely try setting DESIGNATE_POLLING_INTERVAL=60 to see if that makes a difference.
Would it be possible for debugging purposes to keep the TXT record and not delete it?
No, not out-of-the-box. I can however prepare an unofficial build (later, when I'm back from work). Do you prefer a binary (if so, for which OS/variant), or a Dockerfile to build it yourself?
Thanks for the offer. I think the Dockerfile would work fine.
@oseiberts11, this should help you debugging:
The patch can be found here: https://gist.github.com/dmke/f2d31407cc17d7801a0f32ebbe6cd283.
To build a drop-in replacement for the goacme/lego:v4.9.1 image, copy the Dockerfile to your system and run:
$ DOCKER_BUILDKIT=1 docker build --tag syseleven/lego:v4.9.1-debug1777 .
You probably don't want to distribute the image, as it skips the cleanup procedure entirely.
Thanks for the Dockerfile; we tried another round of testing with it. Another failure, unfortunately. But at least we could see that all of Designate's name servers did get the TXT record, and none of them returned SERVFAIL as far as we could tell.
We formed another theory about something that may go wrong.
With the sparse description of the --dns.resolvers option, we kind of expected that they would be used to check that the TXT record had propagated, i.e. could be found via all of them. This, however, would lead to cache poisoning: if a recursive resolver is queried for the TXT record before it exists, the negative answer is cached for the zone's negative TTL, and subsequent queries keep seeing the stale NXDOMAIN even after the record has been published.
Observing the code, it looks like lego doesn't do exactly that, but something close enough. Our plan was first to replace the name servers given to --dns.resolvers with the authoritative servers. But what the code does would likely make that fail, since those servers are queried recursively, and for other lookups than just this one.
// checkDNSPropagation checks if the expected TXT record has been propagated to all authoritative nameservers.
func (p preCheck) checkDNSPropagation(fqdn, value string) (bool, error) {
    // Initial attempt to resolve at the recursive NS
    // This could cause negative cache poisoning!
    r, err := dnsQuery(fqdn, dns.TypeTXT, recursiveNameservers, true)
    if err != nil {
        return false, err
    }

    if !p.requireCompletePropagation {
        return true, nil
    }

    if r.Rcode == dns.RcodeSuccess {
        fqdn = updateDomainWithCName(r, fqdn)
    }

    authoritativeNss, err := lookupNameservers(fqdn)
    if err != nil {
        return false, err
    }

    return checkAuthoritativeNss(fqdn, value, authoritativeNss)
}
So it first publicly and recursively asks for the TXT record. This can cause cache poisoning. Then it finds the authoritative name servers for the TXT record, and then it asks all of those servers if the TXT record is already there.
For testing purposes, I think we will remove the lines before the lookupNameservers() call and see if that helps.
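For illustration, a sketch of that modification (my reading of the plan, not an official patch), reusing the helpers from the snippet above; without the recursive pre-query, note that a CNAME on the challenge record would no longer be followed:

func (p preCheck) checkDNSPropagation(fqdn, value string) (bool, error) {
    // Skip the recursive pre-query entirely, so no resolver can cache a
    // negative answer; ask the authoritative servers directly instead.
    authoritativeNss, err := lookupNameservers(fqdn)
    if err != nil {
        return false, err
    }
    return checkAuthoritativeNss(fqdn, value, authoritativeNss)
}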
That explanation is certainly sound.
In this case, wouldn't it be prudent to increase DESIGNATE_POLLING_INTERVAL to something > 500 and DESIGNATE_PROPAGATION_TIMEOUT to at least twice that amount?
OTOH, I was under the impression that Let's Encrypt directly queries the authoritative nameservers, and/or disregards TTLs for negative DNS answers. I can't find where I've read about this though, so maybe I'm misremembering (I haven't read the relevant sections of RFC 8555 in a while, either).
I tried such a long DESIGNATE_POLLING_INTERVAL, and I think it is too long for Let's Encrypt's patience. At least, that is how I interpret the resulting error: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: "5CA2kavYhd1YWvGGL2eAfq3rpqD4t-UjpyxauWnUX6OERG4"
I also reduced the TTLs of the SOA records of the zones to 1. The zones are special-purpose for the ACME challenge, so that should be OK. It didn't help: the next attempt still had a DNS lookup problem, even though the message was partly different: "2022/12/07 14:20:46 error: one or more domains had a problem:", "[*.service.overlay.sys11cloud.net] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.service.overlay.sys11cloud.net - the domain's nameservers may be malfunctioning".
Perhaps I should link to the conversation on the Let's Encrypt forum: https://community.letsencrypt.org/t/was-there-some-dns-lookup-failure-in-recent-days/188778/7
urn:ietf:params:acme:error:badNonce is handled automatically by lego, which renews the nonce; this is not a blocking error, just a kind of warning.
So, whatever the problem was, it mysteriously went away, at least long enough last Friday afternoon when I tried again.
Given what's said in the thread linked above, I would say that the way Let's Encrypt does lookups is not affected by how lego does them. So as far as I can tell now, there is nothing wrong with lego.
The only thing that could maybe be improved is a better description of the --dns.resolvers option. We made incorrect assumptions about how the given resolvers are used: we thought they would be used to check the availability of the TXT record, and that success for all of them would be required. Instead, if I understand it correctly now, they are used to resolve a CNAME for the TXT record (if there is one), and that requires only one working name server. Then the authoritative name servers for the TXT record are found, and they are queried directly.
Thank you for keeping us updated!
The only thing that could maybe be improved, is a better description of the --dns.resolvers option.
Agreed. When I'm back home, I'll prepare a PR for this.
This helps explain a situation that I faced during my last renewal. When verifying the challenge token, I kept getting no response on the TXT lookup, and this turned out to be my ISP blocking the IPv6 address from the AAAA record. I confirmed this by pinging it locally (no reply) and doing the same from a remote server (received reply). To bypass it, I added a record to my hosts file to resolve directly to the IPv4 address. It would be nice to have some additional flexibility, either:
1. If the AAAA record gets no response, retry on the A record; or
2. A setting to force using A or AAAA records if present.
If AAAA record has no response, retry on A record
Lego's DNS resolver library (github.com/miekg/dns) uses a plain Go net.Dialer to send UDP packets, which follows RFC 6555 ("Fast Fallback") by default.
I suspect your ISP blocks IPv6 by simply dropping packets rather than returning ICMP Admin-Prohibited messages, which throws that fallback off. The solution here is either to find a capable ISP, or to disable IPv6 on your host entirely.
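For completeness, pinning DNS traffic to IPv4 is possible at the library level; a hedged sketch (this is not an existing lego option, and the record name and resolver address are placeholders):

package main

import (
    "fmt"
    "log"

    "github.com/miekg/dns"
)

func main() {
    // "udp" would allow IPv6 transport; "udp4" restricts the query to IPv4.
    c := &dns.Client{Net: "udp4"}

    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("_acme-challenge.example.com"), dns.TypeTXT)

    r, _, err := c.Exchange(m, "8.8.8.8:53")
    if err != nil {
        log.Fatalf("query failed: %v", err)
    }
    fmt.Println(dns.RcodeToString[r.Rcode], len(r.Answer), "answer(s)")
}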
Set a setting to force using A or AAAA records if present
I'd be hesitant to introduce such a config setting, as it would create a disparity between the way lego and Let's Encrypt operate. This could easily lead to more confusion and make debugging even more tedious.
I suspect the far better solution is to add query logging to the DNS resolver, to see which queries went out and which answers came back - but then again, tcpdump -ni $IFACE udp port 53 would do the same.
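As a rough idea of what such logging could look like, a self-contained sketch with github.com/miekg/dns (the record name and resolver address are placeholders; this is not lego's code):

package main

import (
    "fmt"
    "log"

    "github.com/miekg/dns"
)

func main() {
    // Build the TXT question lego would ask during a propagation check.
    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("_acme-challenge.example.com"), dns.TypeTXT)
    fmt.Printf(">>> query:\n%s\n", m)

    // Send it to one resolver and print the full response, including rcode,
    // answer section, and round-trip time.
    r, rtt, err := new(dns.Client).Exchange(m, "9.9.9.9:53")
    if err != nil {
        log.Fatalf("exchange failed: %v", err)
    }
    fmt.Printf("<<< answer after %s:\n%s\n", rtt, r)
}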
I also started encountering this behavior, out of nowhere, after a year+ of certs renewing automatically without any issue.
I'm using the Docker image, running the command:
lego --path /lego --accept-tos --email public@lowbar.fyi --dns joker --domains 'lab.pins.atomized.org' --domains '*.lab.pins.atomized.org' renew
Logs:
2023/02/21 18:16:55 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/authz-v3/198980437317 :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: "1AADws2AaIsRrYBcCtVB7bxCXSW0j6wcJCtdxyp3Lj9JJMA"
2023/02/21 18:16:56 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/198980437317
2023/02/21 18:16:56 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/205312993556
2023/02/21 18:16:57 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/205312993566
2023/02/21 18:16:57 error: one or more domains had a problem:
[*.lab.pins.atomized.org] time limit exceeded: last error: NS c.ns.joker.com. did not return the expected TXT record [fqdn: _acme-challenge.lab.pins.atomized.org., value: (redacted)]:
[lab.pins.atomized.org] time limit exceeded: last error: NS a.ns.joker.com. did not return the expected TXT record [fqdn: _acme-challenge.lab.pins.atomized.org., value: (redacted)]:
I have solid IPv4 and IPv6 connectivity. I'll wait a couple of hours to make sure I don't run into a cached NXDOMAIN, set DESIGNATE_POLLING_INTERVAL=60, and see what happens. My certs expire in 99 hours; hopefully it'll work again before then. Not thrilled about the situation.
// checkDNSPropagation checks if the expected TXT record has been propagated to all authoritative nameservers.
func (p preCheck) checkDNSPropagation(fqdn, value string) (bool, error) {
    // Initial attempt to resolve at the recursive NS
    // This could cause negative cache poisoning!
    r, err := dnsQuery(fqdn, dns.TypeTXT, recursiveNameservers, true)
    // ...
}
So it first publicly and recursively asks for the TXT record. This can cause cache poisoning. Then it finds the authoritative name servers for the TXT record, and then it asks all of those servers if the TXT record is already there.
This line, which can cause negative caching, seems like the most likely culprit for this intermittent issue, which I'm having with a different DNS provider (not OpenStack Designate), because my SOA says to cache negative responses for an hour (3600s). I traced the line back to a commit from 7 years ago by Jan, which looks like it was added to bootstrap CNAME resolution. Is there another way that won't cause negative caching?
commit: https://github.com/go-acme/lego/commit/b594acbc2a17f1643568bb62d2ccf8250e444219?diff=split&w=1#
before: https://github.com/go-acme/lego/blob/c97b5a52a13532fe637e26b7bd00ba313fc1c51c/acme/dns_challenge.go#L77-L90
after: https://github.com/go-acme/lego/blob/b594acbc2a17f1643568bb62d2ccf8250e444219/acme/dns_challenge.go#L79-L105
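Per RFC 2308, the negative-caching TTL mentioned above is the lesser of the SOA record's own TTL and its MINIMUM field; here is a quick sketch to inspect it for a zone (zone name and resolver are placeholders):

package main

import (
    "fmt"
    "log"

    "github.com/miekg/dns"
)

func main() {
    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("example.com"), dns.TypeSOA)

    r, _, err := new(dns.Client).Exchange(m, "8.8.8.8:53")
    if err != nil {
        log.Fatalf("SOA lookup failed: %v", err)
    }
    for _, rr := range r.Answer {
        if soa, ok := rr.(*dns.SOA); ok {
            // RFC 2308: negative answers are cached for the lesser of the
            // SOA TTL and the SOA MINIMUM field.
            negTTL := soa.Hdr.Ttl
            if soa.Minttl < negTTL {
                negTTL = soa.Minttl
            }
            fmt.Printf("negative answers may be cached for up to %d seconds\n", negTTL)
        }
    }
}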
What did you expect to see?
A successful certificate generation.
What did you see instead?
We had an error message from Let's Encrypt:
2022/11/30 10:24:59 error: one or more domains had a problem: [cloud.syseleven.de] acme: error: 400 :: urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up TXT for _acme-challenge.cloud.syseleven.de - the domain's nameservers may be malfunctioning
but the propagation check with 2 name servers apparently passed. With a successful result, there would have been a message like
[cloud.syseleven.de] The server validated our request
before the last line.
Note that we used 4.4.4.4, which was a working public name server in the past, but apparently no longer is; it must have stopped working relatively recently (we discovered this while trying to debug the issue). As of now, it does not respond to any query. However, there was no indication from lego that there was a problem: it looks like it accepted the broken server as working and continued on as if everything was fine.
Even when we replaced 4.4.4.4 with another server, the next attempt failed in the same way.
This makes me think that the propagation check doesn't really work. How else could a random nameserver serve the correct TXT record (I surely hope that this is part of the check, right?) while the query fails when Let's Encrypt does it? I noticed that you also get the SERVFAIL error if the TXT record is simply missing. It seems extremely unlikely that the name servers worked long enough for a query via 8.8.8.8 to succeed, and then suddenly broke when Let's Encrypt did its query.
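A liveness probe for the configured resolvers would have exposed the dead server; a hedged sketch of what that could look like (not something lego does today):

package main

import (
    "fmt"
    "time"

    "github.com/miekg/dns"
)

func main() {
    // The resolvers a user might pass via --dns.resolvers.
    resolvers := []string{"8.8.8.8:53", "4.4.4.4:53"}

    c := &dns.Client{Timeout: 3 * time.Second}
    m := new(dns.Msg)
    m.SetQuestion(".", dns.TypeNS) // root NS: any working resolver can answer this

    for _, addr := range resolvers {
        if _, _, err := c.Exchange(m, addr); err != nil {
            fmt.Printf("resolver %s looks dead: %v\n", addr, err)
            continue
        }
        fmt.Printf("resolver %s is responding\n", addr)
    }
}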
How do you use lego?
Docker image
Reproduction steps
We use a gitlab CI pipeline to run this command periodically:
lego --accept-tos --dns designate --path /tmp/lego --dns.resolvers 8.8.8.8 --dns.resolvers 4.4.4.4 --server=https://acme-v02.api.letsencrypt.org/directory --email noreply@syseleven.de --key-type rsa4096 -d "*.cloud.syseleven.net" -d "*.infra.sys11cloud.net" -d "*.infrabk.sys11cloud.net" -d "*.infrabl.sys11cloud.net" -d "*.infrafe.sys11cloud.net" -d "cloud.syseleven.de" renew --preferred-chain "ISRG Root X1"
Version of lego
Logs
See above
Go environment (if applicable)
No response