oseiberts11 opened 1 year ago
I found issues #1710 and #1597, which from their titles seemed related. However, they both seem to have something of an opposite problem: the propagation check never passes ("time limit exceeded: last error: NS xxxx. did not return the expected TXT record").
Hello,
I think the problem is not related to the propagation check; it's an error from Let's Encrypt. Let's Encrypt doesn't use the nameservers defined in the resolvers.
I understand that Let's Encrypt doesn't use the same resolvers as the ones I supply.
What I suspect is the following scenario: the propagation check passes too soon, before our DNS server actually has the TXT record available. It always takes a bit of time to process this. Then, when Let's Encrypt looks, the record is not there yet and the challenge fails.
This theory is strengthened by the fact that even using a completely broken name server doesn't result in a failing propagation check.
the propagation check is passing too soon, before our DNS server actually has the TXT record available
The propagation check verifies that another (recursive) resolver can retrieve the TXT record from your DNS server. It can only succeed once your DNS server has made the TXT record available.
There is, however, a catch: when your DNS provider has an internal update propagation delay (say their server fleet does rolling updates, and some of their servers already have your TXT record while others do not yet), you can easily run into a situation where 4.4.4.4 queries the set of DNS servers with the update, and Let's Encrypt queries the other set.
Internally, Lego will sleep for DESIGNATE_POLLING_INTERVAL seconds (10 by default) before probing the first propagation check. You could increase this to, say, DESIGNATE_POLLING_INTERVAL=60 to wait a minute between the DNS update and the first query.
An additional note: the propagation check is retried until it passes or the overall timeout is reached (DESIGNATE_PROPAGATION_TIMEOUT=600 by default).
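For illustration, a minimal sketch of that wait-then-retry behaviour, assuming a made-up waitForPropagation helper (this is not lego's actual code):

package main

import (
    "fmt"
    "time"
)

// waitForPropagation sleeps one interval before the first probe (the
// DESIGNATE_POLLING_INTERVAL behaviour), then keeps probing until check
// succeeds or the overall timeout (DESIGNATE_PROPAGATION_TIMEOUT) elapses.
func waitForPropagation(check func() (bool, error), interval, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for {
        time.Sleep(interval)
        ok, err := check()
        if ok {
            return nil
        }
        if time.Now().After(deadline) {
            return fmt.Errorf("time limit exceeded: last error: %v", err)
        }
    }
}

func main() {
    check := func() (bool, error) {
        return false, fmt.Errorf("NS did not return the expected TXT record")
    }
    fmt.Println(waitForPropagation(check, 2*time.Second, 10*time.Second))
}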
It could perhaps be something like that. I am trying to approach the matter from different angles (possibly there was some network congestion at Let's Encrypt, for example), but so far nothing jumps out as a conclusive root cause.
Would it be possible, for debugging purposes, to keep the TXT record and not delete it? What we see when observing a run of lego is that the record gets created and then deleted a bit later, but due to the latencies of the various commands, it is difficult to determine exactly how this relates to the checks by Let's Encrypt.
And I will definitely try setting DESIGNATE_POLLING_INTERVAL=60 to see if that makes a difference.
Would it be possible for debugging purposes to keep the TXT record and not delete it?
No, not out-of-the-box. I can however prepare an unofficial build (later, when I'm back from work). Do you prefer a binary (if so, for which OS/variant), or a Dockerfile to build it yourself?
Thanks for the offer. I think the Dockerfile would work fine.
@oseiberts11, this should help you debugging:
The patch can be found here: https://gist.github.com/dmke/f2d31407cc17d7801a0f32ebbe6cd283.
To build a drop-in replacement for the goacme/lego:v4.9.1 image, copy the Dockerfile to your system and run:
$ DOCKER_BUILDKIT=1 docker build --tag syseleven/lego:v4.9.1-debug1777 .
You probably don't want to distribute the image, as it skips the cleanup procedure entirely.
Thanks for the Dockerfile; we tried another round of testing with it. Another failure, unfortunately. But at least we could see that all of Designate's name servers did get the TXT record, and none of them returned SERVFAIL as far as we could tell.
We formed another theory about something that may go wrong.
With the sparse description of the --dns.resolvers option, we kind of expected that they would be used to check that the TXT record had propagated, i.e. could be found via all of them. This, however, would lead to cache poisoning: if a recursive resolver is queried for the TXT record before it exists, the negative answer is cached for the zone's negative TTL, and subsequent queries keep seeing the stale NXDOMAIN even after the record has been published.
Observing the code, it looks like lego doesn't do exactly that, but something close enough. Our plan was first to replace the name servers given to --dns.resolvers with the authoritative servers. But what the code does would likely make that fail, since those servers are queried recursively, and for other lookups than just this one.
// checkDNSPropagation checks if the expected TXT record has been propagated to all authoritative nameservers.
func (p preCheck) checkDNSPropagation(fqdn, value string) (bool, error) {
    // Initial attempt to resolve at the recursive NS
    // This could cause negative cache poisoning!
    r, err := dnsQuery(fqdn, dns.TypeTXT, recursiveNameservers, true)
    if err != nil {
        return false, err
    }

    if !p.requireCompletePropagation {
        return true, nil
    }

    if r.Rcode == dns.RcodeSuccess {
        fqdn = updateDomainWithCName(r, fqdn)
    }

    authoritativeNss, err := lookupNameservers(fqdn)
    if err != nil {
        return false, err
    }

    return checkAuthoritativeNss(fqdn, value, authoritativeNss)
}
So it first publicly and recursively asks for the TXT record. This can cause cache poisoning. Then it finds the authoritative name servers for the TXT record, and then it asks all of those servers if the TXT record is already there.
For testing purposes, I think we will remove the lines before the lookupNameservers() call and see if that helps.
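For illustration, a sketch of that modification (my reading of the plan, not an official patch), reusing the helpers from the snippet above; without the recursive pre-query, note that a CNAME on the challenge record would no longer be followed:

func (p preCheck) checkDNSPropagation(fqdn, value string) (bool, error) {
    // Skip the recursive pre-query entirely, so no resolver can cache a
    // negative answer; ask the authoritative servers directly instead.
    authoritativeNss, err := lookupNameservers(fqdn)
    if err != nil {
        return false, err
    }
    return checkAuthoritativeNss(fqdn, value, authoritativeNss)
}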
That explanation is certainly sound.
In this case, wouldn't it be prudent to increase DESIGNATE_POLLING_INTERVAL to something > 500 and DESIGNATE_PROPAGATION_TIMEOUT to at least twice that amount?
OTOH, I was under the impression that Let's Encrypt directly queries the authoritative nameservers, and/or disregards TTLs for negative DNS answers. I can't find where I've read about this though, so maybe I'm misremembering (I haven't read the relevant sections of RFC 8555 in a while, either).
I tried such a long DESIGNATE_POLLING_INTERVAL, and I think it is too long for Let's Encrypt's patience. At least, that is how I interpret the resulting error: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: "5CA2kavYhd1YWvGGL2eAfq3rpqD4t-UjpyxauWnUX6OERG4"
I also reduced the TTLs of the SOA records of the zones to 1. The zones are special-purpose for the ACME challenge, so that should be OK. It didn't help: the next attempt still had a DNS lookup problem, even though the message was partly different: "2022/12/07 14:20:46 error: one or more domains had a problem:", "[*.service.overlay.sys11cloud.net] acme: error: 400 :: urn:ietf:params:acme:error:dns :: During secondary validation: DNS problem: SERVFAIL looking up TXT for _acme-challenge.service.overlay.sys11cloud.net - the domain's nameservers may be malfunctioning".
Perhaps I should link to the conversation on the Let's Encrypt forum: https://community.letsencrypt.org/t/was-there-some-dns-lookup-failure-in-recent-days/188778/7
urn:ietf:params:acme:error:badNonce is handled automatically by lego, which renews the nonce; this is not a blocking error, just a kind of warning.
So, whatever the problem was, it mysteriously went away, at least long enough last Friday afternoon when I tried again.
Given what's said in the thread linked above, I would say that the way Let's Encrypt does lookups is not affected by how lego does them. So as far as I can tell now, there is nothing wrong with lego.
The only thing that could maybe be improved is a better description of the --dns.resolvers option. We made incorrect assumptions about how the given resolvers are used: we thought they would be used to check the availability of the TXT record, and that success for all of them would be required. Instead, if I understand it correctly now, they are used to resolve a CNAME for the TXT record (if there is one), and that requires only one working name server. Then the authoritative name servers for the TXT record are found, and they are queried directly.
Thank you for keeping us updated!
The only thing that could maybe be improved, is a better description of the --dns.resolvers option.
Agreed. When I'm back home, I'll prepare a PR for this.
This helps explain a situation that I faced during my last renewal. When verifying the challenge token, I kept getting no response on the TXT lookup, and this turned out to be my ISP blocking the IPv6 address from the AAAA record. I confirmed this by pinging it locally (no reply) and doing the same from a remote server (received reply). To bypass it, I added a record to my hosts file to resolve directly to the IPv4 address. It would be nice to have some additional flexibility, either:
1. If the AAAA record gets no response, retry on the A record; or
2. A setting to force using A or AAAA records if present.
If AAAA record has no response, retry on A record
Lego's DNS resolver library (github.com/miekg/dns) uses a plain Go net.Dialer to send UDP packets, which follows RFC 6555 ("Fast Fallback") by default.
I suspect your ISP blocks IPv6 by simply dropping packets rather than returning ICMP Admin-Prohibited messages, which throws that fallback off. The solution here is either to find a capable ISP, or to disable IPv6 on your host entirely.
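For completeness, pinning DNS traffic to IPv4 is possible at the library level; a hedged sketch (this is not an existing lego option, and the record name and resolver address are placeholders):

package main

import (
    "fmt"
    "log"

    "github.com/miekg/dns"
)

func main() {
    // "udp" would allow IPv6 transport; "udp4" restricts the query to IPv4.
    c := &dns.Client{Net: "udp4"}

    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("_acme-challenge.example.com"), dns.TypeTXT)

    r, _, err := c.Exchange(m, "8.8.8.8:53")
    if err != nil {
        log.Fatalf("query failed: %v", err)
    }
    fmt.Println(dns.RcodeToString[r.Rcode], len(r.Answer), "answer(s)")
}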
Set a setting to force using A or AAAA records if present
I'd be hesitant to introduce such a config setting, as it would create a disparity between the way lego and Let's Encrypt operate. This could easily lead to more confusion and make debugging even more tedious.
I suspect the far better solution is to add query logging to the DNS resolver, to see which queries went out and which answers came back - but then again, tcpdump -ni $IFACE udp port 53 would do the same.
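As a rough idea of what such logging could look like, a self-contained sketch with github.com/miekg/dns (the record name and resolver address are placeholders; this is not lego's code):

package main

import (
    "fmt"
    "log"

    "github.com/miekg/dns"
)

func main() {
    // Build the TXT question lego would ask during a propagation check.
    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("_acme-challenge.example.com"), dns.TypeTXT)
    fmt.Printf(">>> query:\n%s\n", m)

    // Send it to one resolver and print the full response, including rcode,
    // answer section, and round-trip time.
    r, rtt, err := new(dns.Client).Exchange(m, "9.9.9.9:53")
    if err != nil {
        log.Fatalf("exchange failed: %v", err)
    }
    fmt.Printf("<<< answer after %s:\n%s\n", rtt, r)
}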
I also started encountering this behavior, out of nowhere, after a year+ of certs renewing automatically without any issue.
I'm using the Docker image, running the command:
lego --path /lego --accept-tos --email public@lowbar.fyi --dns joker --domains 'lab.pins.atomized.org' --domains '*.lab.pins.atomized.org' renew
Logs:
2023/02/21 18:16:55 [INFO] retry due to: acme: error: 400 :: POST :: https://acme-v02.api.letsencrypt.org/acme/authz-v3/198980437317 :: urn:ietf:params:acme:error:badNonce :: JWS has an invalid anti-replay nonce: "1AADws2AaIsRrYBcCtVB7bxCXSW0j6wcJCtdxyp3Lj9JJMA"
2023/02/21 18:16:56 [INFO] Skipping deactivating of valid auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/198980437317
2023/02/21 18:16:56 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/205312993556
2023/02/21 18:16:57 [INFO] Deactivating auth: https://acme-v02.api.letsencrypt.org/acme/authz-v3/205312993566
2023/02/21 18:16:57 error: one or more domains had a problem:
[*.lab.pins.atomized.org] time limit exceeded: last error: NS c.ns.joker.com. did not return the expected TXT record [fqdn: _acme-challenge.lab.pins.atomized.org., value: (redacted)]:
[lab.pins.atomized.org] time limit exceeded: last error: NS a.ns.joker.com. did not return the expected TXT record [fqdn: _acme-challenge.lab.pins.atomized.org., value: (redacted)]:
I have solid IPv4 and IPv6 connectivity. I'll wait a couple of hours to make sure I don't run into a cached NXDOMAIN, set DESIGNATE_POLLING_INTERVAL=60, and see what happens. My certs expire in 99 hours; hopefully it'll work again before then. Not thrilled about the situation.
// checkDNSPropagation checks if the expected TXT record has been propagated to all authoritative nameservers.
func (p preCheck) checkDNSPropagation(fqdn, value string) (bool, error) {
    // Initial attempt to resolve at the recursive NS
    // This could cause negative cache poisoning!
    r, err := dnsQuery(fqdn, dns.TypeTXT, recursiveNameservers, true)
    // ...
}
So it first publicly and recursively asks for the TXT record. This can cause cache poisoning. Then it finds the authoritative name servers for the TXT record, and then it asks all of those servers if the TXT record is already there.
This line, which can cause negative caching, seems like the most likely culprit for this intermittent issue, which I'm having with a different DNS provider (not OpenStack Designate), because my SOA says to cache negative responses for an hour (3600s). I traced the line back to a commit from 7 years ago by Jan, which looks like it was added to bootstrap CNAME resolution. Is there another way that won't cause negative caching?
commit: https://github.com/go-acme/lego/commit/b594acbc2a17f1643568bb62d2ccf8250e444219?diff=split&w=1#
before: https://github.com/go-acme/lego/blob/c97b5a52a13532fe637e26b7bd00ba313fc1c51c/acme/dns_challenge.go#L77-L90
after: https://github.com/go-acme/lego/blob/b594acbc2a17f1643568bb62d2ccf8250e444219/acme/dns_challenge.go#L79-L105
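Per RFC 2308, the negative-caching TTL mentioned above is the lesser of the SOA record's own TTL and its MINIMUM field; here is a quick sketch to inspect it for a zone (zone name and resolver are placeholders):

package main

import (
    "fmt"
    "log"

    "github.com/miekg/dns"
)

func main() {
    m := new(dns.Msg)
    m.SetQuestion(dns.Fqdn("example.com"), dns.TypeSOA)

    r, _, err := new(dns.Client).Exchange(m, "8.8.8.8:53")
    if err != nil {
        log.Fatalf("SOA lookup failed: %v", err)
    }
    for _, rr := range r.Answer {
        if soa, ok := rr.(*dns.SOA); ok {
            // RFC 2308: negative answers are cached for the lesser of the
            // SOA TTL and the SOA MINIMUM field.
            negTTL := soa.Hdr.Ttl
            if soa.Minttl < negTTL {
                negTTL = soa.Minttl
            }
            fmt.Printf("negative answers may be cached for up to %d seconds\n", negTTL)
        }
    }
}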
What did you expect to see?
A successful certificate generation.
What did you see instead?
We had an error message from Let's Encrypt:
2022/11/30 10:24:59 error: one or more domains had a problem: [cloud.syseleven.de] acme: error: 400 :: urn:ietf:params:acme:error:dns :: DNS problem: SERVFAIL looking up TXT for _acme-challenge.cloud.syseleven.de - the domain's nameservers may be malfunctioning
but the propagation check with 2 name servers apparently passed. With a successful result, there would have been a message like
[cloud.syseleven.de] The server validated our request
before the last line.
Note that we used 4.4.4.4, which was a working public name server in the past, but apparently no longer is; it must have stopped working relatively recently (we discovered this while trying to debug the issue). As of now, it does not respond to any query. However, there was no indication from lego that there was a problem: it looks like it accepted the broken server as working and continued on as if everything was fine.
Even when we replaced 4.4.4.4 with another server, the next attempt failed in the same way.
This makes me think that the propagation check doesn't really work. How else could a random nameserver serve the correct TXT record (I surely hope that this is part of the check, right?) while the query fails when Let's Encrypt does it? I noticed that you also get the SERVFAIL error if the TXT record is simply missing. It seems extremely unlikely that the name servers worked long enough for a query via 8.8.8.8 to succeed, and then suddenly broke when Let's Encrypt did its query.
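A liveness probe for the configured resolvers would have exposed the dead server; a hedged sketch of what that could look like (not something lego does today):

package main

import (
    "fmt"
    "time"

    "github.com/miekg/dns"
)

func main() {
    // The resolvers a user might pass via --dns.resolvers.
    resolvers := []string{"8.8.8.8:53", "4.4.4.4:53"}

    c := &dns.Client{Timeout: 3 * time.Second}
    m := new(dns.Msg)
    m.SetQuestion(".", dns.TypeNS) // root NS: any working resolver can answer this

    for _, addr := range resolvers {
        if _, _, err := c.Exchange(m, addr); err != nil {
            fmt.Printf("resolver %s looks dead: %v\n", addr, err)
            continue
        }
        fmt.Printf("resolver %s is responding\n", addr)
    }
}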
How do you use lego?
Docker image
Reproduction steps
We use a gitlab CI pipeline to run this command periodically:
lego --accept-tos --dns designate --path /tmp/lego --dns.resolvers 8.8.8.8 --dns.resolvers 4.4.4.4 --server=https://acme-v02.api.letsencrypt.org/directory --email noreply@syseleven.de --key-type rsa4096 -d "*.cloud.syseleven.net" -d "*.infra.sys11cloud.net" -d "*.infrabk.sys11cloud.net" -d "*.infrabl.sys11cloud.net" -d "*.infrafe.sys11cloud.net" -d "cloud.syseleven.de" renew --preferred-chain "ISRG Root X1"
Version of lego
Logs
See above
Go environment (if applicable)
No response