Closed wirepatch closed 2 weeks ago
Hello,
did not return the expected TXT record [fqdn: _acme-challenge.goik.sdi.hdm-stuttgart.cloud., value: 9xRJx_tyhCOIY-17tpZQZOi608d8yZMd03xJgQA6Gio]: a4GltWzN7vA4QgXs_55dpetl5x5nt2aHsYhTYoKMSvQ
This feels like a DNS propagation issue: some TXT records are absent when Let's Encrypt checks the records.
The problem is not directly related to SAN (FYI CN is considered deprecated) but to the need to create and propagate several TXT records.
For me, it's neither a `lego` problem nor a `terraform-provider-acme` problem but something related to your DNS: the propagation seems very slow.
Wait a minute :thinking: Your DNS logs are unexpected.
EDIT: I was surprised by the DNS logs: I thought that the 4 logs were at the same time. But there is no timestamp for the logs.
The rfc2136 implementation is sequential, so lego will try to handle the challenge domain by domain, not at the same time. So this is not related to the availability of several TXT records, but purely to the DNS propagation:
But when LE asks for the second TXT record, the first TXT record is still here, because the propagation of the previous actions (delete, creation) is not done.
So same conclusion, a DNS propagation issue, the propagation seems very slow.
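The sequence described above can be sketched as a toy model. This is only an illustration under stated assumptions: `laggingView`, its logical clock, and the token names are hypothetical and are not part of lego or of any DNS server. The model assumes a resolver (or secondary) that sees each change a fixed number of ticks after it was made on the primary.

```go
package main

import "fmt"

// snapshot is the complete TXT record set held by the primary from time `at` on.
type snapshot struct {
	at     int
	values []string
}

// laggingView is a hypothetical model of a resolver that only sees a change
// `delay` ticks after it happened on the primary.
type laggingView struct {
	delay   int
	history []snapshot // must be recorded in chronological order
}

// record stores the record set that the primary serves from time `at` on.
func (v *laggingView) record(at int, values ...string) {
	v.history = append(v.history, snapshot{at: at, values: values})
}

// lookup returns the newest record set that has become visible at time `now`.
func (v *laggingView) lookup(now int) []string {
	var visible []string
	for _, s := range v.history {
		if s.at+v.delay <= now {
			visible = s.values
		}
	}
	return visible
}

// contains reports whether want is one of the returned TXT values.
func contains(values []string, want string) bool {
	for _, v := range values {
		if v == want {
			return true
		}
	}
	return false
}

func main() {
	view := &laggingView{delay: 5}

	view.record(0, "token-wildcard") // first challenge: TXT record created
	view.record(10)                  // first challenge done: TXT record deleted
	view.record(11, "token-apex")    // second challenge: TXT record created

	// Checked too early: the first value is still what the resolver sees.
	fmt.Println(view.lookup(12), contains(view.lookup(12), "token-apex"))
	// Checked after the delay has passed: the second value is visible.
	fmt.Println(view.lookup(16), contains(view.lookup(16), "token-apex"))
}
```

In this model the lookup at tick 12 still returns only the first token, which is the same shape as the `did not return the expected TXT record` error: the expected value is the second token, the value found is the first.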
More details:
common_name = "*.goik.sdi.hdm-stuttgart.cloud"
subject_alternative_names = ["goik.sdi.hdm-stuttgart.cloud"]
A wildcard domain and the "base domain" will request the creation of TXT records with the same name:
| Domain | TXT record name |
|---|---|
| *.goik.sdi.hdm-stuttgart.cloud | _acme-challenge.goik.sdi.hdm-stuttgart.cloud. |
| goik.sdi.hdm-stuttgart.cloud | _acme-challenge.goik.sdi.hdm-stuttgart.cloud. |
This is different from two non-wildcard domains:
| Domain | TXT record name |
|---|---|
| goik.sdi.hdm-stuttgart.cloud | _acme-challenge.goik.sdi.hdm-stuttgart.cloud. |
| wwww.goik.sdi.hdm-stuttgart.cloud | _acme-challenge.wwww.goik.sdi.hdm-stuttgart.cloud. |
In this context (wildcard + "base domain"), the propagation delay matters because of this name overlap.
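The name overlap can be reproduced with a few lines of Go. This is a simplified sketch: `challengeFQDN` is a hypothetical helper written for this illustration, not lego's own function, and it ignores details such as CNAME delegation.

```go
package main

import (
	"fmt"
	"strings"
)

// challengeFQDN returns the TXT record name used for a DNS-01 challenge.
// A leading "*." is dropped, so a wildcard and its base domain share one name.
func challengeFQDN(domain string) string {
	domain = strings.TrimPrefix(domain, "*.")
	domain = strings.TrimSuffix(domain, ".")
	return "_acme-challenge." + domain + "."
}

func main() {
	domains := []string{
		"*.goik.sdi.hdm-stuttgart.cloud",
		"goik.sdi.hdm-stuttgart.cloud",
		"wwww.goik.sdi.hdm-stuttgart.cloud",
	}
	for _, d := range domains {
		fmt.Printf("%s -> %s\n", d, challengeFQDN(d))
	}
}
```

The first two domains map to the same record name, so their two challenge values have to live under, or take turns on, a single TXT name.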
> EDIT: I was surprised by the DNS logs: I thought that the 4 logs were at the same time. But there is no timestamp for the logs.
The DNS updates on the server side happen within one second. Complete log without truncation:
I do understand your conclusion about propagation times. But when using `dig @8.8.8.8 ...`, the entries are visible almost instantaneously after the above bind log entries show up, I'd say within two seconds at most. And `dig` only shows the second TXT entry value from above. `terraform apply` then continues for more than a minute until finally failing.
> The DNS updates on the server side happen within one second.

This doesn't change my conclusion because the log message says `did not return the expected TXT record`.
This can only change something if the Terraform provider tries to overcome the sequential behavior, but I don't think so.
> But when using dig @8.8.8.8 ... the entries are visible quite instantaneous after the above bind log entries show up. I'd say within two seconds at max.
This tool can help to check the propagation: https://unboundtest.com/
FYI, I don't know how the Terraform provider works, I just know how lego works.
For example, I don't know if you are using a custom DNS resolver: https://go-acme.github.io/lego/usage/cli/options/#dns-resolvers-and-challenge-verification
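A propagation check against one specific resolver can also be scripted. The sketch below uses only the Go standard library; `waitForTXT` and `lookupFunc` are hypothetical names chosen for this example, and the polling parameters are arbitrary. It queries `8.8.8.8` as in the `dig` test above. This says nothing about how lego or the Terraform provider verify the challenge internally.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// lookupFunc returns the TXT values currently visible for a name.
type lookupFunc func(name string) ([]string, error)

// waitForTXT polls lookup until want appears among the TXT values of fqdn,
// or until the given number of attempts is used up.
func waitForTXT(lookup lookupFunc, fqdn, want string, attempts int, interval time.Duration) (bool, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		values, err := lookup(fqdn)
		lastErr = err
		if err == nil {
			for _, v := range values {
				if v == want {
					return true, nil
				}
			}
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return false, lastErr
}

func main() {
	// Send every query to a fixed recursive resolver instead of the system one.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "8.8.8.8:53")
		},
	}
	lookup := func(name string) ([]string, error) {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return resolver.LookupTXT(ctx, name)
	}

	found, err := waitForTXT(
		lookup,
		"_acme-challenge.goik.sdi.hdm-stuttgart.cloud.",
		"9xRJx_tyhCOIY-17tpZQZOi608d8yZMd03xJgQA6Gio", // expected value from the error above
		3, 2*time.Second,
	)
	fmt.Println("found:", found, "last error:", err)
}
```

Note that a recursive resolver answers from its cache, so a positive result here and a negative result at the validating side can both be true at the same moment.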
Thx for your swift reply and the detailed explanation. I consider myself a Terraform user. I'll try your DNS resolver hint, since I am indeed using a delegation to a custom DNS server for the zone in question.
@wirepatch just FYI for next time: when submitting an issue here (as mentioned in the referral doc), you'll want to replicate the issue with the lego CLI. As @ldez mentioned, they don't work on the TF provider, so it's important that any reproductions are done in the tools they are responsible for. This helps rule out issues with the provider as well. Most TF configurations can be replicated with the CLI.
@ldez thanks for the help on this! Looking over this deeper and looking at your replies here, funny enough, I wonder if I found the issue. We have a wrapper provider for the DNS providers that allows folks to configure multiple providers, but it does not implement `sequential`. Do you think that might be the culprit? Sounds like in order to implement this properly we'd have to probe through our wrapper and make an opinionated decision on whether or not parallel solve was possible depending on the results from all providers in the set. What do you think?
Thx for the lego CLI link. However, I'm not sure that all Terraform-based scenarios are easy to replicate with respect to timing issues. I tried a workaround that handles the wildcard and the apex zone separately, forcing their respective certificate creations in sequence using `depends_on`:
    resource "acme_certificate" "certificateWild" {
      ...
      common_name = "*.goik.sdi.hdm-stuttgart.cloud"
      dns_challenge {
        provider = "rfc2136" ...
      }
      depends_on = [acme_registration.registration]
    }
    resource "acme_certificate" "certificateApex" {
      ...
      common_name = "goik.sdi.hdm-stuttgart.cloud"
      dns_challenge {...}
      depends_on = [acme_certificate.certificateWild]
    }
To my surprise, this doesn't work either, most likely because of timing / TTL issues.
Being just a Terraform user, I may lack the deeper (DNS) knowledge required for this topic, but I'm happy to follow your test proposals. You probably have more than enough resources for testing already. However, if you feel so inclined, I'll send the required DNS bind HMAC keys for testing my particular Hetzner setup and make the logs accessible as well.
> We have a wrapper provider for the DNS providers that allows folks to configure multiple providers, but it does not implement sequential. Do you think that might be the culprit?
@vancluever Based on the error and the DNS logs, it can be the problem: the sequential behavior is here for providers that don't support multiple TXT records for the same domain (it's for the case wildcard + base domain).
Those kinds of providers can only manage one DNS record at a time for a domain.
> Sounds like in order to implement this properly we'd have to probe through our wrapper and make an opinionated decision on whether or not parallel solve was possible depending on the results from all providers in the set. What do you think?
You should either apply the "sequential behavior" on the wrapper (but you will slow down all the providers) or handle 2 clients (one for sequential, one for parallel).
@ldez thanks!
> You should either apply the "sequential behavior" on the wrapper (but you will slow down all the providers) or handle 2 clients (one for sequential, one for parallel).
Yeah, I don't think it's a big deal to apply to the whole wrapper mainly because I'm pretty sure the multi-provider scenario is an edge case. So if for some reason one provider is sequential and the other is parallel, I don't think it's a huge deal if both become sequential.
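A rough sketch of that idea follows, under explicit assumptions. The `provider` and `sequentialProvider` interfaces are local stand-ins defined for this example, modeled on the Present/CleanUp pair and the "sequential" behavior discussed in this thread; `multiProvider`, `sequentialMultiProvider`, `wrap`, and the fakes are hypothetical names. It is not the wrapper code from terraform-provider-acme.

```go
package main

import (
	"fmt"
	"time"
)

// provider is a local stand-in for a DNS challenge provider.
type provider interface {
	Present(domain, token, keyAuth string) error
	CleanUp(domain, token, keyAuth string) error
}

// sequentialProvider is a local stand-in for providers that ask to be
// solved one domain at a time, with a pause between domains.
type sequentialProvider interface {
	Sequential() time.Duration
}

// multiProvider forwards every call to all wrapped providers.
type multiProvider struct{ providers []provider }

func (m *multiProvider) Present(domain, token, keyAuth string) error {
	for _, p := range m.providers {
		if err := p.Present(domain, token, keyAuth); err != nil {
			return err
		}
	}
	return nil
}

func (m *multiProvider) CleanUp(domain, token, keyAuth string) error {
	for _, p := range m.providers {
		if err := p.CleanUp(domain, token, keyAuth); err != nil {
			return err
		}
	}
	return nil
}

// sequentialMultiProvider is the same wrapper, but it also advertises
// the sequential behavior.
type sequentialMultiProvider struct {
	*multiProvider
	interval time.Duration
}

func (s *sequentialMultiProvider) Sequential() time.Duration { return s.interval }

// wrap makes the whole set sequential as soon as one wrapped provider is,
// using the longest interval requested by any of them.
func wrap(providers ...provider) provider {
	base := &multiProvider{providers: providers}
	sequential := false
	var longest time.Duration
	for _, p := range providers {
		if sp, ok := p.(sequentialProvider); ok {
			sequential = true
			if d := sp.Sequential(); d > longest {
				longest = d
			}
		}
	}
	if !sequential {
		return base
	}
	return &sequentialMultiProvider{multiProvider: base, interval: longest}
}

// Fakes used only to exercise the sketch.
type parallelFake struct{}

func (parallelFake) Present(string, string, string) error { return nil }
func (parallelFake) CleanUp(string, string, string) error { return nil }

type sequentialFake struct {
	parallelFake
	interval time.Duration
}

func (s sequentialFake) Sequential() time.Duration { return s.interval }

func main() {
	_, mixed := wrap(parallelFake{}, sequentialFake{interval: time.Minute}).(sequentialProvider)
	_, plain := wrap(parallelFake{}).(sequentialProvider)
	fmt.Println("mixed set is sequential:", mixed)
	fmt.Println("parallel-only set is sequential:", plain)
}
```

The design choice here is the one described above: a mixed set degrades to sequential for everyone, which costs some time but keeps a single client.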
I think there's enough information here to rule out lego at this time too, so feel free to close this and I'll handle it over on the provider side. Thanks again! :slightly_smiling_face:
Welcome
What did you expect to see?
The issue has also been reported at https://github.com/vancluever/terraform-provider-acme/issues/419 . The maintainer referred me to this site replicating my original bug report.
I was creating a wildcard certificate adding a domain apex using Terraform:
I was expecting a valid certificate being created.
What did you see instead?
Certificate generation fails. When omitting the `subject_alternative_names = ...` entry everything works fine. See my detailed bind name server log analysis at https://github.com/vancluever/terraform-provider-acme/issues/419
How do you use lego?
Through Terraform ACME provider
Reproduction steps
Defining the above-mentioned `resource "acme_certificate" "certificate" {...}`.
Executing `terraform apply`.
The challenge fails because two separate zones are not used for the wildcard and the apex, as described at https://github.com/vancluever/terraform-provider-acme/issues/419 .
Version of lego
Logs
DNS Bind 9 logs:
Terraform execution result:
Go environment (if applicable)