Zone apex Subject alternative name not working with wildcards

wirepatch commented 2 weeks ago

Welcome

[X] Yes, I'm using a binary release within 2 latest releases.
[X] Yes, I've searched similar issues on GitHub and didn't find any.
[X] Yes, I've included all information below (version, config, etc).

What did you expect to see?

This issue was originally reported at https://github.com/vancluever/terraform-provider-acme/issues/419 and the maintainer referred me here, so I am replicating my original bug report.

I was creating a wildcard certificate that also covers the zone apex, using Terraform:

resource "acme_certificate" "certificate" {
  account_key_pem           = acme_registration.registration.account_key_pem
  common_name               = "*.goik.sdi.hdm-stuttgart.cloud"
  subject_alternative_names = ["goik.sdi.hdm-stuttgart.cloud"]

  dns_challenge {
    provider = "rfc2136"
     ...
  }
  depends_on = [acme_registration.registration]
}

I expected a valid certificate to be created.

What did you see instead?

Certificate generation fails. When I omit the subject_alternative_names = ... entry, everything works fine. See my detailed BIND name server log analysis at https://github.com/vancluever/terraform-provider-acme/issues/419

How do you use lego?

Through Terraform ACME provider

Reproduction steps

1. Define the above-mentioned resource "acme_certificate" "certificate" {...}.
2. Run terraform apply.

The challenge fails because the wildcard and the apex do not use two separate zones, as described at https://github.com/vancluever/terraform-provider-acme/issues/419

Version of lego

Sorry, but I'm using the latest ACME Terraform provider and am unable to run the lego binary directly.

Logs

DNS Bind 9 logs:

```console
... updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': deleting rrset at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT
... updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': adding an RR at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT "9xRJx_tyhCOIY-17tpZQZOi608d8yZMd03xJgQA6Gio"
... updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': deleting rrset at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT
... updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': adding an RR at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT "a4GltWzN7vA4QgXs_55dpetl5x5nt2aHsYhTYoKMSvQ"
```

Terraform execution result:

```
...
acme_certificate.certificate: Still creating... [1m10s elapsed]
acme_certificate.certificate: Still creating... [1m20s elapsed]
╷
│ Error: error creating certificate: error: one or more domains had a problem:
│ [*.goik.sdi.hdm-stuttgart.cloud] propagation: time limit exceeded: last error: NS ns1.goik.sdi.hdm-stuttgart.cloud. did not return the expected TXT record [fqdn: _acme-challenge.goik.sdi.hdm-stuttgart.cloud., value: 9xRJx_tyhCOIY-17tpZQZOi608d8yZMd03xJgQA6Gio]: a4GltWzN7vA4QgXs_55dpetl5x5nt2aHsYhTYoKMSvQ
│ ...
```

ldez commented 2 weeks ago

Hello,

> did not return the expected TXT record [fqdn: _acme-challenge.goik.sdi.hdm-stuttgart.cloud., value: 9xRJx_tyhCOIY-17tpZQZOi608d8yZMd03xJgQA6Gio]: a4GltWzN7vA4QgXs_55dpetl5x5nt2aHsYhTYoKMSvQ

This feels like a DNS propagation issue: some TXT records are absent when Let's Encrypt checks the records.

The problem is not directly related to SAN (FYI CN is considered deprecated) but to the need to create and propagate several TXT records.
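As background on why each domain needs its own TXT value: in the DNS-01 challenge (RFC 8555, section 8.4) the record value is the unpadded base64url encoding of the SHA-256 digest of the key authorization, and the key authorization embeds a token that differs per authorization. A minimal sketch follows; the token and thumbprint strings are made-up placeholders, not values from this issue.

```python
import base64
import hashlib


def dns01_txt_value(token: str, account_thumbprint: str) -> str:
    """Compute the DNS-01 TXT value for one authorization (RFC 8555, 8.4)."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    # base64url without the trailing "=" padding
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


# Placeholder inputs (assumptions for illustration only).
thumbprint = "example-account-thumbprint"
wildcard_value = dns01_txt_value("token-for-wildcard", thumbprint)
apex_value = dns01_txt_value("token-for-apex", thumbprint)
print(wildcard_value)
print(apex_value)
```

Two authorizations on the same account therefore produce two different values, even when both must be published under the same record name.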

For me, it's neither a lego problem nor terraform-provider-acme problem but something related to your DNS: the propagation seems very slow.

ldez commented 2 weeks ago

~~Wait a minute :thinking: Your DNS logs are unexpected.~~

EDIT: I was surprised by the DNS logs: I thought that the 4 logs were at the same time. But there is no timestamp for the logs.

ldez commented 2 weeks ago

The rfc2136 implementation is sequential, so lego will try to handle the challenge domain by domain, not at the same time. So this is not related to the availability of several TXT records, but purely to the DNS propagation:

- the first TXT record is created, the challenge happens, and then the TXT record is removed.
- the second TXT record is created, the challenge happens, and then the TXT record is removed.

But when LE asks for the second TXT record, the first TXT record is still here, because the propagation of the previous actions (delete, creation) is not done.

So same conclusion, a DNS propagation issue, the propagation seems very slow.
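The sequence described above can be illustrated with a toy model. This is only a sketch under strong assumptions: real resolvers do not lag by a fixed number of steps, and `view_at`, `history` and the `value-1`/`value-2` strings are invented for illustration.

```python
def view_at(history, tick, lag):
    """Return what a resolver lagging `lag` ticks behind sees at `tick`."""
    return history[max(tick - lag, 0)]


# Authoritative content of the shared TXT name, one entry per step (tick).
history = [
    {"value-1"},  # tick 0: first record created
    {"value-1"},  # tick 1: first challenge validated
    set(),        # tick 2: first record removed
    {"value-2"},  # tick 3: second record created
    {"value-2"},  # tick 4: second challenge is checked
]

for lag in (0, 3):
    seen = view_at(history, tick=4, lag=lag)
    print(f"lag={lag}: resolver sees {sorted(seen)}, expected ['value-2']")
```

With no lag the second check sees the second value; with a lag of three steps it still sees the first one, which is the kind of mismatch reported in the error message.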

More details:

  common_name                =  "*.goik.sdi.hdm-stuttgart.cloud"
  subject_alternative_names  =  ["goik.sdi.hdm-stuttgart.cloud"]

A wildcard domain and the "base domain" will request the creation of TXT records with the same name:

| Domain | TXT record name |
|---|---|
| `*.goik.sdi.hdm-stuttgart.cloud` | `_acme-challenge.goik.sdi.hdm-stuttgart.cloud.` |
| `goik.sdi.hdm-stuttgart.cloud` | `_acme-challenge.goik.sdi.hdm-stuttgart.cloud.` |

This is different from two non-wildcard domains:

| Domain | TXT record name |
|---|---|
| `goik.sdi.hdm-stuttgart.cloud` | `_acme-challenge.goik.sdi.hdm-stuttgart.cloud.` |
| `wwww.goik.sdi.hdm-stuttgart.cloud` | `_acme-challenge.wwww.goik.sdi.hdm-stuttgart.cloud.` |

In this context (wildcard + "base domain"), the propagation delay matters because of this name overlap.
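The name overlap in the tables above can be reproduced with a few lines. This is a sketch, not lego's code: `challenge_record_name` is a hypothetical helper, and it assumes no CNAME delegation of the `_acme-challenge` name.

```python
def challenge_record_name(domain: str) -> str:
    """Return the DNS-01 TXT record name for a certificate identifier.

    Assumption: the _acme-challenge name is not delegated via CNAME.
    """
    name = domain.rstrip(".").lower()
    if name.startswith("*."):
        name = name[2:]  # the wildcard label is dropped for validation
    return f"_acme-challenge.{name}."


for d in ("*.goik.sdi.hdm-stuttgart.cloud", "goik.sdi.hdm-stuttgart.cloud"):
    print(d, "->", challenge_record_name(d))
```

Both identifiers map to the same record name, so two different TXT values have to be served under one name for the validations to succeed.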

wirepatch commented 2 weeks ago

> EDIT: I was surprised by the DNS logs: I thought that the 4 logs were at the same time. But there is no timestamp for the logs.

The DNS updates on the server side happen within one second. Complete log without truncation:

```
Jun 14 19:21:28 sdiservice named[28361]: client @0x7f234acbd168 217.245.243.187#48172/key goik.key: updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': deleting rrset at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT
Jun 14 19:21:28 sdiservice named[28361]: client @0x7f234acbd168 217.245.243.187#48172/key goik.key: updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': adding an RR at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT "JcFY2gug0IP9SAbOYCA6lrxbgilQr-YjpcVZiPDu9d0"
Jun 14 19:21:28 sdiservice named[28361]: client @0x7f234926d168 217.245.243.187#35971/key goik.key: updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': deleting rrset at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT
Jun 14 19:21:28 sdiservice named[28361]: client @0x7f234926d168 217.245.243.187#35971/key goik.key: updating zone 'goik.sdi.hdm-stuttgart.cloud/IN': adding an RR at '_acme-challenge.goik.sdi.hdm-stuttgart.cloud' TXT "mCiuV5VbdfCmT4CkdyvQFh5whtFRbDTqEK1DeARbv7s"
```

I do understand your conclusion about propagation times. But when using dig @8.8.8.8 ... the entries become visible almost instantly after the above BIND log entries show up, I'd say within two seconds at most. And dig only shows the second TXT value from above. terraform apply then continues for more than a minute before finally failing.
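Judging from the BIND log above, each update deletes the whole TXT RRset before adding its own record, which would explain why dig only returns the second value. The toy model below mimics that pattern; `ToyZone` and its methods are invented for illustration and are not part of BIND, lego or the Terraform provider.

```python
NAME = "_acme-challenge.goik.sdi.hdm-stuttgart.cloud."
FIRST = "JcFY2gug0IP9SAbOYCA6lrxbgilQr-YjpcVZiPDu9d0"
SECOND = "mCiuV5VbdfCmT4CkdyvQFh5whtFRbDTqEK1DeARbv7s"


class ToyZone:
    """In-memory stand-in for a zone: record name -> list of TXT values."""

    def __init__(self):
        self.rrsets = {}

    def replace_txt(self, name, value):
        """Mimic the logged update: delete the TXT RRset, then add one RR."""
        self.rrsets.pop(name, None)
        self.rrsets.setdefault(name, []).append(value)

    def append_txt(self, name, value):
        """Contrast case: add an RR without touching existing ones."""
        self.rrsets.setdefault(name, []).append(value)


replaced = ToyZone()
replaced.replace_txt(NAME, FIRST)
replaced.replace_txt(NAME, SECOND)
print(replaced.rrsets[NAME])  # only the second value survives

appended = ToyZone()
appended.append_txt(NAME, FIRST)
appended.append_txt(NAME, SECOND)
print(appended.rrsets[NAME])  # both values are present
```

If both updates arrive back to back, as in the log, a check for the first value can no longer succeed once the second update has been applied.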

ldez commented 2 weeks ago

> The DNS updates on the server side happen within one second.

This doesn't change my conclusion, because of the log message: "did not return the expected TXT record".
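To compare what was expected with what the name server actually returned, the message can be split into its parts. The sketch below assumes the message keeps exactly the shape shown in this issue; `parse_propagation_error` is a hypothetical helper, not a lego function.

```python
import re

# Pattern derived only from the error text quoted in this issue (assumption).
PATTERN = re.compile(
    r"did not return the expected TXT record "
    r"\[fqdn: (?P<fqdn>[^,\s]+), value: (?P<expected>[^\]]+)\]: (?P<found>.*)"
)


def parse_propagation_error(message: str):
    """Return fqdn, expected value and found value, or None if no match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


message = (
    "NS ns1.goik.sdi.hdm-stuttgart.cloud. did not return the expected TXT record "
    "[fqdn: _acme-challenge.goik.sdi.hdm-stuttgart.cloud., "
    "value: 9xRJx_tyhCOIY-17tpZQZOi608d8yZMd03xJgQA6Gio]: "
    "a4GltWzN7vA4QgXs_55dpetl5x5nt2aHsYhTYoKMSvQ"
)
print(parse_propagation_error(message))
```

In the reported error the expected value is the first one written to the zone, while the found value is the second one.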

This can only change something if the Terraform provider tries to overcome the sequential behavior, but I don't think so.

> But when using dig @8.8.8.8 ... the entries are visible quite instantaneous after the above bind log entries show up. I'd say within two seconds at max.

2 seconds is not super slow, but not super fast either (if the DNS changes happen within one second).
LE uses its own set of DNS servers to check the propagation (we don't know which ones).

This tool can help to check the propagation: https://unboundtest.com/

FYI, I don't know how the Terraform provider works, I just know how lego works.

For example, I don't know if you are using a custom DNS resolver: https://go-acme.github.io/lego/usage/cli/options/#dns-resolvers-and-challenge-verification

wirepatch commented 2 weeks ago

Thx for your swift reply and the detailed explanation. I consider myself a Terraform user. I'll try your DNS resolver hint, since I am indeed using a delegation to a custom DNS server for the zone in question.

vancluever commented 2 weeks ago

@wirepatch just FYI for next time when submitting an issue here (as mentioned in the referral doc): you'll want to replicate the issue with the lego CLI since, as @ldez mentioned, they don't work on the TF provider, so it's important that any reproductions are done in the tools they are responsible for. This helps rule out issues with the provider as well. Most TF configurations can be replicated with the CLI.

@ldez thanks for the help on this! Looking over this deeper and looking at your replies here, funny enough, I wonder if I found the issue. We have a wrapper provider for the DNS providers that allows folks to configure multiple providers, but it does not implement sequential. Do you think that might be the culprit? Sounds like in order to implement this properly we'd have to probe through our wrapper and make an opinionated decision on whether or not parallel solve was possible depending on the results from all providers in the set. What do you think?

wirepatch commented 2 weeks ago

Thx for the lego CLI link. I'm not sure, however, whether all Terraform-based scenarios are easy to replicate where timing issues are involved. I tried a workaround that handles the wildcard and apex zone separately, forcing their respective certificate creations to run in sequence using depends_on:

resource "acme_certificate" "certificateWild" {
  ...
  common_name = "*.goik.sdi.hdm-stuttgart.cloud"

  dns_challenge {
    provider = "rfc2136"
    ...
  }
  depends_on = [acme_registration.registration]
}

resource "acme_certificate" "certificateApex" {
  ...
  common_name = "goik.sdi.hdm-stuttgart.cloud"
  dns_challenge {...}
  depends_on = [acme_certificate.certificateWild]
}

To my surprise, this doesn't work either, most likely because of timing / TTL issues.

Being just a Terraform user, I may lack the deeper (DNS) knowledge required for this topic, but I am happy to follow your test proposals. You probably have more than enough resources for testing already. However, if you feel so inclined, I can send the required BIND HMAC keys for testing my particular Hetzner setup and make the logs accessible as well.

ldez commented 2 weeks ago

> We have a wrapper provider for the DNS providers that allows folks to configure multiple providers, but it does not implement sequential. Do you think that might be the culprit?

@vancluever Based on the error and the DNS logs, that could be the problem: the sequential behavior exists for providers that don't support multiple TXT records for the same domain (the wildcard + base domain case).

Those kinds of providers can only manage one DNS record at a time for a domain.

> Sounds like in order to implement this properly we'd have to probe through our wrapper and make an opinionated decision on whether or not parallel solve was possible depending on the results from all providers in the set. What do you think?

You should either apply the "sequential behavior" on the wrapper (but you will slow down all the providers) or handle 2 clients (one for sequential, one for parallel).

vancluever commented 2 weeks ago

@ldez thanks!

> You should either apply the "sequential behavior" on the wrapper (but you will slow down all the providers) or handle 2 clients (one for sequential, one for parallel).

Yeah, I don't think it's a big deal to apply to the whole wrapper mainly because I'm pretty sure the multi-provider scenario is an edge case. So if for some reason one provider is sequential and the other is parallel, I don't think it's a huge deal if both become sequential.
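The "everything becomes sequential" option could look roughly like the sketch below. All names (`ToyProvider`, `ToyWrapper`, `sequential_interval`) are hypothetical and only model the decision; this is not the provider's or lego's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProvider:
    name: str
    # Seconds to wait between challenges; 0 means parallel solving is fine.
    sequential_interval: float = 0.0


@dataclass
class ToyWrapper:
    providers: list = field(default_factory=list)

    def sequential_interval(self) -> float:
        """Largest interval requested by any wrapped provider, 0 if none."""
        return max((p.sequential_interval for p in self.providers), default=0.0)

    def must_solve_sequentially(self) -> bool:
        """One sequential provider makes the whole set sequential."""
        return self.sequential_interval() > 0


mixed = ToyWrapper([ToyProvider("rfc2136", 60.0), ToyProvider("other-dns")])
parallel_only = ToyWrapper([ToyProvider("other-dns")])
print(mixed.must_solve_sequentially(), mixed.sequential_interval())
print(parallel_only.must_solve_sequentially())
```

The trade-off is the one mentioned above: a mixed set is slowed down to the pace of its slowest member, which is acceptable if multi-provider setups are rare.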

I think there's enough information here to rule out lego at this time too, so feel free to close this and I'll handle it over on the provider side. Thanks again! :slightly_smiling_face:
