vancluever / terraform-provider-acme

Terraform ACME provider
https://registry.terraform.io/providers/vancluever/acme/latest
Mozilla Public License 2.0
225 stars 74 forks source link

Bug: DNS provider configuration not passed correctly to lego when renewing multiple certs in parallell #235

Closed Laffs2k5 closed 1 year ago

Laffs2k5 commented 2 years ago

Hi,

We are experiencing an issue which, after some thorough digging, we believe to be specifically related to how this provider interacts with lego;

We experience that when we renew more than one cert in one run/parallel one of the DNS challenge configurations "wins" and is used for all certs being renewed. This results in renewal of all but one cert failing.

We are using the Azure DNS provider but as far as I can understand this should happen for all projects that have multiple certs renewed at once, where the lego DNS provider configuration differs between the certs.

In our specific case it is the AZURE_ZONE_NAME configuration that differs between certs. An example of how this fails is our last run in our DEV environment (logs from this run further down):

Normally one would suspect that this behavior is due to some bug in the DNS provider, but after reviewing debug-log and reading up on how to use their CLI (https://go-acme.github.io/lego/usage/cli/), I'm leaning towards this issue being caused by how lego is invoked from the Terraform ACME Provider. I was about to attempt to reproduce our issue with the CLI when it occurred to me that it would be impossible to have two values in an environment variable at once. The lego CLI supports multiple instances of --domains value on the command line but DNS provider configuration is expected to be defined in environment variables. Thus the only way of running two (or more) renewals in parallel from the command line would be to spawn two (or more) individual lego processes each with their own set of DNS provider configuration.

Based on the above my suspicion is that lego is not invoked with "enough separation" by the Terraform ACME Provider. I haven't looked into exactly how the lego library is invoked but I wouldn't be surprised if there is some set of data structures or environment variables that are shared between the two lego processes...

I've enabled debug logging from provider and terraform locally and manually verified that the dns_challenge actually differs between our various certs.

Please let me know if there is any more information I can supply to help you debug this issue ;-)

Our terraform cert resource loop:

resource "acme_certificate" "certificate" {
  for_each                  = local.dns_zones_with_gateway_config 
  account_key_pem           = tls_private_key.private_key.private_key_pem
  common_name               = each.key
  subject_alternative_names = ["*.${each.key}"]
  key_type                  = "4096" # RSA, 4096 bits
  min_days_remaining        = 60

  dns_challenge {
    provider = "azure"
    config = {
      AZURE_RESOURCE_GROUP = azurerm_resource_group.common_network_rg.name
      AZURE_CLIENT_ID     = data.azuread_service_principal.terraform_user.application_id
      AZURE_CLIENT_SECRET = data.azurerm_key_vault_secret.terraform_user_password.value
      AZURE_SUBSCRIPTION_ID = data.azurerm_client_config.current.subscription_id
      AZURE_TENANT_ID       = data.azurerm_client_config.current.tenant_id

      AZURE_ZONE_NAME = each.key
    }
  }
}

Relevant terraform output from failure in DEV environment:

module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["dev.farligeprodukter.no"]: Destroying... [id=https://REDACTED.vault.azure.net/certificates/cert-dev-farligeprodukter-no-9f0fc806-REDACTED/REDACTED]
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["saksbehandler.dev.eksplosiver.no"]: Destroying... [id=https://REDACTED.vault.azure.net/certificates/cert-saksbehandler-dev-eksplosiver-no-62672dce-REDACTED/REDACTED]
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["dev.farligeprodukter.no"]: Still destroying... [id=https://REDACTED.vault.a...82680/REDACTED, 10s elapsed]
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["saksbehandler.dev.eksplosiver.no"]: Still destroying... [id=https://REDACTED.vault.a...75efb/REDACTED, 10s elapsed]
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["dev.farligeprodukter.no"]: Still destroying... [id=https://REDACTED.vault.a...82680/REDACTED, 20s elapsed]
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["saksbehandler.dev.eksplosiver.no"]: Still destroying... [id=https://REDACTED.vault.a...75efb/REDACTED, 20s elapsed]
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["dev.farligeprodukter.no"]: Destruction complete after 22s
module.common_network.azurerm_key_vault_certificate.key_vault_web_certificate["saksbehandler.dev.eksplosiver.no"]: Destruction complete after 22s
module.common_network.acme_certificate.certificate["saksbehandler.dev.eksplosiver.no"]: Modifying... [id=62672dce-REDACTED]
module.common_network.acme_certificate.certificate["dev.farligeprodukter.no"]: Modifying... [id=9f0fc806-REDACTED]
module.common_network.acme_certificate.certificate["saksbehandler.dev.eksplosiver.no"]: Still modifying... [id=62672dce-REDACTED, 10s elapsed]
module.common_network.acme_certificate.certificate["dev.farligeprodukter.no"]: Still modifying... [id=9f0fc806-REDACTED, 10s elapsed]
module.common_network.acme_certificate.certificate["dev.farligeprodukter.no"]: Still modifying... [id=9f0fc806-REDACTED, 20s elapsed]
module.common_network.acme_certificate.certificate["saksbehandler.dev.eksplosiver.no"]: Still modifying... [id=62672dce-REDACTED, 20s elapsed]
module.common_network.acme_certificate.certificate["saksbehandler.dev.eksplosiver.no"]: Modifications complete after 23s [id=62672dce-REDACTED]
module.common_network.acme_certificate.certificate["dev.farligeprodukter.no"]: Still modifying... [id=9f0fc806-REDACTED, 30s elapsed]
...
module.common_network.acme_certificate.certificate["dev.farligeprodukter.no"]: Still modifying... [id=9f0fc806-REDACTED, 4m10s elapsed]
╷
│ Error: error: one or more domains had a problem:
│ [*.dev.farligeprodukter.no] time limit exceeded: last error: NS ns4-04.azure-dns.info. did not return the expected TXT record [fqdn: _acme-challenge.dev.farligeprodukter.no., value: REDACTED]: 
│ [dev.farligeprodukter.no] time limit exceeded: last error: NS ns4-04.azure-dns.info. did not return the expected TXT record [fqdn: _acme-challenge.dev.farligeprodukter.no., value: REDACTED]: 
│ 
│ 
│   with module.common_network.acme_certificate.certificate["dev.farligeprodukter.no"],
│   on common/common-network/certificates.tf line 21, in resource "acme_certificate" "certificate":
│   21: resource "acme_certificate" "certificate" {
│ 
╵
Error: Process completed with exit code 1.
vancluever commented 2 years ago

@Laffs2k5 I suspect your assessment is correct here, I haven't reproduced but seeing as there should only be one provider instance running it's definitely plausible parallel renewals would cause nondeterminism with how environment variables are set.

Unfortunately AFAIK, unless something has changed, the primary way to set DNS provider configuration in lego, even through its library API, is via environment variables. We'll have to investigate more and think of a workaround or fix.

vancluever commented 2 years ago

@Laffs2k5 PS: For now, you should be able to work around this in Terraform itself by disabling parallelism, ie: -parallelism=1, see https://www.terraform.io/cli/commands/apply#parallelism-n.

Laffs2k5 commented 2 years ago

@Laffs2k5 PS: For now, you should be able to work around this in Terraform itself by disabling parallelism, ie: -parallelism=1, see https://www.terraform.io/cli/commands/apply#parallelism-n.

Thanks. I'll look into using this. May not be a feasible solution for our project in the long run, boils down to what the increase in total execution time is. Either way I'll give it a go just to make sure this avoids the problem. Thanks for the quick response and a great provider 😄

Laffs2k5 commented 2 years ago

Can confirm that parallelism=1 worked around the problem. It is not feasible in the long run though as apply took a long long time for our project.

I see that it is possible to pass the DNS provider configuration as files (via environment variables) to lego:

The environment variable names can be suffixed by _FILE to reference a file instead of a value. More information here.

source: https://go-acme.github.io/lego/dns/azure/

Probably not any help to resolve this issue though 🤔 The result would be the same, that all invocations of lego would use the same set of files...

itspngu commented 2 years ago

Ran into a very similar issue using Google Cloud DNS, in my case I was creating certificates with a shared CN and an overlapping (but slightly different) list of SANs from a module that is used in 2 instances (staging/production). I failed to get a copy of the debug logs before "solving" the issue with -parallelism=1 but essentially I saw Google's API return a 409: Exists as the provider apparently tried to create 2 records for _acme_challenge.<CN> simultaneously. I've not had this issue in about a year of using the provider, seems weird?

vancluever commented 2 years ago

Just thought I'd mention I haven't forgotten about this!

Some notes though: just checking the provider and it seems that there is a NewDNSProviderConfig function in a number of the DNS providers in lego. The main issue is that this, or NewDNSProvider for that matter, are not interfaces, rather than just patterns. Additionally, the documented configuration for each provider is dependent on environment variables as most providers pull their configuration from env.GetOrDefault..., which means that configuration provided via NewDNSProviderConfig would not read the configuration from something like, say, a map.

Basically this would affect some of our automation and scalability in the repository now as we now automate the lego updates and DNS provider factory generation. Unfortunately it would be too much of a lift for me to have to keep up with every provider manually (ie: manually map each config variable to the internal struct values).

It could be neat to do something like in lego maybe move the env.GetOrDefault... functions to instead work off of annotated configuration structs that that could utilize either a supplied key/value map, a file, or the environment (roughly in that order). That would be a bit of a lift at lego, but would be good for all downstream dependencies that wanted to stop depending on either the environment or local files.

plemelin commented 2 years ago

Dropping a note here in the hopes of directing people a little bit better when researching this issue.

As of the beginning of October, we started seeing failures when trying to renew our certificates with DNSMadeEasy for multiple domains. We renew two sets of certificates in parallel (.XYZ.domain.com) and (.lan.XYZ.domain.com). One set would already be updated and the second would not. In our case the role would fail on the challenge verification.

Edit: This was working for us for the past 13 months and only started failing in October.

The DNSMadeEasy API is crashing with a very unhelpful 404. Setting TF_LOG to TRACE will not make lego print the requests:

dnsmadeeasy: unable to create record for XZY: request failed with HTTP status code 404:
<html><head><title>Apache Tomcat/7.0.12 - Error report</title><style><!--H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;} H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;} H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;} BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;} B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;} P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style> </head><body><h1>HTTP Status 404 - Not Found</h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b> <u>Not Found</u></p><p><b>description</b> <u>The requested resource (Not Found) is not available.</u></p><HR size="1" noshade="noshade"><h3>Apache Tomcat/7.0.12</h3></body></html>

The DNSMadeEasy API documentation has this to say about the 404:

This is a generic error response code which indicates an error in the request issued, typically formatting or syntax related.

Took me way too long to figure that the TXT record was deleting right after the first successful renewal. Setting -parallelism=1 to the apply command line solved the issue.

vancluever commented 2 years ago

@plemelin thanks for the data, knowing that more and more people are having to apply the workaround is giving me more ammo to work on addressing it. I think now that most of the smaller issues are worked out (save imports) I will shift my attention to it. Again it's a bit of a big lift as we're talking about making fundamental changes to lego, but let's see what happens.

plemelin commented 2 years ago

I'll add to this issues as I have discovered some additional faulty behaviors with other configurations

I'm generating 3 set of certificates:

When using ACME 2.11.1, the challenge sent to EasyDNS fails with a 403 because the acme_challenge that is sent is for XYZ.domain.com but it should be for otherdomain.com.

When using ACME 2.10.0, the challenge sent to EasyDNS contains the correct otherdomain.com but is unable to find the TXT record for some reason.

Same setup worked for another domain (with 3 certificate) I configured on Sept 29th.

We may be hitting 2 issues here. The clash in environment variables at LEGO level and there seems to be something with ACME itself.

vancluever commented 1 year ago

Hey folks! I have a fix for this I think! If anyone wants to try it out you can see the changelog for v2.13.0-beta1 and include that manually in Terraform by using a version pin for the version explicitly - see here.

This fix will allow the ACME provider to deliver a fix to this without necessarily burdening lego with a lot of extra work, or at least it takes the pressure off of what would likely be a large lift to fix every provider to be able to be configurable manually using the same name key/value pairs you see in the environment.

If anyone gives it a go let me know! One thing to keep in mind too is that I'm still working on adding provider timeout discovery support - see #277.

itspngu commented 1 year ago

If anyone gives it a go let me know!

I'll try to get around to it this week. Thank you!

vancluever commented 1 year ago

Hey folks,

v2.13.0-beta2 should be out now with the provider timeout support.

This pretty much sums up the work for this fix so if anyone can do some more testing for me and confirm fixes, that'd be awesome!

vancluever commented 1 year ago

Also, I did a test of the beta provider in TFC today and all is well :+1:

This was pretty much the last check for me before we go live with this fix.

I'm going to let v2.13.0-beta2 canary for a few days before releasing it full. If anyone tries it out and ends up seeing issues please let me know! Also I'd love to hear if this ultimately fixed the issue for you as well!

Laffs2k5 commented 1 year ago

Thank you so much for your efforts! I will most definitely try this out in our dev environment. Will let you know when I have something to share :-)

Laffs2k5 commented 1 year ago

v2.13.0-beta2 seems to fix our issue 🥳 I've been able to renew 13 certs in our dev environment without specifying -parallelism=1 , ie. with the default parallelism which is 10. Thanks once again!

vancluever commented 1 year ago

@Laffs2k5 amazing! I'll give the beta another day to breathe then and then release it. Thanks again for confirming!

vancluever commented 1 year ago

Hey folks,

Sorry it's a few days later than mentioned but v2.13.0 is now out! I'm going to close this out now as it has been resolved. Thanks for everyone's patience!