hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

azurerm_key_vault_certificate timeout / retry configuration on creation is too short #12268

Closed alexinthesky closed 2 years ago

alexinthesky commented 3 years ago

Community Note

Terraform (and AzureRM Provider) Version

Terraform v1.0.0 on linux_amd64

Affected Resource(s)

Terraform Configuration Files

resource "azurerm_key_vault_certificate" "le-cert" {
  for_each = { for le in local.les : le.id => le }
  lifecycle {
    ignore_changes = [
      certificate_policy,
      name,
    ]
  }
  name         = "legal-entity-${each.value.id}"
  key_vault_id = "/subscriptions/${var.azure_sub}/resourceGroups/${var.azure_rg}/providers/Microsoft.KeyVault/vaults/${var.azure_kv}"
  certificate_policy {
    issuer_parameters {
      name = "Self"
    }
    key_properties {
      exportable = true
      key_size   = 2048
      key_type   = "RSA"
      reuse_key  = true
    }
    lifetime_action {
      action {
        action_type = "AutoRenew"
      }
      trigger {
        days_before_expiry = 30
      }
    }
    secret_properties {
      content_type = "application/x-pem-file"
    }
    x509_certificate_properties {
      extended_key_usage = ["1.3.6.1.5.5.7.3.1"]
      key_usage = [
        "cRLSign",
        "dataEncipherment",
        "digitalSignature",
        "keyAgreement",
        "keyCertSign",
        "keyEncipherment",
      ]
      subject            = "CN=${each.value.subdomain}"
      validity_in_months = 12
    }
  }
}

Debug Output

│ Error: Error waiting for Certificate "legal-entity-214" in Vault "https://mykv.vault.azure.net/" to become available: couldn't find resource (21 retries)
│
│   with azurerm_key_vault_certificate.le-cert["214"],
│   on main.tf line 33, in resource "azurerm_key_vault_certificate" "le-cert":
│   33: resource "azurerm_key_vault_certificate" "le-cert" {
│
╵

Panic Output

Expected Behaviour

Actual Behaviour

Steps to Reproduce

  1. terraform apply

Important Factoids

Here is an extract of the Key Vault diagnostic logs for one certificate creation (newest entries first):

We can see that the CertificateEnroll event arrives AFTER all the GET retries made by Terraform.

OperationName,"id_s","TimeGenerated [UTC]",ResultSignature
CertificateEnroll,"https://mykv.vault.azure.net/certificates/legal-entity-214/9667ff50c261492382bf157d7397c934","6/18/2021, 9:26:54.240 AM",
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:26:04.056 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:25:54.004 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:25:43.945 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:25:33.834 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:25:23.742 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:25:13.675 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:25:03.577 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:24:53.468 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:24:43.325 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:24:33.179 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:24:23.019 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:24:12.856 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:24:02.760 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:23:52.632 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:23:42.537 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:23:32.437 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:23:22.358 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:23:12.247 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:23:02.151 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:22:52.054 AM",OK
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:22:37.020 AM",OK
CertificateCreate,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:22:36.958 AM",Accepted
CertificateGet,"https://mykv.vault.azure.net/certificates/legal-entity-214","6/18/2021, 9:22:36.161 AM","Not Found"
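
The gap between the last poll and the enrollment can be checked with a little arithmetic on the timestamps in the extract. The following is a minimal Python sketch, not part of the original report: the variable names are illustrative, and the three timestamps are copied from the CertificateCreate row, the last CertificateGet row, and the CertificateEnroll row.

```python
from datetime import datetime

# Timestamp layout used in the "TimeGenerated [UTC]" column of the extract.
FMT = "%m/%d/%Y, %I:%M:%S.%f %p"


def parse(ts: str) -> datetime:
    """Parse one timestamp string copied from the log extract."""
    return datetime.strptime(ts, FMT)


create = parse("6/18/2021, 9:22:36.958 AM")    # CertificateCreate (Accepted)
last_get = parse("6/18/2021, 9:26:04.056 AM")  # last CertificateGet poll
enroll = parse("6/18/2021, 9:26:54.240 AM")    # CertificateEnroll

# How long polling continued after the create call was accepted.
polling_window_s = (last_get - create).total_seconds()
# How long the service took to enroll the certificate.
time_to_enroll_s = (enroll - create).total_seconds()
# How much longer polling would have had to continue.
shortfall_s = (enroll - last_get).total_seconds()

print(f"polling window: {polling_window_s:.3f} s")  # 207.098 s
print(f"time to enroll: {time_to_enroll_s:.3f} s")  # 257.282 s
print(f"shortfall:      {shortfall_s:.3f} s")       # 50.184 s
```

On these numbers, polling stopped after roughly three and a half minutes, about 50 seconds before the enrollment was logged. The 21 CertificateGet rows that follow the CertificateCreate row match the "21 retries" in the error message.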

References

alexinthesky commented 3 years ago

Adding this block doesn't seem to help:

  timeouts {
    create = "60m"
    read   = "60m"
    delete = "60m"
  }

a-mcf commented 3 years ago

I'm experiencing the same thing. It seemed to start some time on Thursday morning (US Eastern time): my deploys were working and then suddenly started failing without any change to the provider version. I was using AzureRM 2.60.0 at the time. Upgrading to 2.64.0 didn't help.

Despite the timeout error, the certificates are created and waiting, but they aren't tracked in state.

nikydobrev commented 3 years ago

I'm experiencing the same issue as well. It seems that it is not related to the azurerm provider version. I am currently using "2.58.0".

MariaPhoenix commented 3 years ago

Seeing the same issue since last week, using provider versions from 2.49 up to 2.64. It seems this may be some API change on the Azure side?

a-mcf commented 3 years ago

This seems to be region based to an extent. We've found that this works fine in US East 2, but US East is slower and fails. Either way, it looks like the timeout should be adjusted.

gk-fschubert commented 3 years ago

> This seems to be region based to an extent. We've found that this works fine in US East 2, but US East is slower and fails. Either way, it looks like the timeout should be adjusted.

We have the same issue in West Europe and Germany West Central.

mancaus commented 3 years ago

We're experiencing this in North Europe and raised a ticket. We've received confirmation that long running key vault operations in general are taking longer to complete, and that this is expected to last around a month.

alexinthesky commented 3 years ago

Agreed with what is said in #12347: the timeout happens 'somewhere else', so something may be off in the code that manages the create or read timeouts for this resource.

alexinthesky commented 3 years ago

^^ That's my first MR to the provider. How could I get a maintainer to look at it and get the pipelines to run?

jackofallops commented 3 years ago

Hi all - Azure have been rolling out (or possibly back? I'm having trouble getting details...) a patch to KeyVaults that was linked to performance problems in some regions, is this still a problem for folks here? I'll review the linked PR just in case, but hold off on a merge until I know it's still an ongoing issue.

a-mcf commented 3 years ago

@jackofallops - Support told me that they had a hotfix rolling out that was expected to be done by 7/15. Things have been working better for me. That said, if the resource wasn't correctly honoring timeout values and this fixes it, why not merge it regardless?

garretth9 commented 3 years ago

I was able to create several certificates today without seeing the failures we were seeing earlier in US East

jackofallops commented 3 years ago

Hi @a-mcf - It's not that the provider / resource isn't honouring timeouts, it's attempting to deal with eventual consistency of that resource under normal circumstances. The failure is due to the underlying service not performing as intended. Whilst we could allow this, and indeed every resource, its maximum deadline to complete, this would quickly become a time expensive operation. We attempt to balance these resource availability checks against realistic values for success, and then we tend to err on the generous side to be sure. Rather than simply just keep extending tolerances, we need to be mindful of not simply papering over genuine issues in the service. Does that make sense?

alexinthesky commented 3 years ago

Hi, I get your point. On the other hand, I feel that this only works by luck (we happen to fall within the default of 20 checks), which makes the whole thing seem a bit fragile to me, considering the fluctuation we observe in the time Azure takes to create some resources.
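
To illustrate the trade-off discussed in the two comments above, here is a minimal Python sketch of a poll loop that can be bounded either by a fixed number of checks or by an overall timeout. It is not the provider's actual implementation: `poll_until_ready`, `FakeClock`, and `simulate` are hypothetical names, and the 10-second interval and 257-second readiness delay are rough values read off the log extract earlier in this issue.

```python
import time
from typing import Callable, Optional, Tuple


def poll_until_ready(
    is_ready: Callable[[], bool],
    interval_s: float,
    max_attempts: Optional[int] = None,
    timeout_s: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it succeeds or a limit is hit.

    At least one of max_attempts / timeout_s should be set,
    otherwise the loop only ends when is_ready() returns True.
    """
    start = clock()
    attempts = 0
    while True:
        if is_ready():
            return True
        attempts += 1
        # Limit 1: a fixed number of checks, regardless of elapsed time.
        if max_attempts is not None and attempts >= max_attempts:
            return False
        # Limit 2: give up if the next check would land past the timeout.
        if timeout_s is not None and (clock() - start) + interval_s > timeout_s:
            return False
        sleep(interval_s)


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def simulate(ready_after_s: float = 257.0, interval_s: float = 10.0,
             **limits) -> Tuple[bool, float]:
    """Run the poll loop against a resource that becomes ready after a delay."""
    clock = FakeClock()
    ok = poll_until_ready(
        lambda: clock.now >= ready_after_s,
        interval_s,
        clock=clock.time,
        sleep=clock.sleep,
        **limits,
    )
    return ok, clock.now


print(simulate(max_attempts=21))   # (False, 200.0): gives up before 257 s
print(simulate(timeout_s=3600.0))  # (True, 260.0): succeeds well inside 60m
```

With a fixed count, the effective wait is roughly the number of checks times the interval, so a longer `timeouts` value has no effect on it. With a timeout bound, a slow but eventually successful operation still completes, at the cost of waiting longer when the service is genuinely broken.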

tombuildsstuff commented 2 years ago

👋

Since the issue in the upstream Azure API has since been fixed, I believe this has been resolved. As such, I'm going to close this issue for the moment, but if you're still facing this on the latest version of the Provider then please open a new issue and we'll take another look.

Thanks!

github-actions[bot] commented 2 years ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.