hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Terraform Plan does not settle due to case-sensitivity #21085

Open egorshulga opened 1 year ago

egorshulga commented 1 year ago

Is there an existing issue for this?

Community Note

Terraform Version

1.4.2

AzureRM Provider Version

3.47.0

Affected Resource(s)/Data Source(s)

azurerm_data_protection_backup_instance_blob_storage, azurerm_mssql_server_microsoft_support_auditing_policy

Terraform Configuration Files

(Related to existing resources, see description below)

resource "azurerm_mssql_server_microsoft_support_auditing_policy" "devopsAuditingSettings" {
  server_id             = azurerm_mssql_server.sqlServer.id
  blob_storage_endpoint = azurerm_storage_account.account.primary_blob_endpoint
}

resource "azurerm_data_protection_backup_instance_blob_storage" "storageBackup" {
  name               = azurerm_storage_account.storage.name
  vault_id           = azurerm_data_protection_backup_vault.backup.id
  location           = azurerm_resource_group.resourceGroup.location
  storage_account_id = azurerm_storage_account.storage.id
  backup_policy_id   = azurerm_data_protection_backup_policy_blob_storage.backupRetention.id
}

Debug Output/Panic Output

Terraform will perform the following actions:

  # azurerm_data_protection_backup_instance_blob_storage.storageBackup[0] will be updated in-place
  ~ resource "azurerm_data_protection_backup_instance_blob_storage" "storageBackup" {
      ~ backup_policy_id   = "/subscriptions/.../resourceGroups/Resourcegroup/providers/Microsoft.DataProtection/backupVaults/backup-vault/backupPolicies/retention" -> "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.DataProtection/backupVaults/backup-vault/backupPolicies/retention"
        id                 = "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.DataProtection/backupVaults/backup-vault/backupInstances/storage"
        name               = "storage"
        # (3 unchanged attributes hidden)
    }

  # azurerm_mssql_server_microsoft_support_auditing_policy.devopsAuditingSettings must be replaced
-/+ resource "azurerm_mssql_server_microsoft_support_auditing_policy" "devopsAuditingSettings" {
      ~ id                     = "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Sql/servers/sql-server/devOpsAuditingSettings/Default" -> (known after apply)
      ~ server_id              = "/subscriptions/.../resourceGroups/Resourcegroup/providers/Microsoft.Sql/servers/sql-server" -> "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Sql/servers/sql-server" # forces replacement
        # (3 unchanged attributes hidden)
    }

Expected Behaviour

Terraform Plan should settle down. After running terraform apply subsequent calls to terraform plan should not show any differences.

Actual Behaviour

Terraform Plan does not settle down. Terraform detects some differences. For azurerm_data_protection_backup_instance_blob_storage it is just an update operation, for azurerm_mssql_server_microsoft_support_auditing_policy it is even worse – it is a re-create operation.

Steps to Reproduce

We are migrating from a really large and long-lived ARM template to Terraform. We already have 5 environments that we must not re-create (this includes the Prod environment). We have ~200 resources in NonProd environments and ~240 resources in Prod environments.

Initially, when we implemented the Terraform code and imported all of the resources into state, Terraform wanted to re-create ~200 of them (effectively all). We analyzed the reason and found that the Resource Group name is causing this. It seems to be a well-known issue that has been reported multiple times: Azure is case-insensitive for almost all resources, while AzureRM is case-sensitive by default. The name of the Resource Group appears in resource IDs, and a lot of resources simply parse the ID to get the name of their Resource Group. Additionally, Resource Group names are not stored consistently across different resources in Azure (the best example would be LogAnalytics, which stores the name of its Resource Group lowercased; we observed this across different subscriptions, so we assume it is true for everyone using Azure).
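To illustrate the parsing problem described above, here is a minimal Go sketch. It is not provider code: the helper name `resourceGroupFromID` and the subscription placeholder `000` are assumptions made for the example. It extracts the Resource Group name from an ID and shows that an exact comparison fails where a case-insensitive one succeeds.

```go
package main

import (
	"fmt"
	"strings"
)

// resourceGroupFromID returns the resource group name embedded in an ARM
// resource ID. The "resourceGroups" key is matched case-insensitively,
// because it can also appear as "resourcegroups".
// Hypothetical helper for illustration; not part of the AzureRM provider.
func resourceGroupFromID(id string) (string, bool) {
	segments := strings.Split(strings.Trim(id, "/"), "/")
	for i := 0; i+1 < len(segments); i++ {
		if strings.EqualFold(segments[i], "resourceGroups") {
			return segments[i+1], true
		}
	}
	return "", false
}

func main() {
	fromAPI := "/subscriptions/000/resourceGroups/Resourcegroup/providers/Microsoft.Sql/servers/sql-server"
	fromConfig := "/subscriptions/000/resourceGroups/ResourceGroup/providers/Microsoft.Sql/servers/sql-server"

	a, _ := resourceGroupFromID(fromAPI)
	b, _ := resourceGroupFromID(fromConfig)

	fmt.Println(a == b)                  // exact comparison: false
	fmt.Println(strings.EqualFold(a, b)) // case-insensitive comparison: true
}
```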

We found a workaround: manipulating Resource IDs by fixing their Resource Group names on the fly while performing the initial import. This reduced the number of operations to ~60 updates (there are still some properties stored on the Azure side which have inconsistent casing) and ~80 re-creations (some of the inconsistent properties stored on the Azure side can't be updated without re-creating the resource). Luckily, among the re-created resources there were none that we can't re-create (so we are taking the risk of increased downtime during deployment to Prod; this includes re-creation of these resources: WebTests, ActionGroups, Alerts, RoleAssignments, NSG Associations).
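The ID manipulation can be sketched roughly as follows. This is only an illustration of the idea in Go, not the script we actually ran: the function name `withResourceGroupCasing`, the subscription placeholder and the sample ID are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// withResourceGroupCasing rewrites the resource group segment of an ARM
// resource ID to the given canonical casing, but only when the existing
// segment already matches it case-insensitively. Hypothetical helper.
func withResourceGroupCasing(id, canonical string) string {
	segments := strings.Split(id, "/")
	for i := 0; i+1 < len(segments); i++ {
		if strings.EqualFold(segments[i], "resourceGroups") &&
			strings.EqualFold(segments[i+1], canonical) {
			segments[i+1] = canonical
			break
		}
	}
	return strings.Join(segments, "/")
}

func main() {
	// ID as a service might report it, with a lowercased resource group name.
	id := "/subscriptions/000/resourceGroups/resourcegroup/providers/Microsoft.OperationalInsights/workspaces/logs"
	fmt.Println(withResourceGroupCasing(id, "ResourceGroup"))
}
```

An ID that belongs to a different Resource Group is returned unchanged, so the rewrite cannot silently point an import at the wrong resource.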

Unfortunately, even after applying the workarounds and re-creating resources, we still observe an unsettled Terraform Plan. We suspect that Azure (a) refuses to update some properties if they differ only in casing and (b) decides on the casing of RG names for us, so neither we nor the AzureRM provider running infrastructure provisioning for us can influence this.

In one of the previous issues I saw a remark that we can't easily decide on case-sensitivity, and that we should be absolutely sure that this is how Azure behaves (so that it does not introduce issues for thousands of developers). I would like to leave this link to the documentation: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/resource-name-rules

If you would allow me, I would like to ask these questions (and use them to start a discussion):

(1) What is your view on the casing issues in general?

It seems that when AzureRM and Terraform are used from the very beginning of the life of a Resource Group, handling is much better. Numerous issues arise only when we migrate long-lived existing infrastructure.

(2) It seems that a tremendous amount of work is required to identify all case-insensitive properties and treat them as such. However, I spotted that some of the properties already use DiffSuppressFunc: suppress.CaseDifference, so I assume this work is already in progress. Do you have a plan to tackle all of the casing issues, or do you prefer going on a case-by-case basis?

(3) Would you accept contributions that make the properties causing trouble for us case-insensitive?
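For context on question (2), a case-insensitive diff suppression boils down to something like the sketch below. The signature is deliberately simplified and is an assumption: the Terraform plugin SDK callback also receives the resource data, which is left out here so the example compiles without external modules. The name `caseDifference` is likewise only illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// caseDifference reports whether the old and new values of an attribute
// differ only in letter case, in which case the diff can be suppressed.
// Simplified signature; the attribute key is accepted but unused.
func caseDifference(_, oldValue, newValue string) bool {
	return strings.EqualFold(oldValue, newValue)
}

func main() {
	oldID := ".../resourceGroups/Resourcegroup/providers/Microsoft.Web/serverfarms/app-service-plan"
	newID := ".../resourceGroups/ResourceGroup/providers/Microsoft.Web/serverfarms/app-service-plan"

	// Only the casing differs, so the diff would be suppressed.
	fmt.Println(caseDifference("service_plan_id", oldID, newID)) // true

	// A genuinely different plan must still show up as a change.
	fmt.Println(caseDifference("service_plan_id", oldID, ".../serverfarms/other-plan")) // false
}
```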

Important Factoids

No response

References

No response

egorshulga commented 1 year ago

While testing on another environment I also spotted other resources that do not settle after terraform apply:

  # azurerm_mssql_server_extended_auditing_policy.auditingSettings must be replaced
-/+ resource "azurerm_mssql_server_extended_auditing_policy" "auditingSettings" {
      ~ id                                      = "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Sql/servers/sql-server/extendedAuditingSettings/Default" -> (known after apply)
      ~ server_id                               = "/subscriptions/.../resourceGroups/Resourcegroup/providers/Microsoft.Sql/servers/sql-server" -> "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Sql/servers/sql-server" # forces replacement
        # (5 unchanged attributes hidden)
    }

  # module.houseProd.azurerm_linux_web_app.app_service will be updated in-place
  ~ resource "azurerm_linux_web_app" "app_service" {
        id                                = "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Web/sites/app-service"
        name                              = "app-service"
      ~ service_plan_id                   = "/subscriptions/.../resourceGroups/Resourcegroup/providers/Microsoft.Web/serverfarms/app-service-plan" -> "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Web/serverfarms/app-service-plan"
        tags                              = {}
        # (18 unchanged attributes hidden)

        # (3 unchanged blocks hidden)
    }

egorshulga commented 1 year ago

Oh, on yet another environment I also discovered this unsettling thing (luckily, that's the second to last env 😅)

  # azurerm_subnet_network_security_group_association.subnet1-nsg[0] must be replaced
-/+ resource "azurerm_subnet_network_security_group_association" "subnet1-nsg" {
      ~ id                        = "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Network/virtualNetworks/vnet/subnets/subnet1" -> (known after apply)
      ~ network_security_group_id = "/subscriptions/.../resourceGroups/Resourcegroup/providers/Microsoft.Network/networkSecurityGroups/nsg" -> "/subscriptions/.../resourceGroups/ResourceGroup/providers/Microsoft.Network/networkSecurityGroups/nsg" # forces replacement
        # (1 unchanged attribute hidden)
    }

ADBjester commented 1 year ago

Hello Egor and @tombuildsstuff. I'd like to report that we, too, are seeing this, and cannot resolve it with mere "matching case".

In our case, it's a NSG Association that is unsettled. Here's the PLAN (edited to anonymize, but to illustrate the issue):

  # azurerm_subnet_network_security_group_association.sql_subnet_assignment must be replaced
-/+ resource "azurerm_subnet_network_security_group_association" "sql_subnet_assignment" {
      ~ id = "/subscriptions/<id>/resourceGroups/SOME-Project-EUS2-DEV-RG/providers/Microsoft.Network/virtualNetworks
          /SOME-Project-EUS2-DEV-VNET_10.432.411.0_22/subnets/some-project-dev-eus2-sql-subnet" -> (known after apply)
      ~ subnet_id = "/subscriptions/<id>/resourceGroups/some-project-eus2-dev-rg/providers/Microsoft.Network
                 /virtualNetworks/SOME-Project-EUS2-DEV-VNET_10.432.411.0_22/subnets/some-project-dev-eus2-sql-subnet" 
                 -> "/subscriptions/<id>/resourceGroups/SOME-Project-EUS2-DEV-RG/providers/Microsoft.Network
                     /virtualNetworks/SOME-Project-EUS2-DEV-VNET_10.432.411.0_22/subnets/some-project-dev-eus2-sql-subnet" # forces replacement
        # (1 unchanged attribute hidden)
    }

Note that it wants to change the case of the resource group name in the subnet_id from all-lower-case to the ORIGINAL-Weird-CASING that our Cloud Enablement team created.

From: .../resourceGroups/some-project-eus2-dev-rg/providers
To: .../resourceGroups/SOME-Project-EUS2-DEV-RG/providers

This fails on apply (because the NSG is still associated with a SQL Managed Instance):

2023-06-14T21:28:07.7909957Z Error: removing Network Security Group Association from Subnet: 
(Name "some-project-dev-eus2-sql-subnet" / 
 Virtual Network Name "SOME-Project-EUS2-DEV-VNET_10.432.411.0_22" / 
 Resource Group "SOME-Project-EUS2-DEV-RG"): 
network.SubnetsClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: 
Code="NetworkSecurityGroupCannotBeRemovedDueToNipOnSubnet" 
Message="Network security group cannot be removed from subnet 
    /subscriptions/<id>/resourceGroups/SOME-Project-EUS2-DEV-RG/providers/Microsoft.Network
    /virtualNetworks/SOME-Project-EUS2-DEV-VNET_10.432.411.0_22/subnets/some-project-dev-eus2-sql-subnet 
because it has network intent policy 
    /subscriptions/<id>/resourceGroups/SOME-Project-EUS2-DEV-RG/providers/Microsoft.Network
    /networkIntentPolicies/mi_default_<guid_and_ip_address> applied." 
Details=[]
2023-06-14T21:28:07.7911487Z 
2023-06-14T21:28:07.8257005Z ##[error]Bash exited with code '1'.
2023-06-14T21:28:07.8269639Z ##[section]Finishing: Terraform Apply

Now, my position is that this ought not to be replaced (and that Egor's PR 21115 would have corrected this).

Now, I've edited directly into the state files, and determined that what we have stored is the CORRECT-Weird-CASING. I wish I could insert an image to show you, but you'll have to just take my word for it that the following is a copy/paste from the state file, straight out of Azure Blob Storage where it is stored (again, edited only for anonymity... but I didn't change the case of the RG name, other than to anonymize it):

    {
      "mode": "managed",
      "type": "azurerm_subnet",
      "name": "sql_subnet",
      "provider": "provider[\"registry.terraform.io/hashicorp/azurerm\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "address_prefixes": [
              "10.432.411.192/27"
            ],
            "delegation": [
              {
                "name": "sqlDelegation",
                "service_delegation": [
                  {
                    "actions": [
                      "Microsoft.Network/virtualNetworks/subnets/join/action",
                      "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
                      "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"
                    ],
                    "name": "Microsoft.Sql/managedInstances"
                  }
                ]
              }
            ],
            "enforce_private_link_endpoint_network_policies": true,
            "enforce_private_link_service_network_policies": false,
            "id": "/subscriptions/<id>/resourceGroups/SOME-Project-EUS2-DEV-RG/providers/Microsoft.Network
                   /virtualNetworks/SOME-Project-EUS2-DEV-VNET_10.432.411.0_22/subnets/some-project-dev-eus2-sql-subnet",
            "name": "some-project-dev-eus2-sql-subnet",
            "private_endpoint_network_policies_enabled": false,
            "private_link_service_network_policies_enabled": true,
            "resource_group_name": "SOME-Project-EUS2-DEV-RG",
            "service_endpoint_policy_ids": [],
            "service_endpoints": [
              "Microsoft.AzureActiveDirectory",
              "Microsoft.Storage"
            ],
            "timeouts": null,
            "virtual_network_name": "SOME-Project-EUS2-DEV-VNET_10.432.411.0_22"
          },
          "sensitive_attributes": [],
          "private": "<redacted>"
        }
      ]
    },

Note that the "id" of the sql_subnet is ALREADY in the CORRECT-Weird-CASING in the state file:

"id": "/subscriptions/<id>/resourceGroups/SOME-Project-EUS2-DEV-RG/providers....

I find it significant that this is not something that we control, but is actually reading directly out of state -- it is just one "field" as part of the id of the subnet involved. If we go visit Azure Portal -> Virtual Networks -> this VNet -> This Subnet, we cannot even see the resource id for the subnet. That's buried, assigned by ARM at the time of resource creation.... and it is CORRECT in the state file.

But when "plan" comes along, something as part of the plan process has lower-cased it, and the plan THINKS that it has to be right-cased via replacement.... which can't happen due to the SQL Managed Instance behind it.

We really don't want to have to destroy and re-create this managed instance. It would take DAYS, between provisioning the new managed instance, and restoring backups.... without a certainty of outcome.

Can you please revisit this bug from the above perspective?

Thanks!

Jeff Woods Azure Architect Oaks, PA

ADBjester commented 1 year ago

@tombuildsstuff @egorshulga I see #22070 was recently also submitted, in response to #20138. That PR might also resolve this issue, or at least lead in that general direction. Tom assigned @mbfrahry back to this just six hours ago, after the original PR requestor made some requested changes. Here's hoping....

ADBjester commented 1 year ago

I am wondering if a change like the following, in subnet_network_security_group_association_resource.go at around line 178, might resolve it. It is based on the PR in 22070, but in a different module. I don't have a Go compiler installed (and couldn't anyway due to company policy), and don't know Go well enough to be confident in my usage of EqualFold, but the essence should be the same: if the subnet ID originally passed in is identical to the returned one apart from case, just use the one passed in, which bypasses any case-comparison issues that might cause the resource to be regenerated. If, however, there are "significant" differences (i.e. differences beyond letter case), set the subnet ID to the new one (which WOULD indicate that it has been moved and needs a teardown and replacement).

    // d.Set("subnet_id", resp.ID)

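    // Keep the subnet ID already in state if the API value differs only by case.
    // Assumes resp.ID is a *string, as in the Azure SDK network models.
    existing := d.Get("subnet_id").(string)
    if resp.ID != nil && strings.EqualFold(existing, *resp.ID) {
        d.Set("subnet_id", existing)
    } else {
        d.Set("subnet_id", resp.ID)
    }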

    d.Set("network_security_group_id", securityGroup.ID)

    return nil
}
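
Since I can't compile against the provider, here is the same decision isolated as a standalone Go sketch that can be run on its own. The function name `preferStateCasing` and the sample IDs are made up for the illustration; it is not provider code.

```go
package main

import (
	"fmt"
	"strings"
)

// preferStateCasing returns the ID already stored in state when the ID
// returned by the API differs from it only by case; otherwise it returns
// the API value. Hypothetical helper for illustration.
func preferStateCasing(stateID, apiID string) string {
	if stateID != "" && strings.EqualFold(stateID, apiID) {
		return stateID
	}
	return apiID
}

func main() {
	state := "/subscriptions/000/resourceGroups/SOME-Project-EUS2-DEV-RG/providers/Microsoft.Network/virtualNetworks/vnet/subnets/sql"
	api := "/subscriptions/000/resourceGroups/some-project-eus2-dev-rg/providers/Microsoft.Network/virtualNetworks/vnet/subnets/sql"

	// Only the casing differs: the state value is kept, so no diff arises.
	fmt.Println(preferStateCasing(state, api) == state) // true

	// A different subnet: the API value wins and a replacement is planned.
	fmt.Println(preferStateCasing(state, "/subscriptions/000/resourceGroups/some-project-eus2-dev-rg/providers/Microsoft.Network/virtualNetworks/vnet/subnets/other"))
}
```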

ADBjester commented 1 year ago

@egorshulga FYI, I managed to get past my issue (I still believe it's a bug) with a judiciously-placed:

  lifecycle {
    ignore_changes = [
      subnet_id
    ]
  }

It means I can't ever, ever change the underlying subnet... but the presence of the SQL managed instance in it guarantees that anyway, since the MI applies the network intent policy to keep the subnet dedicated to nothing but the managed instance. So while it is still a bug in AzureRM, I can live with it, since I have no desire to tear down the SQL managed instance.

Still hope HashiCorp fixes it. In the meantime, you can probably circumvent casing issues (at least, resource group names embedded in resource ID paths) in the same manner.

Jeff Woods Azure Architect Oaks, PA

egorshulga commented 1 year ago

We were quite lucky: the unsettled resources were more or less transient and there were no data resources, so we went the way of dropping and recreating them. For example, for NSG Associations we recreated the corresponding NSGs.

And that was the biggest risk in our migration to Terraform. We kept our eyes on the stuff we were recreating, so that we wouldn't drop something that we must not drop.

eehret commented 8 months ago

This is happening to me too right now, on an azurerm_private_endpoint, seemingly because the resource ID of the azurerm_storage_account it refers to in private_connection_resource_id is being returned in a different case than the state contains. I've sometimes been able to work around this by removing and reimporting resources with the altered case, but in this case I can't do that.

The only way I can see to "fix" this right now is to do an ugly case conversion on the fly.

DhanushBL commented 8 months ago

@eehret I have the exact same issue with private endpoints for my function apps during DNS zone mapping. Here the ID keeps changing even after multiple applies. I'm tired of making these changes, so as a last resort I added ignore_changes, and I know this is the worst option because I can't change the DNS zone later for the private endpoints.

Please provide some proper workarounds for this. Why are Azure and Terraform such a bad combination? So many things are outdated: for example, for LogicApps API Connections kind 2 the latest API version is 2016, and there are many others like that.

PrakashRajanSakthivel commented 1 month ago

I have started facing this issue while using the front door. The only change is

"/subscriptions/sub_id/resourceGroups/resource_group_name/providers/Microsoft.Cdn/profiles/resource_name"
"/subscriptions/sub_id/resourcegroups/resource_group_name/providers/Microsoft.Cdn/profiles/resource_name"

and every time it's trying to replace the origin groups, origins and endpoints.