hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

azurerm_role_definition not fully finished before assignment #19916

Open sterol opened 1 year ago

sterol commented 1 year ago

Terraform Version

1.3.4

AzureRM Provider Version

3.35.0

Affected Resource(s)/Data Source(s)

azurerm_role_definition

Terraform Configuration Files

resource "azurerm_role_definition" "user_role" {
  # id, role_definition_id and role_definition_resource_id are computed
  # (shown as "known after apply" in the plan).
  name        = "role 1"
  scope       = "/subscriptions/00000000-0000-0000-0000-000000000000"
  description = "something"

  assignable_scopes = [
    "/subscriptions/00000000-0000-0000-0000-000000000000",
  ]

  permissions {
    actions     = ["Microsoft.Resources/*/action"]
    not_actions = []
  }
}

resource "azurerm_role_assignment" "user_role_assignment" {
  # id, name, principal_type, role_definition_id and
  # skip_service_principal_aad_check are shown as "known after apply" in the plan.
  principal_id         = "00000000-0000-0000-0000-000000000000"
  role_definition_name = "role 1"
  scope                = "/subscriptions/00000000-0000-0000-0000-000000000000"
}

Debug Output/Panic Output

The following error occurs while running terraform apply:
"Error: loading Role Definition List: could not find role 'user_role'
with azurerm_role_assignment.user_role_assignment ..."

Expected Behaviour

This error happens only intermittently, when I apply the role definition and the role assignment in the same deployment. Running tf apply again immediately afterwards resolves the issue: the role definition is available by then and the assignment succeeds. It seems to be a timing issue. The role definition creation returns too early for the subsequent role assignment, which then fails with the error above. The Azure dashboard even shows that the role definition exists after the first tf apply has finished.

Actual Behaviour

The role assignment randomly fails when the role definition has not yet fully finished on the Azure side.

Steps to Reproduce

Execute tf apply for a plan containing a role definition and an assignment of the newly created role.

Important Factoids

No response

References

The issue is already known, see #10602. It was suggested to open a new one and add reference to the former.

epiHATR commented 1 year ago

I faced the same problem, but with a Virtual Network resource, and I don't think it is related to Terraform itself.

Resources are created by calling the Azure API, whether through the CLI or a deployment file, and Terraform retrieves them with just another API call, which may be served from a different API region. A resource may not be available in all regions right after it is created on Azure.

In this case, I suggest adding a time_sleep resource that delays whatever depends on the newly created resource, like:

resource "time_sleep" "wait_1_minute" {
  depends_on = [....previous]
  create_duration = "60s"
}

liuwuliuyun commented 1 year ago

I agree with @epiHATR about the root cause. For reference, see the depends_on meta-argument: I believe you could use it to enforce an execution order like azurerm_role_definition -> time_sleep -> azurerm_role_assignment.
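
A minimal sketch of that chain, assuming the hashicorp/time provider is available. The resource values are the placeholders from the original report, the name wait_for_role_definition is made up for illustration, and the 60s delay is an arbitrary guess rather than a documented propagation time:

```hcl
# Sketch only: placeholder IDs copied from the issue, delay chosen arbitrarily.
resource "azurerm_role_definition" "user_role" {
  name        = "role 1"
  scope       = "/subscriptions/00000000-0000-0000-0000-000000000000"
  description = "something"

  assignable_scopes = [
    "/subscriptions/00000000-0000-0000-0000-000000000000",
  ]

  permissions {
    actions     = ["Microsoft.Resources/*/action"]
    not_actions = []
  }
}

# Hypothetical name; pauses after the role definition has been created.
resource "time_sleep" "wait_for_role_definition" {
  depends_on      = [azurerm_role_definition.user_role]
  create_duration = "60s"
}

resource "azurerm_role_assignment" "user_role_assignment" {
  scope        = "/subscriptions/00000000-0000-0000-0000-000000000000"
  principal_id = "00000000-0000-0000-0000-000000000000"

  # Referencing the definition by ID also gives Terraform an implicit dependency.
  role_definition_id = azurerm_role_definition.user_role.role_definition_resource_id

  # Explicit ordering: the assignment starts only after the sleep completes.
  depends_on = [time_sleep.wait_for_role_definition]
}
```

A fixed delay only makes the race less likely. It does not guarantee that the role definition is visible when the assignment runs, so the duration may need tuning.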

freewilll commented 2 months ago

I found that when creating and deleting role definitions, the Azure CLI inconsistently returns a 404 or a 200 after the create/delete.

I have been running terraform (with TF_LOG=DEBUG), while watching the CLI with something like:

watch -n 0.2 'az role definition list --name test'

What I see is randomly alternating responses that either are [] or [{... role definition JSON ... }]. At the same time, I see the HTTP responses from the terraform poll requests come back as either 404 or 200.

When I create a role definition with terraform, the create API returns a 200 and terraform considers the creation done. However, the inconsistent Azure API responses then lead to random failures when trying to create resources that depend on the role definition.

When a deletion is done, the delete API similarly returns a 200 straight away. However, this time, this code kicks in:

Refresh:                   roleDefinitionDeleteStateRefreshFunc(ctx, client, id),
MinTimeout:                10 * time.Second,
ContinuousTargetOccurence: 20,
Timeout:                   time.Until(deadline),

The above code results in terraform waiting until it sees 20 consecutive 404s, polled at least 10 seconds apart, i.e. consistent 404s for about 200 seconds. In my experience this leads to random waits of around 3, 7, 9 or 12 minutes.

Based on this, I can attribute no fault to terraform. It is unfortunate that the Azure APIs cannot return consistent results.