Status: Open. sterol opened this issue 1 year ago.
I faced the same problem, but with a Virtual Network resource, and I don't think it's related to Terraform itself.
Resources are created by calling the Azure API, either through the CLI or through a deployment template, and Terraform reads them back with another API call that may be served from a different API region. A newly created resource may therefore not be visible in every region immediately after Azure reports it as created.
In this case, I suggest adding a time_sleep resource (a resource that only waits) that is created after the role assignment resources, like:
resource "time_sleep" "wait_1_minute" {
  # "....previous" stands for the resources to wait on (elided in the original comment)
  depends_on      = [....previous]
  create_duration = "60s"
}
I agree with @epiHATR about the root cause. Adding a reference to depends_on: I believe you could use the depends_on meta-argument to enforce an execution order like azurerm_role_definition -> time_sleep -> azurerm_role_assignment.
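A minimal sketch of that ordering, assuming azurerm provider 3.x and the hashicorp/time provider. The role name, the single read permission, the subscription scope and the use of the current client as principal are hypothetical choices for illustration only:

```hcl
terraform {
  required_providers {
    azurerm = { source = "hashicorp/azurerm", version = "~> 3.35" }
    time    = { source = "hashicorp/time" }
  }
}

provider "azurerm" {
  features {}
}

data "azurerm_subscription" "primary" {}
data "azurerm_client_config" "current" {}

# Step 1: the custom role (name and permission are hypothetical).
resource "azurerm_role_definition" "example" {
  name  = "example-custom-role"
  scope = data.azurerm_subscription.primary.id

  permissions {
    actions     = ["Microsoft.Resources/subscriptions/resourceGroups/read"]
    not_actions = []
  }

  assignable_scopes = [data.azurerm_subscription.primary.id]
}

# Step 2: wait after the role definition has been created.
resource "time_sleep" "wait_for_role_definition" {
  depends_on      = [azurerm_role_definition.example]
  create_duration = "60s"
}

# Step 3: assign the role only after the wait has finished.
resource "azurerm_role_assignment" "example" {
  scope              = data.azurerm_subscription.primary.id
  role_definition_id = azurerm_role_definition.example.role_definition_resource_id
  principal_id       = data.azurerm_client_config.current.object_id

  depends_on = [time_sleep.wait_for_role_definition]
}
```

The assignment already references the definition, so Terraform orders it after the definition anyway; the explicit depends_on on time_sleep is what inserts the delay in between. A fixed sleep only makes the race less likely, it does not guarantee that every API region is consistent after 60 seconds.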
I found that after creating or deleting a role definition, the Azure CLI inconsistently returns a 404 or a 200.
I have been running Terraform (with TF_LOG=DEBUG) while watching the CLI with something like:
watch -n 0.2 'az role definition list --name test'
What I see are randomly alternating responses that are either [] or [{... role definition JSON ... }]. At the same time, I see the HTTP responses to Terraform's poll requests come back as either 404 or 200.
When I create a role definition with Terraform, the create API returns a 200 and Terraform considers the creation done. However, because the subsequent Azure API responses are random, resources that depend on the role definition fail at random when Terraform tries to create them.
When a role definition is deleted, the delete API similarly returns a 200 straight away. This time, however, the following code in the provider kicks in:
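Refresh:                   roleDefinitionDeleteStateRefreshFunc(ctx, client, id), // polls the role definition by ID
MinTimeout:                10 * time.Second,                                     // smallest wait between refreshes
ContinuousTargetOccurence: 20,                                                   // target (deleted) state must be seen 20 times in a row
Timeout:                   time.Until(deadline),                                 // give up at the resource's delete deadline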
With this configuration, Terraform waits until it sees 20 consecutive 404s, polled at least 10 seconds apart, i.e. consistent 404s for about 200 seconds; a single 200 in between starts the count over. In my experience this leads to random waits of around 3, 7, 9 or 12 minutes.
Based on this, I can't attribute the fault to Terraform. The problem is that the Azure APIs do not return consistent results.
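A rough worked estimate of why these waits vary so much, under a simplifying assumption that is not from the thread: each poll independently returns 404 with some probability q, and polls are spaced 10 s apart as MinTimeout suggests. With n = 20 required consecutive 404s, the minimum wait and the expected number of polls N until the first run of n 404s are:

```latex
\begin{aligned}
t_{\min} &\approx n \cdot 10\,\mathrm{s} = 20 \cdot 10\,\mathrm{s} = 200\,\mathrm{s} \\
\mathbb{E}[N] &= \frac{1 - q^{n}}{(1 - q)\, q^{n}}
\end{aligned}
```

For a purely hypothetical q = 0.9, q^20 is about 0.122, which gives E[N] of about 72 polls, or roughly 12 minutes at 10 s per poll. So even when most polls already return 404, waits of several minutes are to be expected, and the spread between runs is large.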
Is there an existing issue for this?
Community Note
Terraform Version
1.3.4
AzureRM Provider Version
3.35.0
Affected Resource(s)/Data Source(s)
azurerm_role_definition
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
This error happens only randomly when I apply a role definition and a role assignment in the same deployment. Running tf apply again immediately resolves the issue: the role definition is now available and the assignment succeeds. It seems to be a timing issue. The role_definition returns too early for the subsequent role_assignment, which then fails with the mentioned error. Even the Azure dashboard shows that the role definition exists after the first tf apply has finished.
Actual Behaviour
The role assignment randomly fails when the role definition has not yet fully finished being created on the Azure side.
Steps to Reproduce
Run tf apply on a plan that contains a role definition and an assignment of the newly created role.
Important Factoids
No response
References
The issue is already known; see #10602. It was suggested to open a new issue and reference the former one.