Open alicyn opened 1 year ago
@alicyn thank you for opening this issue. This repo focuses on resolving bugs and feature requests for the Terraform AzureRM provider; Terraform's integration with CI/CD (e.g. GitHub Actions) and Terraform's storing of state files in remote locations may be out of scope here. Could you please provide the minimum required steps to reproduce the state-file-containing-no-resource-info issue? By "minimum required" I mean your native Terraform config files, executed locally, with the CI/CD (GitHub Actions) and remote state storage factors removed.
If the issue still reproduces after removing the distractors above, we can help take a further look. Otherwise, if it eventually turns out to be a GitHub Actions configuration issue or another broader usage issue, this repo might not be the right place to ask, and I would suggest filing the issue in the community forum instead.
Hello, any progress on this issue? We have been noticing this as well. If the initial deployment fails at any point, the state file appears empty even though a large number of resources were created. This does not appear to be a problem after the first deployment succeeds.
@mybayern1974 This does appear to be a legitimate issue, not related to GitHub Actions.
Do you happen to have any insight into whose responsibility it is to write the resources to the state file? Would this be the provider or Terraform itself?
@justinmchase, this symptom might be regarded as similar to #24517. First, to answer your question directly: it is the AzureRM provider that controls writing back to the state file. The current design of this provider is that, for most resources, as long as the Azure resource provisioning API returns an error, the provider writes nothing back to the state file, on the assumption that resource creation failed. The provider has no way to tell whether that resource would eventually be "silently" provisioned successfully, or partially successfully.
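For reference, one way to tell whether a resource was in fact "silently" provisioned despite the API error is to compare what Azure reports against what Terraform recorded; a minimal sketch (the resource group name `example` is just an illustration):

```shell
# Did the resource group actually get created, even though the apply errored?
az group exists --name example          # prints "true" or "false"

# Compare what exists in Azure vs. what Terraform recorded in state:
az resource list --resource-group example --output table
terraform state list
```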
Given the above facts about this provider, my 2 cents: I can intuitively think of 4 situations:
Currently this provider is able to handle (1) and (2).
(4) should be an API bug that needs to be addressed by the upstream cloud service, so even if this provider behaves abnormally because of it, the fix should not come from this provider.
(3) might be the case you are mentioning, and currently this provider handles it as (2). There is a PR #24520 trying to mitigate the similar issue #24517. The intent of that PR is to always write back to the state file even when the API returns an error (though I feel that PR could be tuned a bit to double-check that the resource exists before writing it back to state), and meanwhile to mark the recorded state of that resource as tainted, then leave how to follow up with that damaged resource to the end user. The Terraform docs have more on the taint story.
How the above PR #24520 moves forward depends on the HashiCorp maintainers' opinion, so I suggest subscribing to that thread. Given this could be a cross-provider topic, how that PR moves forward could become a guide for responding to other similar issues, including this one.
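(For reference, the taint workflow mentioned above looks roughly like this; `null_resource.example` is an illustrative resource address, and on Terraform 0.15.2+ the `-replace` flag supersedes the standalone `taint` command.)

```shell
# Mark the damaged resource so the next apply recreates it:
terraform taint null_resource.example
terraform apply

# Terraform 0.15.2+ equivalent, without a separate state-editing step:
terraform apply -replace="null_resource.example"
```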
My backend in this case is the default local file system for reference.
I haven't done a full root-cause analysis on this yet, so it's hard to say, but I will add that in our case we have resources from other providers (kubernetes, null_resource) that are failing when the issue occurs. We also have Azure resources which are applied first, but when a later error happens they appear to not be recorded in the state file, and then the second deployment fails because the Azure resources already exist.
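For anyone hitting the "already exists" error on the second deployment: assuming the resource really was created but never recorded, the usual workaround is to adopt it into state with `terraform import` (a sketch; the resource address and the subscription ID placeholder are illustrative):

```shell
# Adopt the orphaned Azure resource into the existing state,
# so the next apply no longer tries to create it:
terraform import azurerm_resource_group.example \
  /subscriptions/<subscription-id>/resourceGroups/example
```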
I'll try to come up with a minimal repro but something like below is an example of what I'm talking about above.
```hcl
resource "azurerm_resource_group" "example" {
  name     = "example"
  location = "centralus"
}

resource "null_resource" "example" {
  depends_on = [azurerm_resource_group.example]

  triggers = {
    always_run = "${timestamp()}"
  }

  provisioner "local-exec" {
    command = <<-EOF
      set -eo pipefail
      exit 1
    EOF
  }
}
```
Now, when I say "later" and "depends_on", I'm not 100% sure about that; maybe they run in parallel. I will be looking closely at point (4) in your analysis, because I'm wondering whether, since these resources run in parallel with some Azure resources, a failure may cause all of the Azure resources to not be serialized to state. I wonder if applying all of one provider, then all of another, without any parallel overlap, would help?
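One way to test the serialization hypothesis above (a sketch, not a confirmed fix) is to force Terraform to walk the graph one node at a time, or to stage the apply per provider with `-target` (the address below is illustrative):

```shell
# Disable parallel resource operations entirely:
terraform apply -parallelism=1

# Or apply the Azure resources first, then everything else:
terraform apply -target=azurerm_resource_group.example
terraform apply
```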
Is there an existing issue for this?
Community Note
Terraform Version
1.4.6
AzureRM Provider Version
3.30.0
Affected Resource(s)/Data Source(s)
backend
Terraform Configuration Files
And the subsequent Terraform Destroy GitHub Actions Workflow
Debug Output/Panic Output
Output from GitHub Actions Run:
Expected Behaviour
Upon executing Terraform Plan and Apply, Terraform should write a terraform.tfstate file to the Azure storage blob at the specified path `aci/storydashboard/terraform.tfstate`, and this file should contain a long list of objects tracking the objects created in Azure (in this case, an application gateway, Azure container instance group, Log Analytics workspace, public IP, etc.). When Terraform Destroy runs after a successful Terraform Apply, it should read a valid state file from the remote backend and produce the following plan:
Actual Behaviour
Terraform writes the following terraform.tfstate file to the Azure storage blob which has an empty resources list:
Therefore, when Terraform Destroy runs on GitHub Actions, it reports that there are no objects, and nothing gets deleted. It's important to note that the incorrect/empty Terraform state problems only started happening recently, about 2-3 weeks ago.
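As a quick sanity check of this symptom, the `resources` array in the downloaded state JSON can be inspected directly; a minimal sketch (the file name and contents here are illustrative, mimicking the empty state Terraform wrote):

```python
import json

def state_resource_count(path: str) -> int:
    """Return how many resources a Terraform state file records."""
    with open(path) as f:
        state = json.load(f)
    # A healthy state after apply lists each managed object here;
    # the symptom in this issue is an empty "resources" array.
    return len(state.get("resources", []))

# Simulate the empty state file described above:
empty_state = {"version": 4, "terraform_version": "1.4.6", "resources": []}
with open("terraform.tfstate", "w") as f:
    json.dump(empty_state, f)

print(state_resource_count("terraform.tfstate"))  # prints 0 -> nothing to destroy
```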
Steps to Reproduce
This issue appears to occur intermittently. See above configuration.
Important Factoids
Running Terraform in CI Automation
References
No response