hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Terraform incorrectly creates empty state file on azurerm backend #22518

Open · alicyn opened 1 year ago

alicyn commented 1 year ago

Terraform Version

1.4.6

AzureRM Provider Version

3.30.0

Affected Resource(s)/Data Source(s)

backend

Terraform Configuration Files

backend.tf:

terraform {
  backend "azurerm" {
    resource_group_name  = "RG_Name"
    storage_account_name = "storage_account_name"
    container_name       = "tfstate"
    key                  = "aci/storydashboard/terraform.tfstate"
  }
}
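
For debugging, the state held at that key can be pulled through Terraform or downloaded directly with the Azure CLI (a sketch reusing the placeholder names from the block above):

# from the initialized working directory
terraform state pull > pulled.tfstate

# or fetch the blob itself
az storage blob download \
  --account-name storage_account_name \
  --container-name tfstate \
  --name aci/storydashboard/terraform.tfstate \
  --file downloaded.tfstate \
  --auth-mode login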

Terraform Apply GitHub Actions Workflow:

permissions:
  id-token: write # required to use OIDC authentication
  contents: read # required to checkout the code from the repo
  pull-requests: write # required for posting comments to issues

name: deploy

on:
  pull_request:
    types: [ labeled ]

jobs:
  deploy-aci:
    name: deploy
    runs-on: ubuntu-latest
    environment: deployments
    env:
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
      ARM_USE_OIDC: true
    defaults:
      run:
        working-directory: aci/terraform/azure-container-instances

    steps:
    - name: Checkout aci terraform repo
      uses: actions/checkout@v3
      with:
        repository: myorg/azure-feature-environments
        path: aci
        token: ${{ secrets.ACI_REPOSITORY_ACCESS_TOKEN }}

    - name: Setup terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: 1.4.6

    - name: Terraform Init
      id: init
      run: terraform init -input=false

    - name: Terraform Validate
      id: validate
      run: terraform validate -no-color

    - name: Terraform Plan
      id: plan
      run: terraform plan -lock=false -out=tf_apply.plan -no-color -input=false

    - name: Terraform Apply
      run: terraform apply -lock=false -input=false "tf_apply.plan" 

And the subsequent Terraform Destroy GitHub Actions Workflow:

permissions:
  id-token: write # required to use OIDC authentication
  contents: read # required to checkout the code from the repo

name: destroy

on:
  schedule:
    # * is a special character in YAML so you have to quote this string
    - cron: '0 0 * * *' # midnight UTC --> 8 PM Eastern

jobs:

  destroy:
    runs-on: ubuntu-22.04
    environment: deployments
    env:
      ARM_CLIENT_ID: ${{ secrets.ARM_CLIENT_ID }}
      ARM_SUBSCRIPTION_ID: ${{ secrets.ARM_SUBSCRIPTION_ID }}
      ARM_TENANT_ID: ${{ secrets.ARM_TENANT_ID }}
      ARM_USE_OIDC: true
    defaults:
      run:
        working-directory: aci/terraform/azure-container-instances

    steps: 
    - name: Checkout aci terraform repo
      uses: actions/checkout@v3
      with:
        repository: myorg/azure-feature-environments
        path: aci
        token: ${{ secrets.ACI_REPOSITORY_ACCESS_TOKEN }}

    - name: Setup terraform
      uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: 1.4.6

    - name: Terraform Init
      id: init
      run: terraform init -input=false

    - name: Terraform Validate
      id: validate
      run: terraform validate -no-color

    - name: Terraform Plan Destroy
      id: plan
      run: terraform plan -lock=false -destroy -out=tf_destroy.plan -no-color -input=false

    - name: Terraform Destroy
      run: terraform apply -lock=false "tf_destroy.plan"
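
For what it's worth, a diagnostic step placed before the destroy plan could dump what the backend actually holds at that point, e.g. by running (a sketch; jq is preinstalled on GitHub-hosted Ubuntu runners):

terraform state pull | jq '{serial, lineage, resource_count: (.resources | length)}'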

Debug Output/Panic Output

Output from GitHub Actions Run:

Run terraform init -input=false
  terraform init -input=false
  shell: /usr/bin/bash -e {0}
  env:
    ARM_CLIENT_ID: ***
    ARM_SUBSCRIPTION_ID: ***
    ARM_TENANT_ID: ***
    ARM_USE_OIDC: true
    TF_VAR_branch_name: storydashboard
    TERRAFORM_CLI_PATH: /home/runner/work/_temp/8f9a3950-8c88-45b3-a781-dfc7cf888c56
/home/runner/work/_temp/8f9a3950-8c88-45b3-a781-dfc7cf888c56/terraform-bin init -input=false

Initializing the backend...

Successfully configured the backend "azurerm"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- terraform.io/builtin/terraform is built in to Terraform
- Finding hashicorp/azurerm versions matching "3.30.0"...
- Finding hashicorp/azuread versions matching "~> 2.0"...
- Installing hashicorp/azurerm v3.30.0...
- Installed hashicorp/azurerm v3.30.0 (signed by HashiCorp)
- Installing hashicorp/azuread v2.39.0...
- Installed hashicorp/azuread v2.39.0 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!
------
Run terraform plan -lock=false -destroy -out=tf_destroy.plan -no-color -input=false
  terraform plan -lock=false -destroy -out=tf_destroy.plan -no-color -input=false
  shell: /usr/bin/bash -e {0}
  env:
    ARM_CLIENT_ID: ***
    ARM_SUBSCRIPTION_ID: ***
    ARM_TENANT_ID: ***
    ARM_USE_OIDC: true
    TF_VAR_branch_name: storydashboard
    TERRAFORM_CLI_PATH: /home/runner/work/_temp/8f9a3950-8c88-45b3-a781-dfc7cf888c56
/home/runner/work/_temp/8f9a3950-8c88-45b3-a781-dfc7cf888c56/terraform-bin plan -lock=false -destroy -out=tf_destroy.plan -no-color -input=false

No changes. No objects need to be destroyed.

Either you have not created any objects yet or the existing objects were
already deleted outside of Terraform.

Expected Behaviour

Upon executing Terraform Plan and Apply, Terraform should write a terraform.tfstate file to the Azure storage blob at the specified path aci/storydashboard/terraform.tfstate, and this file should contain entries tracking the objects created in Azure (in this case, an application gateway, an Azure container instance group, a log analytics workspace, a public IP, etc.).

When Terraform Destroy runs after a successful Terraform Apply, it should read a valid state file from the remote backend and produce the following plan:

Plan: 0 to add, 0 to change, 7 to destroy.

Changes to Outputs:
  - appgateway_fqdn                 = "https://ops-example-preview.mydomain.net/" -> null
  - appgateway_public_ip            = "redacted" -> null
  - container_group_ip              = "redacted" -> null
  - current_client_config           = (sensitive value) -> null
  - env_vars                        = (sensitive value) -> null
  - git_branch_name                 = "ops-example-preview" -> null
  - secret_value                    = (sensitive value) -> null
  - storage_access_key_secret_value = (sensitive value) -> null

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tf_destroy.plan

To perform exactly these actions, run the following command to apply:
    terraform apply "tf_destroy.plan"`

Actual Behaviour

Terraform writes the following terraform.tfstate file, with an empty resources list, to the Azure storage blob:

{
  "version": 4,
  "terraform_version": "1.4.6",
  "serial": 6,
  "lineage": "8c2dfd7d-f7ad-dd39-55aa-b125e65f2b7e",
  "outputs": {},
  "resources": [],
  "check_results": null
}

Therefore, when Terraform Destroy runs on GitHub Actions, it reports that there are no objects, and nothing gets deleted. It's important to note that the incorrect/empty Terraform state problems only started happening recently, about 2-3 weeks ago.
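
Since the state's serial increments on every write, a nonempty earlier version of the blob may still exist. If blob versioning is enabled on the storage account (an assumption; it is not enabled by default), something like this can list the prior versions (a sketch reusing the names from the backend block):

az storage blob list \
  --account-name storage_account_name \
  --container-name tfstate \
  --prefix aci/storydashboard/ \
  --include v \
  --auth-mode login \
  --output table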

Steps to Reproduce

This issue appears to occur intermittently. See above configuration.

Important Factoids

Running Terraform in CI Automation

References

No response

mybayern1974 commented 1 year ago

@alicyn thank you for opening this issue. Given that this repo focuses on resolving bugs and feature requests for the Terraform AzureRM provider, handling Terraform's integration with CI/CD (say, GitHub Actions) and Terraform's storage of state files in remote locations may be out of scope for this repo. Could you please provide the minimum steps required to reproduce the state-file-containing-no-resource-info issue? By "minimum required" I mean your native Terraform config files, executed locally, with the CI/CD (GitHub Actions) and Terraform remote storage factors removed.

If the issue still reproduces after removing the distractors above, then we can help take a further look. Otherwise, if for example it can eventually be confirmed to be a GitHub Actions config issue or another broader usage issue, then this repo might not be the right place to ask, and I would suggest filing the issue in the community forum instead.

justinmchase commented 7 months ago

Hello, any progress on this issue? We have been noticing this as well. If the initial deployment fails at any point, the state file appears empty even if a ton of resources were created. This does not appear to be a problem after the first deployment succeeds.

justinmchase commented 7 months ago

@mybayern1974 This does appear to be a legit issue, not related to GitHub Actions.

Do you happen to have any insight into whose responsibility it would be to write the resources to the state file? Would this be the provider or Terraform itself?

mybayern1974 commented 7 months ago

@justinmchase, this symptom might be regarded as similar to #24517. Firstly, to directly answer your question: it's the AzureRM provider that controls writing back to the state file. The current design of this provider is, for most resources: as long as the Azure resource provisioning API returns an error, the provider writes nothing back to the state file, on the assumption that resource creation failed. The provider has no way to tell whether that resource was eventually "silently" provisioned successfully, or partially successfully.

With the above facts about this provider in mind, my 2 cents: I can think of 4 situations:

  1. Backend resource creation was successful, and its API returns a success to the provider;
  2. Backend resource creation failed completely, with nothing provisioned, and its API returns an error to the provider;
  3. Backend resource creation failed but left a non-functional resource entity in the backend, and its API returns an error to the provider. To remedy that, users need to recreate the resource or call an update action on the existing non-functional resource;
  4. Backend resource creation eventually succeeds asynchronously, while its API returns an error to the provider.

Currently this provider is able to handle (1) and (2). (4) would be an API bug that needs to be addressed by the upstream cloud service, so even if this provider behaves abnormally because of it, the fix should not live in this provider. (3) might be the case you are mentioning, and currently this provider handles it as (2). There is a PR #24520 trying to mitigate the similar issue #24517. The intent of that PR is to always write back to the state file even when the API returns an error (though I feel that PR could be tuned a bit to double-check that the resource exists before writing it back to state), and meanwhile to mark the recorded state of that resource as tainted, leaving how to follow up with that damaged resource to end users. The Terraform documentation has more on the tainted-resource story. How PR #24520 moves forward depends on the HashiCorp maintainers' opinion, so I suggest you subscribe to that thread. Given this could be a cross-provider topic, how that PR moves forward could become a guide for responding to other similar issues, including this one.
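
As a hedged illustration of the manual follow-up for (3) and (4) (the resource address and Azure resource ID below are placeholders, not taken from this issue):

# Case (4): the resource exists in Azure but is missing from state -- re-adopt it
terraform import azurerm_resource_group.example \
  /subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example

# Case (3): the resource is tracked in state but non-functional -- force recreation
terraform apply -replace=azurerm_resource_group.example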

justinmchase commented 7 months ago

For reference, my backend in this case is the default local filesystem.

I haven't done a full root cause analysis on this yet, so it's hard to say, but I will add that in our cases, resources from other providers (kubernetes, null_resource) are failing when the issue is produced. We also have Azure resources which are applied first; when a later error happens, they appear to not be recorded in the state file, and then the second deployment fails because the Azure resources already exist.

I'll try to come up with a minimal repro, but something like the below is an example of what I'm talking about.

resource "azurerm_resource_group" "example" {
  name     = "example"
  location = "centralus"
}

resource "null_resource" "example" {
  depends_on = [azurerm_resource_group.example]
  triggers = {
    always_run = "${timestamp()}"
  }

  provisioner "local-exec" {
    command = <<-EOF
      set -eo pipefail
      exit 1
    EOF
  }
}
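
To confirm the symptom, after the failed apply I would check whether the resource group made it into state (a sketch):

terraform init
terraform apply -auto-approve   # fails on null_resource.example
terraform state list            # if the bug reproduces, azurerm_resource_group.example is missing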

Now, when I say "later" and "depends_on", I'm not 100% sure about that; maybe they run in parallel. I will be looking at point (4) in your analysis closely, because I'm wondering whether these resources running in parallel with some Azure resources may cause all of the Azure resources to not serialize to state. I wonder if applying all of one provider's resources, then all of another's, without any parallel overlap, would help?