vmware / terraform-provider-vcf

Terraform Provider for VMware Cloud Foundation
https://registry.terraform.io/providers/vmware/vcf/
Mozilla Public License 2.0

When removing a host from a cluster and decommissioning it in the same change, the decommission is attempted first and fails #126

Closed: simeon-aladjem closed this issue 5 months ago

simeon-aladjem commented 9 months ago


Terraform

v1.7.3

Terraform Provider

v0.8.1

VMware Cloud Foundation

4.5.2

Description

I have a TF file with 1 domain, 2 clusters, 4 hosts in cluster #1, and 3 hosts in cluster #2. I remove a host from cluster #1 and also remove the host resource itself.

The plan looks like this:

  # vcf_domain.wld1 will be updated in-place
  ~ resource "vcf_domain" "wld1" {
        id                       = "dc34fa4e-0fc3-49a7-83bf-b95876528936"
        name                     = "sfo-w01-vc01"
        # (5 unchanged attributes hidden)

      ~ cluster {
            id                        = "a9263830-fd6d-41e0-b2a3-add395f39c68"
            name                      = "sfo-w01-cl01"
            # (6 unchanged attributes hidden)

          ~ host {
              ~ id          = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> "f926f406-2d1d-4ddd-baef-d5e39866376f"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "f926f406-2d1d-4ddd-baef-d5e39866376f" -> "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          ~ host {
              ~ id          = "3e5dcf58-ab0b-4a8f-8ec0-8b8fd3ba8c87" -> "2da8538e-9434-438d-94bd-aabe4ef1fbdb"
                # (1 unchanged attribute hidden)

                # (2 unchanged blocks hidden)
            }
          - host {
              - id          = "2da8538e-9434-438d-94bd-aabe4ef1fbdb" -> null
              - license_key = (sensitive value) -> null

              - vmnic {
                  - id       = "vmnic0" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
              - vmnic {
                  - id       = "vmnic1" -> null
                  - vds_name = "sfo-w01-cl01-vds01" -> null
                }
            }

            # (2 unchanged blocks hidden)
        }

        # (2 unchanged blocks hidden)
    }

  # vcf_host.host5 will be destroyed
  # (because vcf_host.host5 is not in configuration)
  - resource "vcf_host" "host5" {
      - fqdn            = "esxi-5.vrack.vsphere.local" -> null
      - id              = "f1668aa8-ffa4-4351-a61f-4248b98196bd" -> null
      - network_pool_id = "b9dc86fd-7074-4adf-b23a-02be8a7c8962" -> null
      - password        = (sensitive value) -> null
      - status          = "ASSIGNED" -> null
      - storage_type    = "VSAN" -> null
      - username        = "root" -> null
    }

The host resource removal (decommissioning) is attempted first, and fails because the host has not yet been removed from the cluster.

vcf_host.host5: Destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd]
vcf_host.host5: Still destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd, 10s elapsed]
vcf_host.host5: Still destroying... [id=f1668aa8-ffa4-4351-a61f-4248b98196bd, 20s elapsed]
│
│ Error: Task with ID = dfc83a78-69c5-4788-b941-ca11b8b42c5e , Name: "Decommissioning host(s) esxi-5.vrack.vsphere.local from VMware Cloud Foundation" Type: "HOST_DECOMMISSION" is in state Failed
│
│
│

Affected Resources or Data Sources

resources/vcf_domain, resources/vcf_host, resources/vcf_cluster

Terraform Configuration

N/A

Debug Output

│ Error: Task with ID = dfc83a78-69c5-4788-b941-ca11b8b42c5e , Name: "Decommissioning host(s) esxi-5.vrack.vsphere.local from VMware Cloud Foundation" Type: "HOST_DECOMMISSION" is in state Failed

Panic Output

No response

Expected Behavior

The host should be removed from the cluster first, and only then should the vcf_host resource be destroyed.

Actual Behavior

The host resource removal (decommissioning) is attempted first, and fails because the host has not yet been removed from the cluster.

Steps to Reproduce

  1. Create a TF configuration with 4 vcf_host resources and 1 vcf_domain whose cluster includes those 4 hosts (a minimal sketch of such a configuration follows this list).
  2. Apply the plan and wait for the VCF workload domain to be created.
  3. Remove one of the vcf_host resources from the configuration and also remove the reference to it from the domain's cluster.
  4. Apply the plan. It will attempt the change shown in the Description above.

  5. The first operation attempted will be the destruction of the vcf_host resource, and it will fail.
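For illustration, here is a minimal sketch of a configuration matching steps 1-3. It is not the reporter's actual file (which was not attached); attribute and block names follow the plan output above and the vcf_cluster example later in this thread, while the variables used for the password, license key, and network pool ID are hypothetical placeholders.

resource "vcf_host" "host5" {
  fqdn            = "esxi-5.vrack.vsphere.local"
  username        = "root"
  password        = var.esxi_password      # hypothetical variable
  network_pool_id = var.network_pool_id    # hypothetical variable
  storage_type    = "VSAN"
}

# ... three more vcf_host resources declared the same way ...

resource "vcf_domain" "wld1" {
  name = "sfo-w01-vc01"
  # ... other required attributes and blocks omitted ...

  cluster {
    name = "sfo-w01-cl01"

    host {
      id          = vcf_host.host5.id      # implicit dependency on the host
      license_key = var.esxi_license_key   # hypothetical variable
      vmnic {
        id       = "vmnic0"
        vds_name = "sfo-w01-cl01-vds01"
      }
      vmnic {
        id       = "vmnic1"
        vds_name = "sfo-w01-cl01-vds01"
      }
    }
    # ... host blocks for the other three hosts ...
  }
}

Step 3 corresponds to deleting both the vcf_host.host5 resource and its host block above in a single edit, which produces the plan shown in the Description.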

Environment Details

No response

Screenshots

No response

References

No response

spacegospod commented 9 months ago

@simeon-aladjem

I'm not sure if it is possible to force Terraform to first update the cluster resource and only afterwards attempt to destroy the host resource. While we investigate, please apply the operations separately as a workaround.
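One reading of "apply the operations separately" (an interpretation, not a maintainer-confirmed procedure) is a two-step apply with a configuration edit between the steps:

# Step 1: remove the host block from the domain's cluster in the
#         configuration, but keep the vcf_host resource, so only the
#         cluster update runs.
terraform apply

# Step 2: delete the vcf_host resource block and apply again; the host
#         is now unassigned, so the HOST_DECOMMISSION task can proceed.
terraform apply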

simeon-aladjem commented 8 months ago

I'm not sure if it is possible to force Terraform to first update the cluster resource and only afterwards attempt to destroy the host resource. While we investigate, please apply the operations separately as a workaround.

Hi @stoyanzhelyazkov, when we commission hosts and add them to a cluster in the same plan, it is done in the correct order, i.e. (1) commission the hosts and then (2) add the commissioned hosts to the cluster. Likewise, when we create a workload domain with an additional cluster in the same plan, it is also done in the correct order: (1) create the domain and then (2) create the additional cluster. How do we enforce the order in those cases?

Moreover, the cluster appears as "depending on" the hosts in the .tfstate file:

{
      "mode": "managed",
      "type": "vcf_domain",
      "name": "wld1",
      "provider": "provider[\"registry.terraform.io/vmware/vcf\"]",
      "instances": [
        {
          "schema_version": 0,
          "attributes": {
            "cluster": [
                . . .
            ]
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjoxNDQwMDAwMDAwMDAwMCwiZGVsZXRlIjozNjAwMDAwMDAwMDAwLCJyZWFkIjoxMjAwMDAwMDAwMDAwLCJ1cGRhdGUiOjE0NDAwMDAwMDAwMDAwfX0=",
          "dependencies": [
            "vcf_host.host11",
            "vcf_host.host6",
            "vcf_host.host7"
          ]
        }
      ]
    },

In other words, Terraform itself should be smart enough to perform the host removal from the cluster first and only then the deletion of the host resource. But even if Terraform isn't that smart, shouldn't the provider be able to perform the operations in the right order?

tenthirtyam commented 8 months ago

If you use a depends_on in the configuration, it should remove the host from the cluster prior to removing the host in the same run.
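If this refers to an explicit depends_on from the domain to the hosts, a minimal sketch might look like the following (an assumption about the intent of the suggestion; as the reply below notes, the same ordering already exists implicitly through the host id reference):

resource "vcf_domain" "wld1" {
  name = "sfo-w01-vc01"
  # ... other attributes and blocks ...

  cluster {
    name = "sfo-w01-cl01"
    host {
      id = vcf_host.host5.id   # implicit dependency
      # ...
    }
    # ...
  }

  # Explicit ordering hint in addition to the implicit reference above.
  depends_on = [vcf_host.host5]
}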

simeon-aladjem commented 8 months ago

According to what I read in the documentation and in several blogs, depends_on is not recommended because of side effects: https://itnext.io/beware-of-depends-on-for-modules-it-might-bite-you-da4741caac70 https://developer.hashicorp.com/terraform/language/meta-arguments/depends_on#processing-and-planning-consequences

To summarise what I have learned: When creating resources, Terraform manages to do it in the right order because of the implicit dependencies, like:

resource "vcf_cluster" "wld1-cluster2" {
  name      = "sfo-w01-cl02"
  domain_id = vcf_domain.wld1.id # (1)
  host {
    id = vcf_host.host8.id # (2)
    . . .
  }
  . . .
}

Because of (1) above, TF knows to create the domain before creating the cluster. Because of (2) above, TF knows to create (commission) the host first and only then create/update the cluster.

When destroying resources, though, it seems like TF doesn't consider the dependency. Is it a Terraform issue, then?

github-actions[bot] commented 6 months ago

Marking this issue as stale due to inactivity. This helps us focus on the active issues. If this issue receives no comments in the next 30 days it will automatically be closed.

If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

github-actions[bot] commented 4 months ago

I'm going to lock this issue because it has been closed for 30 days. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.