hashicorp / terraform-provider-vsphere

Terraform Provider for VMware vSphere
https://registry.terraform.io/providers/hashicorp/vsphere/
Mozilla Public License 2.0

Deleting larger datastores with `vsphere_vmfs_datastore` times out #2249

Open ianc769 opened 2 months ago

ianc769 commented 2 months ago

Terraform: 1.8.5
Terraform Provider: 2.8.2
VMware vSphere: 8.0.2.00300

Description

Possibly related #417

When destroying VMs that have larger disks (in our case, 2 × 600 GB), deleting the backing VMFS datastores times out, since vSphere takes longer than 30 seconds to delete them.

Perhaps a customizable wait timeout would be a good route for this resource, as vSphere's deletion timing seems to be inconsistent.

Affected Resources or Data Sources

resources/vsphere_vmfs_datastore

Possibly this code:

https://github.com/hashicorp/terraform-provider-vsphere/blob/0a41cd630a3c308f789249ddc97f2653c8fd66e2/vsphere/resource_vsphere_vmfs_datastore.go#L407-L415
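
For illustration, a minimal sketch of how a configurable timeout could be wired in through the plugin SDK's standard timeouts mechanism (this is an assumption about a possible fix, not the provider's actual code; the 10-minute default is arbitrary):

package vsphere

import (
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Sketch only: give vsphere_vmfs_datastore a user-configurable delete
// timeout instead of a fixed wait window.
func resourceVSphereVmfsDatastore() *schema.Resource {
	return &schema.Resource{
		// ... existing schema and CRUD functions elided ...
		Timeouts: &schema.ResourceTimeout{
			// Users could then raise this per resource in HCL with a
			// timeouts { delete = "20m" } block.
			Delete: schema.DefaultTimeout(10 * time.Minute),
		},
	}
}

The delete function would then take its wait deadline from d.Timeout(schema.TimeoutDelete) rather than a hard-coded duration.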

Terraform Configuration

resource "vsphere_virtual_machine" "pg_db" {
  for_each                = var.pg_db_servers
  name                    = each.key
  annotation              = each.value.annotation
  num_cpus                = each.value.num_cpus
  memory                  = each.value.memory
  folder                  = vsphere_folder.pg_db_folder.path
  datastore_id            = vsphere_vmfs_datastore.pg_data_datastore[each.key].id
  resource_pool_id        = data.terraform_remote_state.core_state.outputs.vsphere_compute_cluster.resource_pool_id
  guest_id                = data.vsphere_virtual_machine.rhel8.guest_id
  scsi_type               = data.vsphere_virtual_machine.rhel8.scsi_type
  tags                    = toset([data.terraform_remote_state.core_state.outputs.rhel_tag.id, data.terraform_remote_state.core_state.outputs.tf_tag.id])
  cpu_hot_add_enabled     = true
  memory_hot_add_enabled  = true
  firmware                = "efi"
  efi_secure_boot_enabled = true

  network_interface {
    network_id = each.value.network_id
  }
  disk {
    label       = "${each.key}.vmdk"
    size        = local.os_volume_size
    unit_number = 0
  }
  disk {
    label       = "${each.key}-data.vmdk"
    size        = each.value.pgdatasize
    unit_number = 1
  }
  disk {
    label        = "${each.key}-pg.vmdk"
    size         = each.value.pgpgsize
    unit_number  = 2
    datastore_id = vsphere_vmfs_datastore.pg_pg_datastore[each.key].id
  }

  clone {
    template_uuid = data.vsphere_virtual_machine.rhel8.id

    customize {
      linux_options {
        host_name = each.key
        domain    = each.value.linux_domain
      }
      dns_server_list = each.value.dns_server_list
      network_interface {
        ipv4_address = each.value.ipv4_address
        ipv4_netmask = each.value.ipv4_netmask
      }
      ipv4_gateway = each.value.ipv4_gateway

    }
  }
  connection {
    type     = "ssh"
    host     = self.default_ip_address
    user     = var.packer_user
    password = var.packer_pass
  }

  lifecycle {
    ignore_changes = [
      disk,
      resource_pool_id,
      clone[0],
      ept_rvi_mode,
      hv_mode
    ]
  }

  depends_on = [
    vsphere_vmfs_datastore.pg_data_datastore,
    vsphere_vmfs_datastore.pg_pg_datastore
  ]

}

Debug Output

https://gist.github.com/ianc769/b6fa08135736fa5db09ff49e5c85bd11

Panic Output

No response

Expected Behavior

The datastores are fully deleted with no errors.

Actual Behavior

An error is raised if the operation hits the 30-second mark.

Steps to Reproduce

Create a VM with a few large disks, then destroy it via Terraform.

Environment Details

No response

Screenshots

[screenshot]

References

#417

github-actions[bot] commented 2 months ago

Hello, ianc769! 🖐

Thank you for submitting an issue for this provider. The issue will now enter into the issue lifecycle.

If you want to contribute to this project, please review the contributing guidelines and information on submitting pull requests.

ianc769 commented 2 months ago

Workaround:

tenthirtyam commented 2 months ago

Have you tried setting the `api_timeout` for the provider?

provider "vsphere" {
  user                 = var.vsphere_user
  password             = var.vsphere_password
  vsphere_server       = var.vsphere_server
  allow_unverified_ssl = true
  api_timeout          = 10
}

This will override the default timeout of 5 minutes.

https://github.com/hashicorp/terraform-provider-vsphere/blob/0a41cd630a3c308f789249ddc97f2653c8fd66e2/vsphere/provider.go#L186-L195
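
For context on what this setting does: it ends up as a client-side deadline on API calls. A simplified sketch of the general pattern (illustrative only, not the provider's exact code):

package main

import (
	"context"
	"time"
)

// Illustrative only: a minutes-based API timeout becomes a context
// deadline. Any task wait using this ctx returns an error once the
// deadline passes, even if vSphere later finishes the operation.
func withAPITimeout(minutes int) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.Background(), time.Duration(minutes)*time.Minute)
}

That mismatch (client gives up while the server finishes) is consistent with what's reported later in this thread.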

Ryan Johnson, Distinguished Engineer, VMware by Broadcom

ianc769 commented 2 months ago

Hey @tenthirtyam, does this affect the resource itself? I see that the default is 5 minutes according to the docs: https://registry.terraform.io/providers/hashicorp/vsphere/latest/docs#api_timeout

It looks like fully deleting the datastore takes about 1 minute. [screenshot]

I can try the build/destroy again with it doubled to 10 minutes, but I suspect it will not work.

ianc769 commented 2 months ago

I upped the timeout via the environment variable VSPHERE_API_TIMEOUT=10. Sadly, it looks like the provider is still faster than vSphere.

[screenshot]

The datastores do actually delete.

[screenshot]

Current workaround: run the destroy, remove the state objects that are listed as failed, then run the apply again.
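
Roughly, in commands (the resource address below is hypothetical; substitute whichever objects your destroy reported as failed):

terraform destroy
# deletion times out, but the datastores are actually gone in vSphere
terraform state rm 'vsphere_vmfs_datastore.pg_data_datastore["db1"]'
# repeat for each failed object, then:
terraform apply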

hatakashi commented 1 month ago

We're also noticing similar behaviour when deleting multiple datastores, regardless of whether a VM is being destroyed.

In our case, we are attempting to destroy 8 datastores and receive a timeout on any that are not deleted within 30 seconds, or sometimes within 1 minute and 5 seconds (we've seen both, though it is now consistently 30 seconds). The workaround mentioned here is the same one we found ourselves; however, our pipelines are triggered by users unfamiliar with Terraform, so it isn't viable: the timeout causes the pipeline to consider itself failed.

Behaviour is noticeable even on relatively small, unused datastores (50-100 GB).

Changing the API Timeout seems to have no effect on this.

Terraform: 1.8.5
Terraform Provider: 2.8.3 & 2.9.2
VMware vSphere: 8.0.2.00300