hashicorp / terraform-provider-vsphere

Terraform Provider for VMware vSphere
https://registry.terraform.io/providers/hashicorp/vsphere/
Mozilla Public License 2.0

Deploying Windows Template Fails When Customize Block is Used #2021

Closed: itskvad closed this issue 11 months ago

itskvad commented 1 year ago


Terraform

v1.5.2

Terraform Provider

v2.4.3

VMware vSphere

7.0.2

Description

While deploying from a template, I encounter the following error:


Error:

β”‚ Virtual machine customization failed on "/Datacenter/vm/VM01":
β”‚
β”‚ An error occurred while setting up network properties of the guest OS.
β”‚ See the log file C:/Windows/TEMP/vmware-imc/guestcust.log in the guest OS for details.
β”‚
β”‚ The virtual machine has not been deleted to assist with troubleshooting. If
β”‚ corrective steps are taken without modifying the "customize" block of the
β”‚ resource configuration, the resource will need to be tainted before trying
β”‚ again. More info: https://www.terraform.io/docs/commands/taint.html
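
For reference, the taint step the error mentions is a single CLI command; a minimal sketch, using the resource address from the configuration below (on Terraform v0.15.2 and later, -replace is the non-deprecated equivalent):

# Mark the failed VM for recreation on the next apply:
terraform taint vsphere_virtual_machine.vm

# Equivalent, non-deprecated form:
terraform apply -replace="vsphere_virtual_machine.vm"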

Looking into the guestcust.log, I found this line:


simplesock: couldn't bind on source port 1023, error 10048, only one usage of each socket address is normally permitted

The main issue appears to be that, post-deployment, the NIC fails to connect automatically. It only connects after a complete shutdown and power-on of the VM, not after a hot reboot.

A couple of observations and steps I've taken so far:

  1. The issue doesn't occur when the template is deployed manually via right-click in the vSphere interface.
  2. I attempted to use a known-working template from another vSphere instance, but encountered the same error.
  3. Switching to a different VMware network didn't resolve the issue.
  4. Deploying the VM without customisation succeeds (see the sketch below).

The deployment works flawlessly on another vSphere instance (prd) without any errors. Same vSphere version, same template, nothing obviously different.
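
For comparison, observation 4 amounts to the full configuration below with the customize block dropped from the clone; a minimal sketch:

# Sketch: the same clone with customization omitted deploys successfully.
clone {
  template_uuid = data.vsphere_virtual_machine.template.id
}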

Affected Resources or Data Sources

resource/vsphere_virtual_machine/vm

Terraform Configuration


resource "vsphere_virtual_machine" "vm" {
  name             = "VM01"
  resource_pool_id = data.vsphere_resource_pool.pool.id
  datastore_id     = data.vsphere_datastore.datastore.id

  num_cpus = 2
  memory   = 2048

  guest_id = "windows9Server64Guest"
  firmware                    = "efi"
  efi_secure_boot_enabled     = false

  wait_for_guest_net_timeout = 0
  wait_for_guest_ip_timeout = 0

  network_interface {
    network_id = data.vsphere_network.network.id
    adapter_type = "vmxnet3"
  }

  disk {
    label = "disk0"
    size  = 80
  }

  clone {
    template_uuid = data.vsphere_virtual_machine.template.id

    customize {
        windows_options {
            computer_name  = "VM01"
            # workgroup      = "WORKGROUP"
        }

        network_interface {

        ipv4_address = "192.168.1.108"
        ipv4_netmask = "24"

        }   
        ipv4_gateway    = "192.168.1.254" 
        dns_server_list = ["192.168.1.1", "192.168.1.2"]
        }
    }
  }
}  

data "vsphere_datacenter" "dc" {
  name = "Datacenter"
}

data "vsphere_datastore" "datastore" {
  name          = "Datastore"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_resource_pool" "pool" {
  name          = "rp"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_network" "network" {
  name          = "vlan01"
  datacenter_id = data.vsphere_datacenter.dc.id
}

data "vsphere_virtual_machine" "template" {
  name          = "_Template"
  datacenter_id = data.vsphere_datacenter.dc.id
}

Debug Output

https://gist.github.com/itskvad/a5fc1649864f43520351ddf187d168da

Panic Output

No response

Expected Behavior

The VM is deployed from the template with the Windows customisations applied (in this case, network settings and the computer name).

Actual Behavior

When the customize block is added, VM deployment fails. The NIC cannot reach a connected state until the VM is hard powered off and back on.

Steps to Reproduce

Deploy the Windows Server 2019 Datacenter template using the Terraform configuration above.

Environment Details

No response

Screenshots

No response

References

No response

github-actions[bot] commented 1 year ago

Hello, itskvad! πŸ–

Thank you for submitting an issue for this provider. The issue will now enter into the issue lifecycle.

If you want to contribute to this project, please review the contributing guidelines and information on submitting pull requests.

teivarodiere commented 1 year ago

Here we go: I was having the same issue in the logs, but while deploying VMs manually from the same template. I went from 1 out of 10 successful deployments, to about 50%, then to none.

Over a couple of days I ruled out the following:

It did feel like an issue with the VM's NIC, so I deleted the VMXNET3 NIC from my newly deployed VMs (from the template) and, sure enough, was able to connect all of my VMs to the network. The customisations were lost, though, since that process fails due to the initial lack of network connection. So I was convinced it was related to the vNIC.

I converted my VM template into a VM, deleted its VMXNET3 NIC, added a new VMXNET3 NIC, didn't power the VM on, converted it back into a template, and then deployed 2 out of 2 VMs no worries. But later I ran into the same issue again; only about 50% worked.

On day 3 I remembered that vCenter needs to know the VM's MAC address, to avoid conflicts, before it begins configuring VMs. So I hypothesised that the issue might be caused by deploying new VMs from the same template too quickly: a new VM initially carries the template's MAC address until it is powered on and vCenter learns the new one.

Another observation I made several times was that sometimes VM 1 would customise correctly, then VM 2 and VM 3 wouldn't, and VM 4 would. Perhaps timing was a factor.

I began waiting several minutes between provisioning VMs, and this appears to have solved my problem.
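
A hedged sketch of one way to encode that kind of delay in Terraform, assuming the hashicorp/time provider's time_sleep resource (my wait was manual; the resource names here are illustrative):

# Illustrative only: force a pause between dependent clones
# using the hashicorp/time provider.
resource "time_sleep" "wait_between_clones" {
  depends_on      = [vsphere_virtual_machine.vm01]
  create_duration = "5m"
}

resource "vsphere_virtual_machine" "vm02" {
  # ... same settings as vm01, elided for brevity ...
  depends_on = [time_sleep.wait_between_clones]
}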

A couple of notes:
a) I also confirmed that I had to wait several minutes after deleting my VMs before redeploying them.
b) My vCenter is in a different state from the cluster, and I suspect that latency is why I need to wait longer than I would if both vCenter and the cluster were in the same location, on faster networks.

I hope this helps anyone.

itskvad commented 11 months ago

Our issue was that NSX-T wasn't coping:

"A L2 default Any Any rule is applied to ALL traffic traversing virtual machine workloads.

When the L2 default Any Any rule is set to log events and there is a high amount of traffic, this causes a high hit count and increases the number of CPU threads required to process those hits.

This resource-intensive process leaves fewer CPU threads available for nsx-proxy and other internal daemons to use. The lack of available CPU threads results in the daemons failure to respond to inter-process communication. As a result, the process reports as down.

There is a bug in the current code which prevents cfg-agent from recovering from a down state on its own which leads to the vMotion/ port assignments."

Resolution: this issue is resolved in NSX-T versions 3.1.4 and 3.2.1.
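
The fix for us was the upgrade, but for illustration, logging on a broad default rule can also be switched off declaratively; a sketch, assuming the vmware/nsxt Terraform provider's nsxt_policy_predefined_security_policy resource and the default layer-2 section path (verify both against your environment):

# Illustrative sketch only: disable logging on the predefined default
# layer-2 rule. The policy path below is an assumption.
resource "nsxt_policy_predefined_security_policy" "default_l2" {
  path = "/infra/domains/default/security-policies/default-layer2-section"

  default_rule {
    action = "ALLOW"
    logged = false
  }
}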

itskvad commented 11 months ago

Closed.

github-actions[bot] commented 10 months ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.