MGMT cluster on vSphere - single or multi - stuck at initial network configuration

5280tunage commented 2 years ago

Bug Report

Hi there, looking for some help with an issue that is happening every time I try to deploy a management clusters (development or production template) to a vSphere 7.0.2 Cluster running vCenter 7.0.3.

Current setup: Windows 10 VM on vCenter/vSphere Cluster - running Docker Desktop just fine, no connection issues to hosts, clusters, vcenter. Deployment Win10 Box has all pre-reqs installed fine from the looks of it, process kicks off just fine with no issues. (I will say on this front, not sure about others but the documented deployment win10 node requirements seem way too low. The process failed left and right until I moved the deployment VM from my laptop to an ESXi host with 4cores and 12GB ram assigned to the VM).

Target cluster is a 2 node vSphere 7.0.2 cluster, with NFS storage, multiple vDS VLAN backed networks, all /24's with L3 routing at a core switch. All appropriate permissions (based on the deploy docs) have been implemented in the environment.

Currently attempting to deploy version 090, and have tried both the Ubuntu and Photon v1.21.2 OVA's. Endpoint IP address in the deployment wizard is 192.168.13.8, which is in a /24 subnet, with vlan 13 backing it. there are around 200 ip addresses available in the dhcp pool with the first 50 being excepted for static, default gw 192.168.13.1. No proxies between subnets. The cluster subnets are the default settings in the wizard.

After starting the docker desktop environment, running the 'tanzu management-cluster create' command, everything seems to work just fine. The process makes it all the way to cloning the base OVA but that's where things break down.

During the bootup of the cloned image, it appears as though the VM never properly gets the static IP address I've assigned it in the wizard, a continuous ping to the address fails permanently and the following items appear.

Powershell output

PS C:\Windows\system32> tanzu management-cluster create --ui

Validating the pre-requisites... Serving kickstart UI at http://127.0.0.1:8080 Identity Provider not configured. Some authentication features won't work. Validating configuration... web socket connection established sending pending 2 logs to UI Using infrastructure provider vsphere:v0.7.10 Generating cluster configuration... Setting up bootstrapper... Bootstrapper created. Kubeconfig: C:\Users\jared.kube-tkg\tmp\config_LrACCWww Installing providers on bootstrapper... Fetching providers Installing cert-manager Version="v1.1.0" Waiting for cert-manager to be available... Installing Provider="cluster-api" Version="v0.3.23" TargetNamespace="capi-system" Installing Provider="bootstrap-kubeadm" Version="v0.3.23" TargetNamespace="capi-kubeadm-bootstrap-system" Installing Provider="control-plane-kubeadm" Version="v0.3.23" TargetNamespace="capi-kubeadm-control-plane-system" Installing Provider="infrastructure-vsphere" Version="v0.7.10" TargetNamespace="capv-system" Start creating management cluster...

Failure while deploying management cluster, Here are some steps to investigate the cause:

Debug: kubectl get po,deploy,cluster,kubeadmcontrolplane,machine,machinedeployment -A --kubeconfig C:\Users\jared.kube-tkg\tmp\config_LrACCWww kubectl logs deployment.apps/ -n manager --kubeconfig C:\Users\jared.kube-tkg\tmp\config_LrACCWww

To clean up the resources created by the management cluster: tanzu management-cluster delete unable to set up management cluster, : unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): timed out waiting for cluster creation to complete: cluster control plane is still being initialized Identity Provider not configured. Some authentication features won't work. Validating configuration... web socket connection established sending pending 5 logs to UI Using infrastructure provider vsphere:v0.7.10 Generating cluster configuration... Setting up bootstrapper... Bootstrapper created. Kubeconfig: C:\Users\jared.kube-tkg\tmp\config_pEXva3u1 Installing providers on bootstrapper... Fetching providers Installing cert-manager Version="v1.1.0" Waiting for cert-manager to be available... Installing Provider="cluster-api" Version="v0.3.23" TargetNamespace="capi-system" Installing Provider="bootstrap-kubeadm" Version="v0.3.23" TargetNamespace="capi-kubeadm-bootstrap-system" Installing Provider="control-plane-kubeadm" Version="v0.3.23" TargetNamespace="capi-kubeadm-control-plane-system" Installing Provider="infrastructure-vsphere" Version="v0.7.10" TargetNamespace="capv-system" Start creating management cluster...

Failure while deploying management cluster, Here are some steps to investigate the cause:

Debug: kubectl get po,deploy,cluster,kubeadmcontrolplane,machine,machinedeployment -A --kubeconfig C:\Users\jared.kube-tkg\tmp\config_pEXva3u1 kubectl logs deployment.apps/ -n manager --kubeconfig C:\Users\jared.kube-tkg\tmp\config_pEXva3u1

To clean up the resources created by the management cluster: tanzu management-cluster delete unable to set up management cluster, : unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): timed out waiting for cluster creation to complete: cluster control plane is still being initialized

Logging Console Output

ℹ [0128 23:19:33.51037]: init.go:108] Validating configuration... ℹ [0128 23:19:38.06648]: init.go:159] Using infrastructure provider vsphere:v0.7.10 ℹ [0128 23:19:38.07468]: init.go:161] Generating cluster configuration... ℹ [0128 23:19:43.69018]: init.go:169] Setting up bootstrapper... ℹ [0128 23:19:45.84430]: client.go:123] Fetching configuration for kind node image... ℹ [0128 23:19:45.85117]: client.go:237] kindConfig: &{{Cluster kind.x-k8s.io/v1alpha4} [{ map[] [{/var/run/docker.sock /var/run/docker.sock false false }] [] [] []}] { 0 100.96.0.0/13 100.64.0.0/13 false } map[] map[] [apiVersion: kubeadm.k8s.io/v1beta2 kind: ClusterConfiguration imageRepository: projects.registry.vmware.com/tkg etcd: local: imageRepository: projects.registry.vmware.com/tkg imageTag: v3.4.13_vmware.15 dns: type: CoreDNS imageRepository: projects.registry.vmware.com/tkg imageTag: v1.8.0_vmware.5] [] [] []} ℹ [0128 23:19:45.85117]: client.go:131] Creating kind cluster: tkg-kind-c7qdp0dcpuoh3h5h3qb0 ℹ [0128 23:19:52.60606]: logger.go:115] Creating cluster "tkg-kind-c7qdp0dcpuoh3h5h3qb0" ... ℹ [0128 23:19:52.60606]: logger.go:115] Ensuring node image (projects.registry.vmware.com/tkg/kind/node:v1.21.2_vmware.1) ... ℹ [0128 23:19:55.26557]: logger.go:115] Image: projects.registry.vmware.com/tkg/kind/node:v1.21.2_vmware.1 present locally ℹ [0128 23:19:57.85211]: logger.go:115] Preparing nodes ... ℹ [0128 23:20:32.05869]: logger.go:115] Writing configuration ... ℹ [0128 23:20:55.76368]: logger.go:115] Starting control-plane ... ℹ [0128 23:21:48.50128]: logger.go:115] Installing CNI ... ℹ [0128 23:21:57.38337]: logger.go:115] Installing StorageClass ... ℹ [0128 23:22:13.67450]: logger.go:115] Waiting 2m0s for control-plane = Ready ... ℹ [0128 23:22:19.67151]: logger.go:115] Ready after 4s ℹ [0128 23:22:56.80025]: init.go:176] Bootstrapper created. Kubeconfig: C:\Users\jared.kube-tkg\tmp\config_pEXva3u1 ℹ [0128 23:22:56.98962]: init.go:188] Installing providers on bootstrapper... ℹ [0128 23:24:24.17906]: init.go:482] installed Component=="cluster-api" Type=="CoreProvider" Version=="v0.3.23" ℹ [0128 23:24:24.17910]: init.go:482] installed Component=="kubeadm" Type=="BootstrapProvider" Version=="v0.3.23" ℹ [0128 23:24:24.17910]: init.go:482] installed Component=="kubeadm" Type=="ControlPlaneProvider" Version=="v0.3.23" ℹ [0128 23:24:24.17910]: init.go:482] installed Component=="vsphere" Type=="InfrastructureProvider" Version=="v0.7.10" ℹ [0128 23:24:24.84000]: init.go:651] Waiting for provider infrastructure-vsphere ℹ [0128 23:24:24.84003]: init.go:651] Waiting for provider cluster-api ℹ [0128 23:24:24.84003]: init.go:651] Waiting for provider control-plane-kubeadm ℹ [0128 23:24:24.84003]: init.go:651] Waiting for provider bootstrap-kubeadm ℹ [0128 23:24:25.26386]: clusterclient.go:1105] Waiting for resource capi-kubeadm-control-plane-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:25.29133]: clusterclient.go:1105] Waiting for resource capi-kubeadm-bootstrap-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:25.37413]: clusterclient.go:1105] Waiting for resource capv-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:25.42063]: clusterclient.go:1105] Waiting for resource capi-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:40.53867]: clusterclient.go:1105] Waiting for resource capi-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:40.55779]: clusterclient.go:1105] Waiting for resource capi-kubeadm-bootstrap-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:40.60569]: init.go:659] Passed waiting on provider cluster-api after 15.7545543s ℹ [0128 23:24:40.69034]: init.go:659] Passed waiting on provider bootstrap-kubeadm after 15.8503155s ℹ [0128 23:24:50.51253]: clusterclient.go:1105] Waiting for resource capi-kubeadm-control-plane-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:24:50.59350]: init.go:659] Passed waiting on provider control-plane-kubeadm after 25.7534787s ℹ [0128 23:25:05.49097]: clusterclient.go:1105] Waiting for resource capv-controller-manager of type v1.Deployment to be up and running ℹ [0128 23:25:05.54663]: init.go:659] Passed waiting on provider infrastructure-vsphere after 40.7066744s ℹ [0128 23:25:05.54663]: init.go:670] Success waiting on all providers. ℹ [0128 23:25:05.54833]: init.go:202] Start creating management cluster... ‼ [0129 00:00:20.74099]: init.go:705] Failure while deploying management cluster, Here are some steps to investigate the cause: ‼ [0129 00:00:20.74099]: init.go:706] Debug: ‼ [0129 00:00:20.75161]: init.go:707] kubectl get po,deploy,cluster,kubeadmcontrolplane,machine,machinedeployment -A --kubeconfig C:\Users\jared.kube-tkg\tmp\config_pEXva3u1 ‼ [0129 00:00:20.75161]: init.go:708] kubectl logs deployment.apps/ -n manager --kubeconfig C:\Users\jared.kube-tkg\tmp\config_pEXva3u1 ‼ [0129 00:00:20.76881]: init.go:711] To clean up the resources created by the management cluster: ‼ [0129 00:00:20.78402]: init.go:712] tanzu management-cluster delete ✘ [0129 00:00:20.80994]: init.go:86] unable to set up management cluster, : unable to wait for cluster and get the cluster kubeconfig: error waiting for cluster to be provisioned (this may take a few minutes): timed out waiting for cluster creation to complete: cluster control plane is still being initialized

Expected Behavior

To me, it very much appears as though the IP Address isn't getting properly assigned from the static address input. The VM is using a VMXNET3 adapter, I've tried re-attaching the NIC, changing the network backing, etc. I have noticed it appears in vCenter to show an IPV6 address for the VM, but not an IPV4 address.

Another issue One thing I've also noticed, is that at this point, regardless if I let it timeout and fail, or if I kill the powershell command, it doesn't appear to properly clear out previous attempts. That is, even though the process fails, it tells me on subsequent attempts that the management cluster already exists. I've even tried manually deleting previous deployment scripts, etc. I have to change the name of the cluster.

Steps to Reproduce the Bug

Every deployment, regardless of which distributed port group I use, the VM never appears to get a static address. I have other services, like VIC, and even minikub running in these environments, all working fine with static and DHCP connectivity.

Screenshots or additional information and context

Environment Details

Build version (tanzu version): 0.9.0
Deployment (Managed/Standalone cluster): Managed and Standalone - same result
Infrastructure Provider (Docker/AWS/Azure/vSphere): vSphere 7.0.2, vCenter 7.0.3, dvs version 6.5.0
Operating System (client): Win10 build 19043.1466

Diagnostics and log bundle

Happy to collect other logs if they are available, can't login to the deployed VM to troubleshoot from there. Have tried rebooting the VM several times as well.

github-actions[bot] commented 2 years ago

Hey @jhedman2! Thanks for opening your first issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on Tanzu Community Edition.

stmcginnis commented 2 years ago

Hi @5280tunage - thanks for reaching out. I have seen similar failures to this when someone tried to deploy in an environment that did not have DHCP addresses available. Can you confirm that your environment has DHCP available and that the VMs are being created on a network that is able to get those request/responses?

5280tunage commented 2 years ago

Thanks @stmcginnis for your response. As I believe I threw in the description, the vlan these VM's is assigned to has other K8s lab nodes that have been working just fine with DHCP and booting. I did just double check, just to be sure, the DHCP scope on the router for that segment only has 23 current leases out of 200 available addresses, so that shouldn't be the issue.

Maybe this would help, can you describe the process of deploying the management pod? i.e. I thought the IP address we enter in the wizard for the endpoint (in this case 192.168.13.8) is what gets assigned to that VM that is deployed via the deployment manager is it not? So would it even attempt to use DHCP at all? So when the ephemeral mgmt pod is created, to create the wizard, which is then used to deploy the management pod to vSphere, is there an automation routine that manually configures that address at some point?

stmcginnis commented 2 years ago

When the VM is deployed, it still required DHCP to provide it an IP address. After it is up and running, it still uses DHCP, but assigned a virtual IP address based on the provided static one from the configuration.

You had mentioned not receiving an IPV4 address, only IPV6, so that is why I asked. Something must be preventing it from properly getting its own address. I'm not sure what that could be, but it must be something in the local environment.

Any chance you can configure a test VM with the exact same settings as the provisioned VM to see if there is any difference? I'll try to think of anything else to check, but the root of it does seem to be the lack of that IPv4 address getting assigned.

5280tunage commented 2 years ago

I actually just deployed a VM from the PhotonOS template (photon-3-kube-v1.21.2+vmware.1) and within seconds of booting it received a valid DHCP address in the correct VLAN. While I wasn't able to login to the VM (root pass doesn't seem to be the default photon root), i was able to ping the VM from several servers and workstations just fine. Definitely weird but I agree that was a good test. Given my layout, is there a log someplace on the deployment environment that might help us?

Also, as per a previous request as well, is there place I can clear out those previous cluster names that never actually got deployed?

stmcginnis commented 2 years ago

Also, as per a previous request as well, is there place I can clear out those previous cluster names that never actually got deployed?

That sounds like there is some metadata being left behind in ${HOME}/.config/tanzu/clusterconfigs (I think that's the same on Windows).

Given my layout, is there a log someplace on the deployment environment that might help us?

Can you compare the configuration of the machine being created by TCE with the configuration used in your successful test? There really shouldn't be any difference when created with TCE, so it seems like it would have to be something underneath. Same vSwitch, same NIC type?

5280tunage commented 2 years ago

So I actually did another quick test today. I took the exact OVA/template that I reference in the build (photon-3-kube-v1.21.2+vmware.1), converted it to a VM, booted the VM, and voila, in less than a minute it was sitting at a root login prompt and I could ping it's ipv4 address via any computer in the network. vCenter was properly updated with the IP address that vmtools was reporting.

So with that, it really seems like something in the bootstrap process is preventing say the network stack from firing up properly. I'm going to try and get some time to pour through the templates and overlays, just to see if I see anything. I can't login to the VM that's deployed as it's not using the standard photon root password I'm used to, so I really have no way of checking logs on the VM itself.

Strangely, the Ubuntu template exhibits exactly the same behavior. thanks for your help, really trying to figure this out.

5280tunage commented 2 years ago

I've done a ton more troubleshooting, still to no avail. Really not sure what else to do. As stated, deploying a VM from either template in vCenter to exactly the same network works perfectly fine every time. I thought I may have found an issue, in that the IPV4 settings in the downloaded OVA didn't have DHCP checked in the OVA settings, I changed those settings and converted back to a Template, redeployed a mgmt cluster, and still no dice. Those settings FYI:

I'm attaching several logs that demonstrate some of the errors. you can tell the system is trying to contact the VM at the endpoint address specified, but I can't find anywhere in any of the logs where the system is actually starting with DHCP, then making the changes to the VM. Couple things come to mind, first off, is there a log in the docker container I can look at to find out where the customization/bootstrap process is working and/or dying? A specific log, like CAPI, CAPV, etc.

Also, what are the chance none of the customization process is working? Is there a log to find out? Could an invalid public key entered in the wizard cause all of this? I created a priv/pub key via puttygen. And strangely, I've seen a few variations online of what needs to be entered into that field. Super frustrating, I've been pouring over every line of code. I actually went into the overlay and tried to bypass any IPV6 addresses as well. That brings up another point, I have no DHCP server in my environment that will hand out IPV6 addresses. So based on the IPV6 address (fe80::250:56ff:fe82:828e), I think it's actually nothing more than a link local address.

Appreciate any help cert-manager.txt capi-kubeadm-control-plane logs.txt capv-logs.txt .

5280tunage commented 2 years ago

Also just followed this doc to make sure my dhcp server is sending the correct dhcp options. I can't believe I'm somehow the only person experiencing this... I have a pretty common lab setup, HP hosts, NFS storage, 7U3 vCetner, 7U2 hosts, distributed virtual switches without NSX...

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/1.4/vmware-tanzu-kubernetes-grid-14/GUID-mgmt-clusters-vsphere.html#mc-vsphere7

I have the listed dhcp options statically configured.

ghost commented 2 years ago

Also just followed this doc to make sure my dhcp server is sending the correct dhcp options. I can't believe I'm somehow the only person experiencing this... I have a pretty common lab setup, HP hosts, NFS storage, 7U3 vCetner, 7U2 hosts, distributed virtual switches without NSX...

https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/1.4/vmware-tanzu-kubernetes-grid-14/GUID-mgmt-clusters-vsphere.html#mc-vsphere7

I have the listed dhcp options statically configured.

I am experiencing exactly the same strange behavior, but with downstream TKG 1.4.1. Tried ubuntu and photon OVAs, with same issue. I also have common environment: 7.0.3 ESXis and 7.0u2 vcenter, regular vSwitch, DRS on, HA on, etc. And have all required options in our DHCP server (gw, ntp, dns, mask). This DHCP server works just fine in other VMs on same VLAN, also checked that i am not out of IPs. Checked cloud-init metada and seems fine, with DHCP activated on VMXNET's MAC.

If I clone the template and try to turn it on without any cloud-init data, also fails with the same error on startup.

stmcginnis commented 2 years ago

I can't login to the VM that's deployed as it's not using the standard photon root password I'm used to

The deployed machine can be SSH'd into using the instructions here: https://tanzucommunityedition.io/docs/latest/tips/#connect-to-cluster-nodes-with-ssh

Not sure if that is useful, but maybe that can help to compare the runtime configuration.

stmcginnis commented 2 years ago

@yastij - it looks like you may have dealt with similar issues in the past. Any ideas here? Or do you know if there is someone else we can ask for input here? Thanks!

5280tunage commented 2 years ago

I can't login to the VM that's deployed as it's not using the standard photon root password I'm used to

The deployed machine can be SSH'd into using the instructions here: https://tanzucommunityedition.io/docs/latest/tips/#connect-to-cluster-nodes-with-ssh

Not sure if that is useful, but maybe that can help to compare the runtime configuration.

Sadly that won't work, the management node that's deployed in the vSphere environment isn't getting a DHCP address, so there's really no way to SSH to the node. Thus why I was trying to login via console. But thanks for trying to help!

ghost commented 2 years ago

I can't login to the VM that's deployed as it's not using the standard photon root password I'm used to

The deployed machine can be SSH'd into using the instructions here: https://tanzucommunityedition.io/docs/latest/tips/#connect-to-cluster-nodes-with-ssh Not sure if that is useful, but maybe that can help to compare the runtime configuration.

Sadly that won't work, the management node that's deployed in the vSphere environment isn't getting a DHCP address, so there's really no way to SSH to the node. Thus why I was trying to login via console. But thanks for trying to help!

We could try the following: 1- Try to deploy TKG/Com. 2- When VM is deployed, stop deployment with kind delete cluster. This way we can shutdown the VM, because tanzu tries to turn it on continuously. We need the VM in a power off state to edit its advanced configuration keys. 3- Edit cloud-init through VM advanced configuration (I think in vCenter 7.0u3 this is a bit easier), as of 7.0 u1/2: Edit Settings > VM Options (Tab) > Advanced > Configuration parameters, Edit configuration.

If I am not mistaken cloud-init needs these keys: guestinfo.metadata (IP config) guestinfo.metadata.encoding = base64 guestinfo.userdata (users and ssh keys config, and more) guestinfo.userdata.encoding = base64

You may edit guestinfo.metadata to include a timeout in network configuration, and edit guestinfo.userdata to include a root password for login

ghost commented 2 years ago

https://bugs.launchpad.net/cloud-init/+bug/1946493 seems related, cloud-init is 21.1-x?

ghost commented 2 years ago

I solved my issue, I had an ACL blocking intra-vlan dhcp traffic through trunk link, DHCPREQ (unicast) worked for older clients or wih old leases, but not for DHCPDISCOVER!

stmcginnis commented 2 years ago

Awesome, glad you figured it out @gbarceloPIB! Thanks for letting us know what it was.

@5280tunage any chance you have a similar set up in your environment?

5280tunage commented 2 years ago

Unfortunately no, but my understanding is the DHCPDISCOVER should only happen when using a DHCP relay. Given that my switch is the DHCP server, there is no DHCP relay in the equation, and I certainly don't have any ACL's blocking traffic types like this on my L2/L3 switch.

ghost commented 2 years ago

Unfortunately no, but my understanding is the DHCPDISCOVER should only happen when using a DHCP relay. Given that my switch is the DHCP server, there is no DHCP relay in the equation, and I certainly don't have any ACL's blocking traffic types like this on my L2/L3 switch.

From my understanding this is not how DHCP works (simplifying):

If client had a previous old lease tries to request this one, with DHCPREQUEST (unicast)
IF no leases were found, do DHCPDISCOVER to broadcast MAC (ff...ff) and 255.255.255.255 IP.
DHCP server responds with DHCP Offer.
etc

You may spin up a TKG Ubuntu OVA, when grub appears, go to Advanced, and start a recovery root shell, change root password and restart. Now you will have local root access. Flush nic with ip addr flush dev ens192 (or similar, could be wrong). Then perform comamnd dhclient -v -4 to check if DHCP works as it should.

5280tunage commented 2 years ago

thanks, i'm trying to follow the process @gbarceloPIB has suggested, I've booted into the recovery shell and have changed the root password, but whats the best way to avoid the cloud-init process, as every subsequent restart simply restarts the cloud-init process which is where the VM hangs indefinitely. I checked in the grub editor and don't see an easy way to bypass the cloud-init process. using recovery mode with networking also fails with multiple errors.

5280tunage commented 2 years ago

Once again, I spent most of the day troubleshooting this and found nothing to work. I'll post some more screenshots in case anyone can think of anything, but I've tried literally everything I can think of.

I put the windows10 system with docker desktop in the exact same vlan as the deployed templates, just to negate any L2 broadcast issues.
I enabled a Win2012 server as a dhcp server for that particular vlan and setup a helper address for it
As you'll see below, I did all kinds of things with the actual deployed VM from the template, including putting a static address in the network config, nothing after restart
changed nic type to e1000 and back to vmxnet
Used vrealize network insight and logging on switch to look for anything from the VM, not a single issue was showing up
I do however see a ton of outbound broadcast traffic from the vnic the VM is attached to, with very little ingress
you'll notice in one of the windows below, the eth0 interface is showing down all the time
remember, any VM i clone from the photon or ubuntu templates boots up perfectly and connects via dhcp if NOT using the TCE deployment process (cloud-init)

First, notice this image, shows the interface is noop down

-Next, this image, where you can see the correct data is listed in the config scripts

then, notice here where I static configured an address, which never worked (and yes, the mac address changed after changing the nic type)

Any help still greatly appreciated.

ghost commented 2 years ago

@5280tunage

As you'll see below, I did all kinds of things with the actual deployed VM from the template, including putting a static address in the network config, nothing after restart.

I think you have to change instance-id to apply a new metadata? Have you tried flushing the device, and setting an IP addr and GW through iproute2 suite? Then exec dhclient -v command and check output. I would also try:

Try to put the DHCP server in the same esxi host, and same vSwitch (or DS), it may be like this already? You may put a little dnsmasq test server. This way you will be sure that you do not have broadcast issues in your L2 infra (what happened to me).
Is stated in TKG docs that you also need to specify an NTP server in DHCP options. Still, I think this would not provoke a boot locking like this one.
After a 'bad' boot with DHCP activated and no wait for network enabled (also with no userdata set), you may check for logs: journalctl | grep -i 'dhcp'
When VM boots up, check for DHCP server logs, or capture traffic on DHCP server. Tanzu-mgmt's DHCPDISCOVER have to be tehre.
Re-import your OVA templates, check for sha256 sums first. Also do not power up, only mark as template after import.

I am sorry, but I cannot figure out another path...

stmcginnis commented 2 years ago

Hi @5280tunage - just wanted to check back on this. I don't have any new suggestions or ideas, but wanted to check if you were ever able to resolve this. It does appear to be something specific to your environment, but not really sure what.

vmware-tanzu / community-edition