vmware / cluster-api-provider-cloud-director

Cluster API Provider for VMware Cloud Director. The project is an open source implementation of K8s ClusterAPI project and allows customers to provision resources directly from VMware Cloud Director. It enables Cloud Director powered Clouds to be treated as yet-another-cloud in the multi-cloud journey for VMware Cloud Providers.
Apache License 2.0

A newly created node is not able to reach the T1 edge after an update. #504

Open TimTinneveld opened 1 year ago

TimTinneveld commented 1 year ago

Describe the bug

After a redeployment or update of control planes or workers, a newly created VM immediately gets the same IP address as the deleted old VM, because that is the first free IP in the pool. This can cause problems, because the templates provided by VMware usually do not send any ARP announcement that they are now using the previously used IP. By default, NSX-T discovers new IPs only after a timeout of 8 minutes, so the ARP table in NSX-T is not updated right away. Because of these ARP issues the VM cannot ping the T1 router, and therefore it also cannot reach the API and join the cluster.

So far I have been able to enable the retry-join option for the cluster, but this brings the VM creation time to around 10 minutes. The second solution was to add an arping after the VM boots. VM creation times are now around 2-3 minutes, which is acceptable for me.

I noticed that the tier 1 router becomes responsive after running the command "arping -U -I ens192 <VM_SELF_IP_address> -c 3". As a fix, I have added this command to the template. This brings the creation time to a steady 2-3 minutes.

Reproduction steps

  1. Update the nodes to a new image.
  2. One node is deleted, and a new node is created with the same IP as the deleted node.
  3. The new node cannot reach the tier 1 gateway, so it cannot join the cluster. ...

Expected behavior

The newly created node should be able to reach the gateway right away, which would also reduce the creation time.

Additional context

The error that can be seen when the load balancer is not reachable:

    [preflight] Running pre-flight checks
    [2023-05-26 14:59:39] error execution phase preflight: couldn't validate the identity of the API Server: Get "https://****:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
    [2023-05-26 14:59:39] To see the stack trace of this error execute with --v=5 or higher

arunmk commented 1 year ago

The specific fix as root-caused by @TimTinneveld is to have the following in the cloud-init file:

arping -U -I ens192 <VM_SELF_IP_address> -c 3
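
For anyone who does not want to hard-code the address in the template, here is a minimal sketch of how the announcement could be scripted. It is not taken from the provider's templates. It assumes the interface is ens192 and the count is 3 (both from this issue), that the image ships the iputils `arping` and the iproute2 `ip` tools, and that the first IPv4 address on the interface is the one to announce. The names `first_ipv4`, `announce_ip`, and the `ANNOUNCE_NOW` switch are hypothetical and only used for illustration.

```shell
#!/bin/sh
# Sketch only: announce this VM's IPv4 address via gratuitous ARP after boot.
# ens192 and the count of 3 come from the issue; all other names are illustrative.

IFACE="${IFACE:-ens192}"
COUNT="${COUNT:-3}"

# Read `ip -4 -o addr show` output on stdin and print the first IPv4 address
# without its prefix length (10.0.0.5/24 becomes 10.0.0.5).
first_ipv4() {
    awk '$3 == "inet" { split($4, parts, "/"); print parts[1]; exit }'
}

# Look up the address of $IFACE and send the unsolicited ARP announcements.
announce_ip() {
    addr="$(ip -4 -o addr show dev "$IFACE" | first_ipv4)"
    if [ -z "$addr" ]; then
        echo "no IPv4 address found on $IFACE" >&2
        return 1
    fi
    arping -U -I "$IFACE" -c "$COUNT" "$addr"
}

# Hypothetical switch: nothing is sent unless ANNOUNCE_NOW=1 is set.
if [ "${ANNOUNCE_NOW:-0}" = "1" ]; then
    announce_ip
fi
```

Running it as `ANNOUNCE_NOW=1 sh announce.sh` early in boot, before the node tries to join the cluster, should have the same effect as the single arping line above, provided the interface already has its address at that point.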