Cluster API Provider for VMware Cloud Director. The project is an open source implementation of the Kubernetes Cluster API project and allows customers to provision resources directly from VMware Cloud Director. It enables Cloud Director powered clouds to be treated as yet another cloud in the multi-cloud journey for VMware Cloud Providers.
Apache License 2.0
A newly created node after updating is not able to reach the T1 edge. #504
Describe the bug
After a redeployment or update of control planes or workers, the newly created VM is immediately given the same IP address as the deleted old VM (because that is the first free IP in the pool). This can cause issues because, most of the time, the templates provided by VMware do not send any ARP announcements that they are now using the previously used IP. By default, NSX-T only rediscovers new IPs after a timeout of 8 minutes (the ARP table in NSX-T is not updated right away). Because of this stale ARP entry, the VM cannot ping the T1 router, and therefore it also cannot reach the API server and join the cluster.
So far I have been able to enable the retry-join option for the cluster, but this pushes VM creation time to around 10 minutes. The second solution was to add an arping after booting the VM; VM creation times are now around 2-3 minutes, which is acceptable for me.
I noticed that after running the command "arping -U -I ens192 -c 3" the Tier-1 router becomes responsive. As a fix I have added this command to the template; this brings the creation time to a steady 2-3 minutes every time.
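For reference, one place such a workaround can be wired in without rebuilding the OVA is the Cluster API bootstrap template, so the gratuitous ARP is sent right before kubeadm runs. A minimal sketch, assuming a standard KubeadmConfigTemplate is used for the workers; the resource name, namespace, and the alternative arping invocation shown in the comment are assumptions for illustration, not taken from this issue:

apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
  name: my-cluster-md-0        # placeholder; use the template your MachineDeployment references
  namespace: default
spec:
  template:
    spec:
      preKubeadmCommands:
        # Announce the reused IP with gratuitous ARP so the NSX-T Tier-1
        # gateway refreshes its ARP entry before kubeadm tries to join.
        # The command from this issue is used as-is; some arping builds also
        # expect the node's own IP as the final argument (an assumption, not
        # from the issue), e.g.:
        #   arping -U -I ens192 -c 3 "$(hostname -I | awk '{print $1}')"
        - arping -U -I ens192 -c 3 || true

If the control-plane nodes are affected as well, the same line could presumably go under the KubeadmControlPlane's kubeadmConfigSpec.preKubeadmCommands.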
Reproduction steps
Update the nodes to a new image.
After one node is deleted, a new node is created with the same IP as the deleted node.
The new node cannot reach the Tier-1 gateway, and because of this it cannot join the cluster.
...
Expected behavior
The newly created node should be able to reach the gateway right away, which brings the creation time back down.
Additional context
The error that can be seen when the load balancer is not reachable: "[preflight] Running pre-flight checks
[2023-05-26 14:59:39] error execution phase preflight: couldn't validate the identity of the API Server: Get "https://****:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
[2023-05-26 14:59:39] To see the stack trace of this error execute with --v=5 or higher"