DennisFaucher opened this issue 3 years ago
Hi Dennis, thanks for reporting this.
This is a limitation with DHCP, and I think work has started on using some sort of IPAM for address management. But it's interesting that you are hitting this: most DHCP services will renew an IP assignment, so you end up always getting the same IP after a restart.
Since that doesn't appear to be happening here, I think you would have to set up a static lease to make sure the VM's MAC address always gets assigned the same IP.
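For example, with an ISC dhcpd server, a reservation along these lines would pin the assignment (a sketch only; the host name, MAC, and IP here are placeholder values, and other DHCP services have their own syntax for the same idea):

host tce-control-plane {
  # Match the VM's MAC address and always hand out the same lease
  hardware ethernet 00:50:56:9a:9a:83;
  fixed-address 192.168.1.130;
}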
I'm going to transfer this over to the tanzu-framework repo. Maybe someone working in that repo has other ideas on how to prevent this from happening.
Let me clarify a bit. When one deploys a TCE standalone cluster to vSphere, one gets a [CLUSTER]-control VM and a [CLUSTER]-md VM. The [CLUSTER]-control VM has three IP addresses: 1) the static IPv4 defined during installation, 2) a DHCP IPv4, and 3) an IPv6. The static IPv4 is the address kubectl tries to communicate with for everything. If I reboot [CLUSTER]-control, the static IP disappears and I am left with only 2) and 3).
Ah, thanks for clarifying @DennisFaucher. Definitely an issue there then.
If you ssh to the node, what is the output of "crictl ps"? For some reason kube-vip isn't starting, so the IP isn't getting published. Also, in the path /etc/kubernetes/manifests there should be a manifest for kube-vip. Can you share the contents of that file?
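If crictl is set up on the node, something like this should show whether kube-vip is running and what it logged (a sketch; the --name filter assumes the container is actually named kube-vip):

sudo crictl ps
sudo crictl logs $(sudo crictl ps -q --name kube-vip)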
I'll check that. I haven't found instructions on how to ssh into the control node. I tried using my RSA key, but I was still prompted for a password. How does one ssh into an Ubuntu control node? Thank you.
ssh as the capv user, using the private key matching the public key you added at creation time.
If it asks for a password, then you most likely got the ssh key wrong in the deployment manifest.
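Something along these lines, assuming the matching private key is at ~/.ssh/id_rsa (adjust the key path and node IP for your environment):

ssh -i ~/.ssh/id_rsa capv@<control-plane-node-ip>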
capv@stand-control-plane-8whqb:/etc/kubernetes/manifests$ cat kube-vip.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: kube-vip
  namespace: kube-system
spec:
  containers:
And what's the output of "ip addr"?
So it is publishing the IP, as can be seen on eth0 as a secondary address.
Does kubectl still not work?
It looks like the needed pods and IP are up and running.
The cluster works fine. If I reboot the control node VM, it loses its static IP address and I'm not sure why.
Can you reproduce it and run the commands I sent above to see what the status is? Also, does it fix itself after some time, or does it stay broken?
Yes, I will reboot the controller node using vSphere Client Power > Restart Guest OS and update the Issue.
Of course, now it is working 🤦‍♂️. I'll close the issue and re-open if/when it happens again. Thanks for your help.
30 minutes later, the static IP is gone and the cluster is unresponsive. I'll run the commands and post the output.
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:9a:9a:83 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.130/24 brd 192.168.1.255 scope global dynamic eth0
valid_lft 3050sec preferred_lft 3050sec
inet6 fe80::250:56ff:fe9a:9a83/64 scope link
valid_lft forever preferred_lft forever
=============
$ sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
70cae152bd2d2 640b7ee0df98b About an hour ago Running kube-scheduler 0 fafb8d79293cd
e312b31b70fa4 060eb69223237 About an hour ago Running kube-controller-manager 0 99ee6b43a83aa
986602ff1b943 6f7c29e5ac889 About an hour ago Running etcd 0 f25ef127878e4
3cdef5f08b2e7 05d7f1f146f50 About an hour ago Running kube-vip 0 ba9933fa9561c
d8297ac133974 0b9437b832f65 About an hour ago Running kube-apiserver 0 326ecc46f596e
=====================
$ cat kube-vip.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - args:
    - start
    env:
    - name: vip_arp
      value: "true"
    - name: vip_leaderelection
      value: "true"
    - name: address
      value: 192.168.1.57
    - name: vip_interface
      value: eth0
    - name: vip_leaseduration
      value: "15"
    - name: vip_renewdeadline
      value: "10"
    - name: vip_retryperiod
      value: "2"
    image: projects.registry.vmware.com/tkg/kube-vip:v0.3.3_vmware.1
    imagePullPolicy: IfNotPresent
    name: kube-vip
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - SYS_TIME
    volumeMounts:
    - mountPath: /etc/kubernetes/admin.conf
      name: kubeconfig
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/admin.conf
      type: FileOrCreate
    name: kubeconfig
status: {}
=========================
TIA
Before the reboot that lost the static IP, ip a looked like this:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:9a:9a:83 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.130/24 brd 192.168.1.255 scope global dynamic eth0
valid_lft 3227sec preferred_lft 3227sec
inet 192.168.1.57/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fe9a:9a83/64 scope link
valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether f6:e0:9b:1a:47:ad brd ff:ff:ff:ff:ff:ff
4: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
link/ether 76:6a:07:94:fd:95 brd ff:ff:ff:ff:ff:ff
inet6 fe80::746a:7ff:fe94:fd95/64 scope link
valid_lft forever preferred_lft forever
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
link/ether 4a:5a:97:62:37:dd brd ff:ff:ff:ff:ff:ff
inet 100.96.0.1/24 brd 100.96.0.255 scope global antrea-gw0
valid_lft forever preferred_lft forever
inet6 fe80::485a:97ff:fe62:37dd/64 scope link
valid_lft forever preferred_lft forever
6: coredns--38a2c8@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
link/ether 76:69:e2:d8:40:5c brd ff:ff:ff:ff:ff:ff link-netns cni-37aca1ec-8dcb-c0ac-3767-0190376516f0
inet6 fe80::7469:e2ff:fed8:405c/64 scope link
valid_lft forever preferred_lft forever
7: coredns--def150@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
link/ether 02:c5:49:35:6f:19 brd ff:ff:ff:ff:ff:ff link-netns cni-ddae3805-2e0a-41c2-656a-9f193c18ce6f
inet6 fe80::c5:49ff:fe35:6f19/64 scope link
valid_lft forever preferred_lft forever
8: tanzu-ca-0cc37c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
link/ether 0e:c8:a1:11:2a:6f brd ff:ff:ff:ff:ff:ff link-netns cni-e5ae675f-4408-7f61-71a8-0f6d2dd77dd2
inet6 fe80::cc8:a1ff:fe11:2a6f/64 scope link
valid_lft forever preferred_lft forever
Tried cheating
$ sudo ip addr add 192.168.1.57/24 dev eth0
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 00:50:56:9a:9a:83 brd ff:ff:ff:ff:ff:ff
inet 192.168.1.130/24 brd 192.168.1.255 scope global dynamic eth0
valid_lft 3325sec preferred_lft 3325sec
inet 192.168.1.57/24 scope global secondary eth0
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fe9a:9a83/64 scope link
valid_lft forever preferred_lft forever
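Worth noting: kube-vip publishes the VIP as a /32 host address (see the pre-reboot output above), so the closer equivalent of what was lost would be:

sudo ip addr add 192.168.1.57/32 dev eth0

Either way, an address added by hand like this is not persistent and will disappear again on the next reboot unless it is written into the network configuration.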
β― kubectl get nodes
NAME STATUS ROLES AGE VERSION
stand-control-plane-8whqb.fios-router.home NotReady control-plane,master 2d23h v1.21.2+vmware.1
stand-md-0-d876bfc78-qt5nr.fios-router.home Ready
$ sudo journalctl -u kubelet -r
-- Logs begin at Mon 2021-10-18 08:04:55 UTC, end at Mon 2021-10-18 13:55:07 UTC. --
Oct 18 13:55:07 stand-control-plane-8whqb kubelet[543]: E1018 13:55:07.059871 543 kubelet.go:2291] "Error getting node" err="node \"stand-control-plane-8whqb\" not found"
capv@stand-control-plane-8whqb:/etc/systemd/system$ sudo systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2021-10-18 13:57:31 UTC; 49s ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 1712 (kubelet)
    Tasks: 13 (limit: 4690)
   Memory: 30.0M
   CGroup: /system.slice/kubelet.service
           └─1712 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yam>
Oct 18 13:58:20 stand-control-plane-8whqb kubelet[1712]: E1018 13:58:20.805794 1712 kubelet.go:2291] "Error getting node" err="node \"stand-control-plane-8whqb\" not found"
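That "Error getting node" loop usually means the kubelet cannot reach the API server. A quick check from the node itself, assuming the API server listens on the VIP at the default port 6443:

curl -k https://192.168.1.57:6443/healthz

With the VIP missing from eth0, this should fail to connect, which matches the symptom.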
Added eth0 back into netplan
$ cp 01-netcfg.yaml 01-netcfg.yaml.sav
$ sudo vi 01-netcfg.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    ens192:
      dhcp4: yes
      dhcp6: yes
    eth0:
      dhcp4: no
      addresses:
$ sudo netplan apply
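For reference, a complete eth0 stanza pinning the VIP might look like this (a sketch; the address is the cluster VIP from the kube-vip manifest above, and statically pinning it can conflict with kube-vip's own leader election on multi-control-plane clusters):

network:
  version: 2
  renderer: networkd
  ethernets:
    ens192:
      dhcp4: yes
      dhcp6: yes
    eth0:
      dhcp4: no
      addresses:
        - 192.168.1.57/32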
β― kubectl get nodes
NAME STATUS ROLES AGE VERSION
stand-control-plane-8whqb.fios-router.home NotReady control-plane,master 2d23h v1.21.2+vmware.1
stand-md-0-d876bfc78-qt5nr.fios-router.home Ready
$ sudo systemctl restart kubelet
$ sudo systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
  Drop-In: /usr/lib/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2021-10-18 14:18:27 UTC; 16s ago
     Docs: https://kubernetes.io/docs/home/
 Main PID: 2299 (kubelet)
    Tasks: 13 (limit: 4690)
   Memory: 29.2M
   CGroup: /system.slice/kubelet.service
           └─2299 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yam>
Oct 18 14:18:43 stand-control-plane-8whqb kubelet[2299]: E1018 14:18:43.851599 2299 kubelet.go:2291] "Error getting node" err="node \"stand-control-plane-8whqb\" not found"
vRabbi and I had a lovely Zoom call and it looks like an issue with Antrea and standalone clusters.
Update: Photon seems to behave better than Ubuntu on control plane node reboot. I created a Production (Large) Photon management cluster, then rebooted the control plane node holding the static IP address. The static IP address moved to another Photon control plane node in the cluster and the cluster remained accessible. With Ubuntu, the static IP address was lost and never returned.
So it really comes down to managing machines' DHCP leases in vSphere, or wherever else, so that they behave as static IPs do.
This is a documentation issue: we need to take the existing docs for VMware Tanzu and give the upstream docs the same clarity with respect to DHCP and the importance of immutable IPs for nodes.
Most likely. @vrabbi took off with this.
cc @clintkitson
Also, to confirm: the controller node's static IP address is not coming back, and only when the controller node type is Ubuntu, not Photon.
> So it really comes down to managing machines' DHCP leases in vSphere, or wherever else, so that they behave as static IPs do.
There's no DHCP server provided by vSphere, so this can't be handled by CAPV alone. vSphere provides L2 segments, and L3 is the network operator's concern. There is, however, a proposal from @schrej to integrate IPAM that needs reviewing here: https://docs.google.com/document/d/1hvirbdV_QTbKBMxgvX045OuV8_f-xZETuH6HENZ0uxQ/edit
Let me clarify a bit. This has nothing to do with DHCP. When one deploys a TCE standalone cluster to vSphere, one gets a [CLUSTER]-control VM and a [CLUSTER]-md VM. The [CLUSTER]-control VM has three IP addresses: 1) the static IPv4 defined during installation, 2) a DHCP IPv4, and 3) an IPv6. The static IPv4 is the address kubectl tries to communicate with for everything. If I reboot [CLUSTER]-control, the static IP disappears and I am left with only 2) and 3).
Ah, ok, this is probably cloud-init related then. One for @codenrhoden
@DennisFaucher @codenrhoden if we can open up an issue in https://github.com/kubernetes-sigs/image-builder/ , that'd be ideal.
Is there any progress on this issue? I have the same behavior with the latest version.
Bug Report
I have a working standalone TCE cluster on vSphere 7 on a standard switch. The management IP for the control plane is 192.168.1.57 and the secondary IP is 192.168.1.144 (DHCP). If I reboot the node, the 192.168.1.57 address is gone and I can no longer access the cluster with kubectl get nodes.
Expected Behavior
Reboot the nodes and the management IP is restored.
Steps to Reproduce the Bug
1. Create a standalone cluster
2. Shut down the control plane and md nodes
3. Start up both nodes
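Roughly, for TCE v0.2.x (the command form is approximate; the cluster name and config file are placeholders, with VSPHERE_CONTROL_PLANE_ENDPOINT set to the static management IP chosen in the installer):

tanzu standalone-cluster create stand -f stand-config.yaml   # config sets VSPHERE_CONTROL_PLANE_ENDPOINT: 192.168.1.57
# then power off and power on both VMs from the vSphere Client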
Screenshots or additional information and context
Environment Details
tanzu version:
version: v0.2.1
buildDate: 2021-09-29
sha: ceaa474

Diagnostics and log bundle