Control Plane VM Loses Management IP When Restarted

DennisFaucher commented 3 years ago

Bug Report

I have a working standalone TCE cluster on vSphere 7 on a standard switch. The management IP for the control plane is 1.57 and the secondary IP is 1.144 (DHCP). If I reboot the node, the 1.57 address is gone and I can no longer access the cluster with kubectl get nodes.

Expected Behavior

Reboot nodes and management IP is restored

Steps to Reproduce the Bug

Create standalone cluster Shut down control plane and md nodes Startup both nodes

Screenshots or additional information and context

Environment Details

Build version (tanzu version): version: v0.2.1 buildDate: 2021-09-29 sha: ceaa474
Deployment (Managed/Standalone cluster): Standalone
Infrastructure Provider (Docker/AWS/Azure/vSphere): vSphere
Operating System (client): MacOS 10.15.7

Diagnostics and log bundle

stmcginnis commented 3 years ago

Hi Dennis, thanks for reporting this.

This is a limitation with DHCP, and I think work is started to use some sort of IPAM for address management. But interesting that you are hitting this. Most DHCP services will renew an IP assignment so you end up always getting the same IP after a restart.

Since that doesn't appear to be happening here, I think you would have to set of a static lease to make sure the VMs MAC address always gets assigned the same IP.

I'm going to transfer this over to the tanzu-framework repo. Maybe someone working in that repo might have other ideas how to prevent this from happening.

DennisFaucher commented 3 years ago

Let me clarify a bit. When one deploys a TCE standalone cluster to vSphere, one gets a [CLUSTER]-control VM and a [CLUSTER]-md VM. The [CLUSTER]-control VM has three IP addresses 1) The static IPv4 defined during installation 2) A DHCP IPv4 3) An IPv6. The static IPv4 is the address kubectl tries to communicate with for everything. If I reboot [CLUSTER]-control, the static IP disappears and I am only left with 2) & 3).

stmcginnis commented 3 years ago

Ah, thanks for clarifying @DennisFaucher. Definitely an issue there then.

vrabbi commented 3 years ago

If you ssh to the node, what is the output of "crictl ps"? For some reason kubevip isnt starting so the ip isnt getting published. Also in thr path /etc/kubernetes/manifests there should be a manifest for kube-vip. Can you share the content of that file?

DennisFaucher commented 3 years ago

I'll check that. I haven't found instructions on how to ssh into the control node. I tried using my rsa key but I was still prompted for a password. How does one ssh into an Ubuntu control node? Thank you.

vrabbi commented 3 years ago

ssh with the capv user using the private key matching the key you added at creation time

vrabbi commented 3 years ago

If it asks for a password then you got the ssh key wrong most likely in the deployment manifest

DennisFaucher commented 3 years ago

Screen Shot 2021-10-17 at 2 05 06 PM

capv@stand-control-plane-8whqb:/etc/kubernetes/manifests$ cat kube-vip.yaml apiVersion: v1 kind: Pod metadata: creationTimestamp: null name: kube-vip namespace: kube-system spec: containers:

args:
- start env:
- name: vip_arp value: "true"
- name: vip_leaderelection value: "true"
- name: address value: 192.168.1.57
- name: vip_interface value: eth0
- name: vip_leaseduration value: "15"
- name: vip_renewdeadline value: "10"
- name: vip_retryperiod value: "2" image: projects.registry.vmware.com/tkg/kube-vip:v0.3.3_vmware.1 imagePullPolicy: IfNotPresent name: kube-vip resources: {} securityContext: capabilities: add:
  - NET_ADMIN
  - SYS_TIME volumeMounts:
- mountPath: /etc/kubernetes/admin.conf name: kubeconfig hostNetwork: true volumes:
hostPath: path: /etc/kubernetes/admin.conf type: FileOrCreate name: kubeconfig status: {}

vrabbi commented 3 years ago

And whats the output of "ip addr"

DennisFaucher commented 3 years ago

Screen Shot 2021-10-17 at 4 14 17 PM

vrabbi commented 3 years ago

So it is publishing the ip as can be seen on eth0 as a secondary IP.

vrabbi commented 3 years ago

Does kubectl still not work?

vrabbi commented 3 years ago

It lookg like the needed pods and IP are up and running

DennisFaucher commented 3 years ago

The cluster works fine. If I reboot the control node VM, it loses its static IP address and I'm not sure why.

On Sun, Oct 17, 2021, 4:22 PM Scott Rosenberg @.***> wrote:

It lookg like the needed pods and IP are up and running

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vmware-tanzu/tanzu-framework/issues/900#issuecomment-945188834, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACE5TTXX3ZKOXFBWDEUXNITUHMV6XANCNFSM5GCNYRPA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

vrabbi commented 3 years ago

Can you reproduce and run the commands i sent above to see what the status is? Also does it fix itself after some time or does it stay broken?

On Oct 17, 2021 15:39, Dennis Faucher @.***> wrote:

The cluster works fine. If I reboot the control node VM, it loses its static IP address and I'm not sure why.

On Sun, Oct 17, 2021, 4:22 PM Scott Rosenberg @.***> wrote:

It lookg like the needed pods and IP are up and running

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vmware-tanzu/tanzu-framework/issues/900#issuecomment-945188834, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACE5TTXX3ZKOXFBWDEUXNITUHMV6XANCNFSM5GCNYRPA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/vmware-tanzu/tanzu-framework/issues/900#issuecomment-945191117, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ALR7DWEXSSAE4WWJEWBD3KDUHMX6VANCNFSM5GCNYRPA.

DennisFaucher commented 3 years ago

Yes, I will reboot the controller node using vSphere Client Power > Restart Guest OS and update the Issue.

DennisFaucher commented 3 years ago

Of course, now it is working🤦‍♂️. I'll close the issue and re-open if/when it happens again. Thanks for your help. Screen Shot 2021-10-18 at 8 52 03 AM

DennisFaucher commented 3 years ago

30 minutes later, static IP is gone and cluster unresponsive. I'll run the commands and post the output.

DennisFaucher commented 3 years ago

$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:9a:9a:83 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.130/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 3050sec preferred_lft 3050sec
    inet6 fe80::250:56ff:fe9a:9a83/64 scope link 
       valid_lft forever preferred_lft forever

=============

$ sudo crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID
70cae152bd2d2       640b7ee0df98b       About an hour ago   Running             kube-scheduler            0                   fafb8d79293cd
e312b31b70fa4       060eb69223237       About an hour ago   Running             kube-controller-manager   0                   99ee6b43a83aa
986602ff1b943       6f7c29e5ac889       About an hour ago   Running             etcd                      0                   f25ef127878e4
3cdef5f08b2e7       05d7f1f146f50       About an hour ago   Running             kube-vip                  0                   ba9933fa9561c
d8297ac133974       0b9437b832f65       About an hour ago   Running             kube-apiserver            0                   326ecc46f596e

=====================

$ cat kube-vip.yaml

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: kube-vip
  namespace: kube-system
spec:
  containers:
  - args:
    - start
    env:
    - name: vip_arp
      value: "true"
    - name: vip_leaderelection
      value: "true"
    - name: address
      value: 192.168.1.57
    - name: vip_interface
      value: eth0
    - name: vip_leaseduration
      value: "15"
    - name: vip_renewdeadline
      value: "10"
    - name: vip_retryperiod
      value: "2"
    image: projects.registry.vmware.com/tkg/kube-vip:v0.3.3_vmware.1
    imagePullPolicy: IfNotPresent
    name: kube-vip
    resources: {}
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
        - SYS_TIME
    volumeMounts:
    - mountPath: /etc/kubernetes/admin.conf
      name: kubeconfig
  hostNetwork: true
  volumes:
  - hostPath:
      path: /etc/kubernetes/admin.conf
      type: FileOrCreate
    name: kubeconfig
status: {}

========================= TIA

DennisFaucher commented 3 years ago

Before reboot and lost static IP, ip a looked like this:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:9a:9a:83 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.130/24 brd 192.168.1.255 scope global dynamic eth0
       valid_lft 3227sec preferred_lft 3227sec
    inet 192.168.1.57/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe9a:9a83/64 scope link 
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether f6:e0:9b:1a:47:ad brd ff:ff:ff:ff:ff:ff
4: genev_sys_6081: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether 76:6a:07:94:fd:95 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::746a:7ff:fe94:fd95/64 scope link 
       valid_lft forever preferred_lft forever
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 4a:5a:97:62:37:dd brd ff:ff:ff:ff:ff:ff
    inet 100.96.0.1/24 brd 100.96.0.255 scope global antrea-gw0
       valid_lft forever preferred_lft forever
    inet6 fe80::485a:97ff:fe62:37dd/64 scope link 
       valid_lft forever preferred_lft forever
6: coredns--38a2c8@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 76:69:e2:d8:40:5c brd ff:ff:ff:ff:ff:ff link-netns cni-37aca1ec-8dcb-c0ac-3767-0190376516f0
    inet6 fe80::7469:e2ff:fed8:405c/64 scope link 
       valid_lft forever preferred_lft forever
7: coredns--def150@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 02:c5:49:35:6f:19 brd ff:ff:ff:ff:ff:ff link-netns cni-ddae3805-2e0a-41c2-656a-9f193c18ce6f
    inet6 fe80::c5:49ff:fe35:6f19/64 scope link 
       valid_lft forever preferred_lft forever
8: tanzu-ca-0cc37c@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default 
    link/ether 0e:c8:a1:11:2a:6f brd ff:ff:ff:ff:ff:ff link-netns cni-e5ae675f-4408-7f61-71a8-0f6d2dd77dd2
    inet6 fe80::cc8:a1ff:fe11:2a6f/64 scope link 
       valid_lft forever preferred_lft forever

DennisFaucher commented 3 years ago

Tried cheating

$ sudo ip addr add 192.168.1.57/24 dev eth0

$ ip a 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever 2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000 link/ether 00:50:56:9a:9a:83 brd ff:ff:ff:ff:ff:ff inet 192.168.1.130/24 brd 192.168.1.255 scope global dynamic eth0 valid_lft 3325sec preferred_lft 3325sec inet 192.168.1.57/24 scope global secondary eth0 valid_lft forever preferred_lft forever inet6 fe80::250:56ff:fe9a:9a83/64 scope link valid_lft forever preferred_lft forever

❯ kubectl get nodes NAME STATUS ROLES AGE VERSION stand-control-plane-8whqb.fios-router.home NotReady control-plane,master 2d23h v1.21.2+vmware.1 stand-md-0-d876bfc78-qt5nr.fios-router.home Ready 2d23h v1.21.2+vmware.1

$ sudo journalctl -u kubelet -r

-- Logs begin at Mon 2021-10-18 08:04:55 UTC, end at Mon 2021-10-18 13:55:07 UTC. -- Oct 18 13:55:07 stand-control-plane-8whqb kubelet[543]: E1018 13:55:07.059871 543 kubelet.go:2291] "Error getting node" err="node \"stand-control-plane-8whqb\" not found"

capv@stand-control-plane-8whqb:/etc/systemd/system$ sudo systemctl status kubelet ● kubelet.service - kubelet: The Kubernetes Node Agent Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled) Drop-In: /usr/lib/systemd/system/kubelet.service.d └─10-kubeadm.conf Active: active (running) since Mon 2021-10-18 13:57:31 UTC; 49s ago Docs: https://kubernetes.io/docs/home/ Main PID: 1712 (kubelet) Tasks: 13 (limit: 4690) Memory: 30.0M CGroup: /system.slice/kubelet.service └─1712 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yam>

Oct 18 13:58:20 stand-control-plane-8whqb kubelet[1712]: E1018 13:58:20.805794 1712 kubelet.go:2291] "Error getting node" err="node \"stand-control-plane-8whqb\" not found"

DennisFaucher commented 3 years ago

Added eth0 back into netplan

$ cp 01-netcfg.yaml 01-netcfg.yaml.sav

$ sudo vi 01-netcfg.yaml

network: version: 2 renderer: networkd ethernets: ens192: dhcp4: yes dhcp6: yes eth0: dhcp4: no addresses:

192.168.1.57/24
192.168.1.130/24 gateway4: 192.168.1.1 nameservers: addresses: [192.168.1.1]

$ sudo netplan apply

❯ kubectl get nodes NAME STATUS ROLES AGE VERSION stand-control-plane-8whqb.fios-router.home NotReady control-plane,master 2d23h v1.21.2+vmware.1 stand-md-0-d876bfc78-qt5nr.fios-router.home Ready 2d23h v1.21.2+vmware.1

$ sudo systemctl restart kubelet

$ sudo systemctl status kubelet ● kubelet.service - kubelet: The Kubernetes Node Agent Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled) Drop-In: /usr/lib/systemd/system/kubelet.service.d └─10-kubeadm.conf Active: active (running) since Mon 2021-10-18 14:18:27 UTC; 16s ago Docs: https://kubernetes.io/docs/home/ Main PID: 2299 (kubelet) Tasks: 13 (limit: 4690) Memory: 29.2M CGroup: /system.slice/kubelet.service └─2299 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yam>

Oct 18 14:18:43 stand-control-plane-8whqb kubelet[2299]: E1018 14:18:43.851599 2299 kubelet.go:2291] "Error getting node" err="node \"stand-control-plane-8whqb\" not found"

DennisFaucher commented 3 years ago

vRabbi and I had a lovely Zoom call and it looks like an issue with Antrea and standalone clusters.

DennisFaucher commented 3 years ago

Update: Photon seems to behave better than Ubuntu on Control Plane node reboot. I created a Production Large Photon Management Cluster. Rebooted the control plane node with the static IP address. The static IP address moved to another photon control plane node in the cluster and the cluster remains accessible. With Ubuntu, the static ip address was lost and never returned.

jayunit100 commented 3 years ago

so, btw the 2nd ip address is the antrea OVS one that lives in the pod network
DHCP needs to be user managed , i.e. if you lose DHCP lease on a CAPI VM, your hosed. theres no real workaround.
per @randomvariable , "the (other) cloud providers like AWS automatically create DHCP reservations on first boot"
from talking with others in the cluster-api channel , ALL CAPI providers basically assume the same - if a machine changes or moves IPs, the world will explode

so it really comes down to managing machine's DHCP leases in vsphere, or wherever else, so that they behave as static IPs do.

This is a documentation issue, we need to take existing docs for VMWare Tanzu and make them have the same clarity upstream WRT DHCP and the importance of immutable IPs for nodes.

DennisFaucher commented 3 years ago

Most likely. @vrabbi took off with this.

jayunit100 commented 3 years ago

cc @clintkitson

DennisFaucher commented 3 years ago

Also, to confirm, the controller node static IP address is not coming back. And only when the controller node type is Ubuntu, not Photon.

randomvariable commented 2 years ago

so it really comes down to managing machine's DHCP leases in vsphere, or wherever else, so that they behave as static IPs do.

There's no DHCP server provided by vSphere, so it can't be handled by CAPV alone. vSphere provides L2 segments, and L3 is the network operator's concern. There is however a proposal to integrate IPAM from @schrej that needs reviewing here https://docs.google.com/document/d/1hvirbdV_QTbKBMxgvX045OuV8_f-xZETuH6HENZ0uxQ/edit

DennisFaucher commented 2 years ago

Let me clarify a bit. This has nothing to do with DHCP. When one deploys a TCE standalone cluster to vSphere, one gets a [CLUSTER]-control VM and a [CLUSTER]-md VM. The [CLUSTER]-control VM has three IP addresses 1) The static IPv4 defined during installation 2) A DHCP IPv4 3) An IPv6. The static IPv4 is the address kubectl tries to communicate with for everything. If I reboot [CLUSTER]-control, the static IP disappears and I am only left with 2) & 3).

randomvariable commented 2 years ago

Ah, ok, this is probably cloud-init related then. One for @codenrhoden

randomvariable commented 2 years ago

@DennisFaucher @codenrhoden if we can open up an issue in https://github.com/kubernetes-sigs/image-builder/ , that'd be ideal.

vasicvuk commented 2 years ago

Is there some progress on this issue? I have the same behavior with the latest version

vmware-tanzu / tanzu-framework