Closed: w13915984028 closed this issue 2 years ago
Could you log in to the VM via VNC and dump the "ip a" result?
Looks a bit tricky.
The F12 console page shows the mgmt URL as 192.168.122.200, and it is not ready,
but "ip addr show dev harvester-mgmt" reports 192.168.122.11, which is pingable from the host OS:
admin@provoday0:~> ping 192.168.122.11
PING 192.168.122.11 (192.168.122.11) 56(84) bytes of data.
64 bytes from 192.168.122.11: icmp_seq=1 ttl=64 time=0.347 ms
64 bytes from 192.168.122.11: icmp_seq=2 ttl=64 time=0.324 ms
64 bytes from 192.168.122.11: icmp_seq=3 ttl=64 time=0.305 ms
^C
--- 192.168.122.11 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2049ms
rtt min/avg/max/mdev = 0.305/0.325/0.347/0.022 ms
admin@provoday0:~>
The VM has not been restarted since the last installation:
rancher@harvmain0112:~> uptime
18:01:45 up 12 days 0:40, 2 users, load average: 11.03, 11.47, 11.40
rancher@harvmain0112:~> ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master harvester-mgmt state UP group default qlen 1000
link/ether 52:54:00:90:e2:26 brd ff:ff:ff:ff:ff:ff
altname enp0s3
3: harvester-mgmt: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 52:54:00:90:e2:26 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.11/24 brd 192.168.122.255 scope global harvester-mgmt
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe90:e226/64 scope link
valid_lft forever preferred_lft forever
6: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 1e:6f:98:12:cd:76 brd ff:ff:ff:ff:ff:ff
inet 10.52.0.0/32 brd 10.52.0.0 scope global flannel.1
valid_lft forever preferred_lft forever
inet6 fe80::1c6f:98ff:fe12:cd76/64 scope link
valid_lft forever preferred_lft forever
7: calib76ee3d87dc@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-313ef4eb-b374-2433-7bde-772aa0ee20b1
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
The pods are in a bad state:
rancher@harvmain0112:~> kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-local-system fleet-agent-7cbd6946f9-4xghw 1/1 Running 144 12d
cattle-fleet-system fleet-controller-7765f46db-26gzw 0/1 CrashLoopBackOff 1448 12d
cattle-fleet-system gitjob-95bb5f685-8wgrj 1/1 Running 143 12d
cattle-monitoring-system prometheus-rancher-monitoring-prometheus-0 0/3 ContainerCreating 0 5d20h
cattle-monitoring-system rancher-monitoring-admission-create-nww66 0/1 Completed 0 6d15h
cattle-monitoring-system rancher-monitoring-crd-create-ckjsz 0/1 Completed 0 6d15h
cattle-monitoring-system rancher-monitoring-grafana-7f54b7d8bc-jggld 0/3 Init:0/2 0 5d16h
cattle-monitoring-system rancher-monitoring-kube-state-metrics-744b9448f4-bgdqc 1/1 Running 30 12d
cattle-monitoring-system rancher-monitoring-operator-754bcd8cb4-j9lwc 1/1 Running 2 12d
cattle-monitoring-system rancher-monitoring-prometheus-adapter-77568b975-jdn8z 0/1 Error 1477 12d
cattle-monitoring-system rancher-monitoring-prometheus-node-exporter-dgdv7 1/1 Running 77 12d
cattle-system harvester-cluster-repo-6d7777b9c7-mcwg8 1/1 Running 2 12d
cattle-system rancher-7b76fb5dd5-qknw7 0/1 Running 1504 12d
cattle-system rancher-webhook-fcd8cdc88-g8pph 1/1 Running 1 12d
cattle-system system-upgrade-controller-7c878c4798-n2tp4 1/1 Running 0 12d
harvester-system harvester-5bd4876c66-dzfsn 1/1 Running 75 11d
harvester-system harvester-load-balancer-5b4b949748-5nfg6 1/1 Running 145 12d
harvester-system harvester-network-controller-5nl26 1/1 Running 1 12d
harvester-system harvester-network-controller-manager-7769fd599d-dhqn7 1/1 Running 149 12d
harvester-system harvester-network-controller-manager-7769fd599d-gmz2v 1/1 Running 143 12d
harvester-system harvester-node-disk-manager-wtq4z 1/1 Running 2 12d
harvester-system harvester-webhook-7f568f68fb-vb9vn 0/1 Pending 0 11d
harvester-system harvester-webhook-98575b94b-xb6j9 1/1 Running 0 11d
harvester-system kube-vip-cloud-provider-0 1/1 Running 157 12d
harvester-system kube-vip-sjgt4 1/1 Running 194 12d
harvester-system virt-api-86455cdb7d-8ch6b 0/1 Running 0 12d
harvester-system virt-api-86455cdb7d-vz6qc 0/1 Running 1 12d
harvester-system virt-controller-5f649999dd-bl7k5 0/1 Running 1805 12d
harvester-system virt-controller-5f649999dd-rp4tw 0/1 CrashLoopBackOff 1801 12d
harvester-system virt-handler-z49mh 0/1 Running 1387 12d
harvester-system virt-operator-56c5bdc7b8-9v2tf 0/1 CrashLoopBackOff 1442 12d
kube-system cloud-controller-manager-harvmain0112 1/1 Running 189 12d
kube-system etcd-harvmain0112 1/1 Running 7 4d16h
kube-system helm-install-rke2-canal-8czdv 0/1 Completed 0 12d
kube-system helm-install-rke2-coredns-kbvc7 0/1 Completed 0 12d
kube-system helm-install-rke2-ingress-nginx-7nfdc 0/1 Completed 0 12d
kube-system helm-install-rke2-metrics-server-tj9bq 0/1 Completed 0 12d
kube-system helm-install-rke2-multus-bdzc4 0/1 Completed 0 12d
kube-system kube-apiserver-harvmain0112 1/1 Running 0 9d
kube-system kube-controller-manager-harvmain0112 1/1 Running 190 12d
kube-system kube-multus-ds-hznds 1/1 Running 1 12d
kube-system kube-proxy-harvmain0112 1/1 Running 1 12d
kube-system kube-scheduler-harvmain0112 1/1 Running 183 12d
kube-system rke2-canal-dscrn 1/2 CrashLoopBackOff 1929 12d
kube-system rke2-coredns-rke2-coredns-7bb4f446c-nbxcv 1/1 Running 1 12d
kube-system rke2-coredns-rke2-coredns-autoscaler-7c58bd5b6c-4xhh4 1/1 Running 52 12d
kube-system rke2-ingress-nginx-controller-2hbjc 0/1 CrashLoopBackOff 2011 12d
kube-system rke2-metrics-server-5df7d77b5b-rx8v7 0/1 CrashLoopBackOff 1473 12d
kube-system snapshot-controller-9f68fdd9-cc86j 1/1 Running 154 12d
kube-system snapshot-controller-9f68fdd9-scdsk 1/1 Running 143 12d
longhorn-system backing-image-manager-c00e-ecd3 0/1 Running 0 5d9h
longhorn-system csi-attacher-66fcbbff5c-9mz49 1/1 Running 151 12d
longhorn-system csi-attacher-66fcbbff5c-hqh5d 1/1 Running 140 12d
longhorn-system csi-attacher-66fcbbff5c-xqmjz 1/1 Running 145 12d
longhorn-system csi-provisioner-84fcfbf785-6r5hl 0/1 CrashLoopBackOff 1447 12d
longhorn-system csi-provisioner-84fcfbf785-p4lxg 0/1 CrashLoopBackOff 1434 12d
longhorn-system csi-provisioner-84fcfbf785-swj2j 0/1 CrashLoopBackOff 1440 12d
longhorn-system csi-resizer-58ff455cdb-4wmsc 1/1 Running 126 12d
longhorn-system csi-resizer-58ff455cdb-gbrzw 1/1 Running 122 12d
longhorn-system csi-resizer-58ff455cdb-kcgfr 1/1 Running 120 12d
longhorn-system csi-snapshotter-59f5cd8b8c-4xfbl 1/1 Running 135 12d
longhorn-system csi-snapshotter-59f5cd8b8c-dmdm7 1/1 Running 127 12d
longhorn-system csi-snapshotter-59f5cd8b8c-kxqjr 1/1 Running 125 12d
longhorn-system engine-image-ei-a6c8003e-q74hj 1/1 Running 0 12d
longhorn-system longhorn-csi-plugin-nds9l 2/2 Running 3 12d
longhorn-system longhorn-driver-deployer-97d65ccb8-tt2dg 1/1 Running 0 12d
longhorn-system longhorn-manager-2wd5g 0/1 Running 9 12d
longhorn-system longhorn-post-upgrade-77dq6 0/1 Completed 0 6d16h
longhorn-system longhorn-ui-55cb5cdc88-5q8mt 1/1 Running 3 12d
rancher@harvmain0112:~>
rancher@harvmain0112:~> kubectl get service -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cattle-fleet-system gitjob ClusterIP 10.53.150.72 <none> 80/TCP 12d
cattle-monitoring-system prometheus-operated ClusterIP None <none> 9090/TCP 12d
cattle-monitoring-system rancher-monitoring-grafana ClusterIP 10.53.124.24 <none> 80/TCP 12d
cattle-monitoring-system rancher-monitoring-kube-state-metrics ClusterIP 10.53.84.253 <none> 8080/TCP 12d
cattle-monitoring-system rancher-monitoring-operator ClusterIP 10.53.55.64 <none> 443/TCP 12d
cattle-monitoring-system rancher-monitoring-prometheus ClusterIP 10.53.62.99 <none> 9090/TCP 12d
cattle-monitoring-system rancher-monitoring-prometheus-adapter ClusterIP 10.53.233.76 <none> 443/TCP 12d
cattle-monitoring-system rancher-monitoring-prometheus-node-exporter ClusterIP 10.53.173.98 <none> 9796/TCP 12d
cattle-system harvester-cluster-repo ClusterIP 10.53.83.100 <none> 80/TCP 12d
cattle-system rancher ClusterIP 10.53.128.146 <none> 80/TCP,443/TCP 12d
cattle-system rancher-webhook ClusterIP 10.53.191.69 <none> 443/TCP 12d
cattle-system webhook-service ClusterIP 10.53.41.133 <none> 443/TCP 12d
default kubernetes ClusterIP 10.53.0.1 <none> 443/TCP 12d
harvester-system harvester ClusterIP 10.53.85.35 <none> 8443/TCP 12d
harvester-system harvester-webhook ClusterIP 10.53.120.75 <none> 443/TCP 12d
harvester-system kubevirt-operator-webhook ClusterIP 10.53.52.244 <none> 443/TCP 12d
harvester-system kubevirt-prometheus-metrics ClusterIP 10.53.111.120 <none> 443/TCP 12d
harvester-system virt-api ClusterIP 10.53.219.64 <none> 443/TCP 12d
kube-system ingress-expose LoadBalancer 10.53.233.196 192.168.122.200 443:31255/TCP,80:32106/TCP 12d
kube-system rancher-monitoring-coredns ClusterIP None <none> 9153/TCP 12d
kube-system rancher-monitoring-kubelet ClusterIP None <none> 10250/TCP,10255/TCP,4194/TCP 12d
kube-system rke2-coredns-rke2-coredns ClusterIP 10.53.0.10 <none> 53/UDP,53/TCP 12d
kube-system rke2-ingress-nginx-controller-admission ClusterIP 10.53.167.197 <none> 443/TCP 12d
kube-system rke2-metrics-server ClusterIP 10.53.50.165 <none> 443/TCP 12d
longhorn-system csi-attacher ClusterIP 10.53.97.224 <none> 12345/TCP 12d
longhorn-system csi-provisioner ClusterIP 10.53.208.225 <none> 12345/TCP 12d
longhorn-system csi-resizer ClusterIP 10.53.223.141 <none> 12345/TCP 12d
longhorn-system csi-snapshotter ClusterIP 10.53.52.108 <none> 12345/TCP 12d
longhorn-system longhorn-backend ClusterIP 10.53.3.200 <none> 9500/TCP 12d
longhorn-system longhorn-engine-manager ClusterIP None <none> <none> 12d
longhorn-system longhorn-frontend ClusterIP 10.53.118.5 <none> 80/TCP 12d
longhorn-system longhorn-replica-manager ClusterIP None <none> <none> 12d
kubectl logs deployment/fleet-controller -n cattle-fleet-system
Error: Get "https://10.53.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions": dial tcp 10.53.0.1:443: connect: no route to host
..
time="2021-12-13T18:06:34Z" level=fatal msg="Get \"https://10.53.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions\": dial tcp 10.53.0.1:443: connect: no route to
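The "no route to host" error can be cross-checked with a quick probe of the in-cluster service IP; a minimal sketch using only bash's /dev/tcp (the IP is copied from the logs above, and in a healthy cluster the probe would succeed):

```shell
# Probe the kube API service IP seen in the fleet-controller error.
# Uses bash's built-in /dev/tcp so no extra tools are required.
if timeout 3 bash -c 'exec 3<>/dev/tcp/10.53.0.1/443' 2>/dev/null; then
  echo "kube API service IP 10.53.0.1 is reachable"
else
  echo "no route to kube API service IP 10.53.0.1"
fi
```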
So the node IP is accessible, but the VIP (192.168.122.200) is not. Logs from the kube-vip pod:
E1214 02:08:46.546969 1 leaderelection.go:325] error retrieving resource lock harvester-system/plndr-svcs-lock: Get "https://10.53.0.1:443/apis/coordination.k8s.io/v1/namespaces/harvester-system/leases/plndr-svcs-lock": dial tcp 10.53.0.1:443: connect: no route to host
Check the endpoint:
$ kubectl get ep
NAME ENDPOINTS AGE
kubernetes 192.168.122.13:6443 12d
Note that the registered node IP (192.168.122.13) differs from the current one (192.168.122.11). The cause of the problem is that the node IP changed.
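The mismatch can be expressed as a simple check; a sketch with both addresses copied from this issue's output (on a live cluster they would come from "kubectl get ep kubernetes" and "ip -4 addr show dev harvester-mgmt"):

```shell
# Compare the API endpoint IP registered in the kubernetes Endpoints object
# with the address currently assigned to the management interface.
EP_IP="192.168.122.13"    # registered endpoint (kubectl get ep kubernetes)
NODE_IP="192.168.122.11"  # current address on harvester-mgmt (ip addr)
if [ "$EP_IP" != "$NODE_IP" ]; then
  echo "node IP changed: endpoint=$EP_IP current=$NODE_IP"
fi
```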
After Harvester installation, the node (VM) ran 12+ days without rebooting; which module may have changed the node IP?
The node (VM) runs on top of KVM: it gets an IP address from KVM at boot, and after boot the IP normally stays unchanged.
I saw that the NIC uses DHCP mode. Could it be a misconfiguration or loss of leases on the DHCP server?
The node VM is attached to the "default" network of KVM; DHCP is an embedded feature of KVM that provides 192.168.122.* addresses to guest VMs, plus NAT for internet access.
Meanwhile, another VM (an Ubuntu server) runs in the same mode as the Harvester VM, and its IP has never changed after booting (30+ days).
The difference is that the Harvester VM creates a harvester-mgmt interface, removes the original node IP from ens3/enp0s3, and attaches it to the mgmt interface. Maybe this behavior triggers something tricky.
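The re-parenting is visible in the "ip addr" output itself; a small sketch that parses the sample lines from this issue (embedded here as a string for illustration):

```shell
# Extract the slave/master relationship and the address owner from the
# `ip addr` sample shown in this issue.
IP_OUT='2: ens3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master harvester-mgmt state UP
3: harvester-mgmt: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    inet 192.168.122.11/24 brd 192.168.122.255 scope global harvester-mgmt'
# Physical NIC lines carry "master <dev>"; address lines start with "inet".
echo "$IP_OUT" | awk '/ master / {gsub(":","",$2); print $2, "is enslaved to", $9}'
echo "$IP_OUT" | awk '/inet /   {print $2, "is held by", $NF}'
```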
Harvester VM:
2: ens3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master harvester-mgmt state UP group default qlen 1000
link/ether 52:54:00:90:e2:26 brd ff:ff:ff:ff:ff:ff
altname enp0s3
3: harvester-mgmt: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 52:54:00:90:e2:26 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.11/24 brd 192.168.122.255 scope global harvester-mgmt
valid_lft forever preferred_lft forever
inet6 fe80::5054:ff:fe90:e226/64 scope link
valid_lft forever preferred_lft forever
rancher@ubuntuvmday0:~$ ip addr
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 52:54:00:10:4f:08 brd ff:ff:ff:ff:ff:ff
inet 192.168.122.85/24 brd 192.168.122.255 scope global dynamic enp1s0
valid_lft 2159sec preferred_lft 2159sec
inet6 fe80::5054:ff:fe10:4f08/64 scope link
valid_lft forever preferred_lft forever
HostOS KVM "default" network: NAT mode; DHCP for guest IP provision
admin@provoday0:~> sudo virsh net-dumpxml default
<network connections='2'>
<name>default</name>
<uuid>df36fd7c-e2f8-4910-b45a-d3bf1238c919</uuid>
<forward mode='nat'>
<nat>
<port start='1024' end='65535'/>
</nat>
</forward>
<bridge name='virbr0' stp='on' delay='0'/>
<mac address='52:54:00:ea:e2:f8'/>
<ip address='192.168.122.1' netmask='255.255.255.0'>
<dhcp>
<range start='192.168.122.2' end='192.168.122.254'/>
<bootp file='ipxe-create' server='192.168.122.85'/> --------> this config is for PXE test, normally there is no such line
</dhcp>
</ip>
</network>
admin@provoday0:~>
The PROVO DAY0 KVM "default" network cannot use "bridge" mode. I tried: the PROVO DC DHCP server does not allocate an IP to a VM running on top of Provo DAY0, because it operates in MAC-IP binding mode and the VM's virtually generated MAC is not recorded in the DC DHCP server.
We might need to check whether DHCP lease renewal for the mgmt bond proceeds as expected.
KVM has the following debug info: the VIP is not in the DHCP guest list, and there is no DHCP history info.
Guess: the cluster components/network first have an issue, then the VIP controller fails to renew the VIP via DHCP (from KVM), and finally the VIP is lost.
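One rough way to watch for a renewal failure is to compare lease expiry against the clock; a sketch using a lease line copied from the "virsh net-dhcp-leases" output in this issue, with the reference time pinned so the example is repeatable:

```shell
# Compute seconds until a DHCP lease expires, from a `virsh net-dhcp-leases`
# line. "NOW" is fixed for illustration; use `date +%s` on a live host.
LEASE='2021-12-14 05:02:29  52:54:00:90:e2:26  ipv4  192.168.122.11/24  harvmain0112'
EXPIRY=$(echo "$LEASE" | awk '{print $1" "$2}')
IPADDR=$(echo "$LEASE" | awk '{print $4}')
NOW='2021-12-14 04:50:00'
REMAIN=$(( $(date -d "$EXPIRY" +%s) - $(date -d "$NOW" +%s) ))
echo "lease for $IPADDR expires in ${REMAIN}s"
```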
virsh # domifaddr hmain2911 --full
Name MAC address Protocol Address
-------------------------------------------------------------------------------
vnet54 52:54:00:90:e2:26 ipv4 192.168.122.11/24
virsh # domifaddr ubuntu20.04 --full
Name MAC address Protocol Address
-------------------------------------------------------------------------------
vnet0 52:54:00:10:4f:08 ipv4 192.168.122.85/24
virsh # net-dhcp-leases --network default
Expiry Time MAC address Protocol IP address Hostname Client ID or DUID
---------------------------------------------------------------------------------------------------------------------------------------------------
2021-12-14 04:59:16 52:54:00:10:4f:08 ipv4 192.168.122.85/24 ubuntuvmday0 ff:56:50:4d:98:00:02:00:00:ab:11:b5:85:b2:62:9e:64:3c:a1
2021-12-14 05:02:29 52:54:00:90:e2:26 ipv4 192.168.122.11/24 harvmain0112 ff:00:90:e2:26:00:01:00:01:29:3a:6b:a3:52:54:00:90:e2:26
provoday0:/home/admin # dmesg | grep vnet54
[2411830.673235] virbr0: port 2(vnet54) entered blocking state
[2411830.673239] virbr0: port 2(vnet54) entered disabled state
[2411830.673384] device vnet54 entered promiscuous mode
[2411830.673612] virbr0: port 2(vnet54) entered blocking state
[2411830.673614] virbr0: port 2(vnet54) entered listening state
[2411832.685161] virbr0: port 2(vnet54) entered learning state
[2411834.701128] virbr0: port 2(vnet54) entered forwarding state
[2411834.701139] virbr0: topology change detected, propagating
[2421438.859043] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[2507898.036183] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[2579281.008378] FS-Cache: Loaded
[2579281.088907] RPC: Registered named UNIX socket transport module.
[2579281.088911] RPC: Registered udp transport module.
[2579281.088912] RPC: Registered tcp transport module.
[2579281.088913] RPC: Registered tcp NFSv4.1 backchannel transport module.
[2579281.167505] FS-Cache: Netfs 'nfs' registered for caching
[2579281.275462] Key type dns_resolver registered
[2579281.535414] NFS: Registering the id_resolver key type
[2579281.535427] Key type id_resolver registered
[2579281.535428] Key type id_legacy registered
[2594356.922544] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[2680815.803031] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[2767274.681385] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[2853733.629418] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[2940192.577587] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[3026651.395416] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[3113083.657239] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[3199509.337159] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[3285968.328645] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[3372427.394127] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
[3458886.201840] BTRFS info (device sda7): qgroup scan completed (inconsistency flag cleared)
provoday0:/home/admin #
I tried to reproduce this issue and have some interesting findings: https://github.com/w13915984028/harvester-develop-summary/issues/1
The VIP shows as expired in the KVM DHCP lease list.
The VIP is allocated from KVM using the MAC 6a:10:d2:c2:6a:3b of vip-7e465ab1@harvester-mgmt, but the address itself is attached to harvester-mgmt.
It looks like the VIP is last re-leased from the DHCP server when Harvester enters the "ready" state; after that, no renewal happens.
Which module is responsible for renewing the VIP? If it tries to renew the VIP (IPv4) on dev vip-7e465ab1@harvester-mgmt, it will fail, because the VIP is attached to dev harvester-mgmt.
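The suspected failure mode can be stated as a simple invariant check; a sketch with both MACs copied from this issue:

```shell
# The DHCP lease for the VIP is tied to the MAC of vip-7e465ab1@harvester-mgmt,
# while the VIP address itself sits on harvester-mgmt. If the two MACs differ,
# a renewal attempt keyed to the lease MAC may never see the address.
VIP_LEASE_MAC="6a:10:d2:c2:6a:3b"   # MAC the VIP was leased against
VIP_HOLDER_MAC="52:54:00:90:e2:26"  # MAC of harvester-mgmt, which holds the VIP
if [ "$VIP_LEASE_MAC" != "$VIP_HOLDER_MAC" ]; then
  echo "VIP lease MAC and VIP holder MAC differ: renewal may target the wrong device"
fi
```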
Update: in this environment, when the VIP is set as a static IP address, the issue is not encountered.
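For reference, a static VIP can be requested at install time. A hedged sketch of the relevant install-config fragment, assuming the install.vip and install.vip_mode keys of the Harvester configuration (verify against the docs for your version):

```yaml
# Hypothetical Harvester install-config fragment: pin the VIP statically so
# no DHCP renewal is needed. Keys assumed from the Harvester configuration docs.
install:
  vip: 192.168.122.200
  vip_mode: static
```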
Another related: https://github.com/harvester/harvester/issues/1681
Tested with the Harvester master-head ISO: the issue is not encountered.
Closing the issue now.
Describe the bug
Hardware: Provo DAY0, single-rack HP server
Host OS: SUSE SLE15 SP2, KVM
VM1: Ubuntu server; VM2: Harvester main node; both use NAT to access the internet
I installed Harvester in VM2 (single node). After 2 or 3 days, VM2 is no longer accessible via the VIP or the node IP, and pings from the host OS fail, unless VM2 is shut down and restarted. Every few days, VM2 loses its network. VM1 has never encountered the network issue.
This issue is encountered frequently in this specific environment.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Harvester should keep running, and its network should remain reachable.
Support bundle
Environment:
Additional context