kubewharf / katalyst-core

Katalyst aims to provide a universal solution to help improve resource utilization and optimize overall costs in the cloud. This repository contains the core components of the Katalyst system, including multiple agents and centralized components.
Apache License 2.0

Enhanced K8s worker node network error and weird automatic restart #616

Open ozline opened 3 weeks ago

ozline commented 3 weeks ago

What happened?

I followed this documentation to install the dev k8s environment.

My master node came up and runs fine, but the worker nodes are not working because of cni plugin not initialized.

Extracted from kubectl describe node debian-node-2:

Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 10 Jun 2024 15:46:00 +0800   Mon, 10 Jun 2024 15:46:00 +0800   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Mon, 10 Jun 2024 15:45:56 +0800   Mon, 10 Jun 2024 15:45:50 +0800   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
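
To double-check the CNI side on the worker node itself, something like the following should show whether containerd actually has a CNI config loaded (this assumes the default /etc/cni/net.d location and that crictl is pointed at containerd's socket):

ls /etc/cni/net.d/                                             # conflist files the canal pod is supposed to drop here
crictl info | grep -i -A 2 networkready                        # containerd's CRI view of the network plugin
journalctl -u containerd --since "1 hour ago" | grep -i cni    # recent CNI-related messages from containerd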

Logs extracted from kubectl logs -n kube-system canal-ftzlb (on debian-node-2):

2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=Get "https://172.23.192.1:443/api/v1/namespaces?limit=500": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https://172.23.192.1:443/api/v1/nodes?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" error=Get "https://172.23.192.1:443/api/v1/pods?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=Get "https://172.23.192.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.323 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=Get "https://172.23.192.1:443/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.325 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/hostendpoints?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.560 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/bgpconfigurations"
2024-06-10 08:58:51.587 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/globalnetworksets" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/globalnetworksets?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.731 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networkpolicies" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:51.900 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/clusterinformations"
2024-06-10 08:58:51.926 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/ippools"
2024-06-10 08:58:51.947 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/globalnetworkpolicies" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/globalnetworkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.160 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/felixconfigurations"
2024-06-10 08:58:52.201 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/networksets" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/networksets?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.323 [INFO][63] status-reporter/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/profiles"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesservice"
2024-06-10 08:58:52.324 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2024-06-10 08:58:52.326 [INFO][61] felix/watchercache.go 181: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/hostendpoints"
2024-06-10 08:58:52.454 [INFO][63] status-reporter/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/caliconodestatuses" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/caliconodestatuses?limit=500&resourceVersion=23696&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/profiles" error=Get "https://172.23.192.1:443/api/v1/namespaces?limit=500": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/nodes" error=Get "https://172.23.192.1:443/api/v1/nodes?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesservice" error=Get "https://172.23.192.1:443/api/v1/services?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/workloadendpoints" error=Get "https://172.23.192.1:443/api/v1/pods?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=Get "https://172.23.192.1:443/apis/discovery.k8s.io/v1/endpointslices?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesnetworkpolicies" error=Get "https://172.23.192.1:443/apis/networking.k8s.io/v1/networkpolicies?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused
2024-06-10 08:58:52.454 [INFO][61] felix/watchercache.go 194: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/bgpconfigurations" error=Get "https://172.23.192.1:443/apis/crd.projectcalico.org/v1/bgpconfigurations?limit=500&resourceVersion=0&resourceVersionMatch=NotOlderThan": dial tcp 172.23.192.1:443: connect: connection refused

I checked the logs, and it seems the connections to 172.23.192.1:443 were refused. But when I checked iptables, it showed:

root@debian-node-2:~/deploy# sudo iptables-save | grep 172.23.192.1
-A KUBE-SERVICES -d 172.23.192.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-SVC-NPX46M4PTMTKRN6Y
-A KUBE-SERVICES -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-SVC-ERIFXISQEP7F7OF4
-A KUBE-SERVICES -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-SVC-JD5MR3NA4I4DYORP
-A KUBE-SERVICES -d 172.23.192.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-SVC-TCOU7JCQXEZGVUNU
-A KUBE-SVC-ERIFXISQEP7F7OF4 ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:dns-tcp cluster IP" -m tcp --dport 53 -j KUBE-MARK-MASQ
-A KUBE-SVC-JD5MR3NA4I4DYORP ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p tcp -m comment --comment "kube-system/kube-dns:metrics cluster IP" -m tcp --dport 9153 -j KUBE-MARK-MASQ
-A KUBE-SVC-NPX46M4PTMTKRN6Y ! -s 172.28.208.0/20 -d 172.23.192.1/32 -p tcp -m comment --comment "default/kubernetes:https cluster IP" -m tcp --dport 443 -j KUBE-MARK-MASQ
-A KUBE-SVC-TCOU7JCQXEZGVUNU ! -s 172.28.208.0/20 -d 172.23.192.10/32 -p udp -m comment --comment "kube-system/kube-dns:dns cluster IP" -m udp --dport 53 -j KUBE-MARK-MASQ
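
The Service chain for 172.23.192.1 looks present, so the next thing worth checking is where that VIP actually DNATs to. Roughly (the KUBE-SEP hash suffixes and the apiserver port will differ per cluster):

kubectl get endpoints kubernetes                         # run from the master: the real apiserver IP:port behind the 172.23.192.1 VIP
sudo iptables-save -t nat | grep KUBE-SEP | grep 6443    # kube-proxy's DNAT rules for that endpoint (6443 is the default apiserver port)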

I also tried using curl and telnet to connect to 172.23.192.1:443 from debian-node-2, and both worked.

So that part seems to be working fine. I have been debugging this for days, but still have no luck.

I also tried reinstalling several times, but the problem reappeared every time.

Finally, I solved it with systemctl restart containerd.service on debian-node-2 (the worker node).

Although I worked around the problem, I still have a big question: why does simply restarting containerd fix it?
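
If it helps anyone hitting the same thing: the effect of the restart should be visible in containerd's CRI status, since kubelet only marks the node Ready once the runtime reports NetworkReady=true again. Roughly (assuming crictl talks to containerd's socket):

crictl info | grep -i -A 2 networkready    # before the restart: NetworkReady=false
systemctl restart containerd.service       # the CRI plugin reloads the CNI config from /etc/cni/net.d on startup
crictl info | grep -i -A 2 networkready    # afterwards it flips to true and the node goes Ready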


At the same time, my master node (i.e., debian-node-1) would randomly reboot, which was very confusing to me.

It usually showed up as something like client_loop: send disconnect: Broken pipe. I initially thought it was a problem with the SSH connection, but when I left the server alone overnight and connected again the next day, it had rebooted on its own again.

There were no memory issues; memory usage was not even high. journalctl -xb -p err did not indicate any problems either.
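
For reference, the reboot history itself can be confirmed with something like the commands below; the tail of the previous boot's journal is probably the most useful place to look for what happened right before the reset:

last -x reboot shutdown | head      # reboot/shutdown records from wtmp
journalctl --list-boots             # boot IDs with their time ranges
journalctl -b -1 -e                 # the end of the previous boot's journal, i.e. what happened right before the reset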

I had previously installed vanilla Kubernetes on debian-node-1, and also a Kind-based k8s cluster, and there was no restart problem then. I did a full system reset before installing kubewharf-enhanced-k8s.

debian-node-2 (the worker node) has the same problem.

Below are the neofetch summaries for debian-node-1 and debian-node-2. The two machines are connected to the same router, which has a bypass route at 192.168.2.201 set up to handle the proxy.

root@debian-node-1:~# neofetch
       _,met$$$$$gg.          root@debian-node-1
    ,g$$$$$$$$$$$$$$$P.       ------------------
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) x86_64
 ,$$P'              `$$$.     Host: UM480XT
',$$P       ,ggs.     `$$b:   Kernel: 6.1.0-21-amd64
`d$$'     ,$P"'   .    $$$    Uptime: 1 hour
 $$P      d$'     ,    $$P    Packages: 573 (dpkg)
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15
 $$;      Y$b._   _,d$P'      CPU: AMD Ryzen 7 4800H with Radeon Graphics (16) @ 2.900GHz
 Y$$.    `.`"Y$$$$P"'         GPU: AMD ATI 04:00.0 Renoir
 `$$b      "-.__              Memory: 1494MiB / 31529MiB
  `Y$$
   `Y$$.
     `$$b.
       `Y$$b.
          `"Y$b._
              `"""
root@debian-node-2:~/deploy# neofetch
       _,met$$$$$gg.          root@debian-node-2
    ,g$$$$$$$$$$$$$$$P.       ------------------
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) x86_64
 ,$$P'              `$$$.     Host: UM480XT
',$$P       ,ggs.     `$$b:   Kernel: 6.1.0-21-amd64
`d$$'     ,$P"'   .    $$$    Uptime: 2 hours, 49 mins
 $$P      d$'     ,    $$P    Packages: 517 (dpkg)
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15
 $$;      Y$b._   _,d$P'      CPU: AMD Ryzen 7 4800H with Radeon Graphics (16) @ 2.900GHz
 Y$$.    `.`"Y$$$$P"'         GPU: AMD ATI 04:00.0 Renoir
 `$$b      "-.__              Memory: 703MiB / 15425MiB
  `Y$$
   `Y$$.
     `$$b.
       `Y$$b.
          `"Y$b._
              `"""

What did you expect to happen?

Worker nodes should run normally without needing systemctl restart containerd.service.

The master node should not restart automatically.

How can we reproduce it (as minimally and precisely as possible)?

Follow this documentation (the same one linked above).

Software version

debian-node-1:

root@debian-node-1:~# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}

debian-node-2:

root@debian-node-2:~/deploy# kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:56:31Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.6-kubewharf.8", GitCommit:"443c2773bbac8eeb5648f22f2b262d05e985595c", GitTreeState:"clean", BuildDate:"2024-01-04T03:51:02Z", GoVersion:"go1.18.6", Compiler:"gc", Platform:"linux/amd64"}