rancher / rke2

https://docs.rke2.io/
Apache License 2.0

rke2-coredns stuck on containercreating, failed to create pod sandbox #5905

Closed R00tedSec closed 5 months ago

R00tedSec commented 6 months ago

Environmental Info: RKE2 Version: rke2 version v1.28.9+rke2r1 (07bf87f9118c1386fa73f660142cc28b5bef1886) go version go1.21.9 X:boringcrypto

Node(s) CPU architecture, OS, and Version: Linux rke2-lab-master-1 6.1.0-20-cloud-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux

Cluster Configuration:

HTTP_PROXY= <proxy>
HTTPS_PROXY=<proxy> 
NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,localhost
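
The proxy variables are set in the rke2 environment file, roughly like this per the RKE2 docs (<proxy> stands in for the real proxy URL):

# /etc/default/rke2-server on servers, /etc/default/rke2-agent on agents
HTTP_PROXY=<proxy>
HTTPS_PROXY=<proxy>
NO_PROXY=127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16,.svc,.cluster.local,localhost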

Describe the bug: After successfully creating an RKE2 cluster, CoreDNS pods fail to start on the agent nodes.

The CoreDNS pods on the agent nodes are stuck in ContainerCreating with the following error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5580fe9275de080781b229bcb82c0b7dd05af5e2d9b5366af2c978f94ec842d": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost

Steps To Reproduce:

Expected behavior: CoreDNS containers should start as expected and provide DNS service to the pods on the agent nodes.

Actual behavior: CoreDNS pods on the agent nodes remain stuck in the "ContainerCreating" state, with the error message mentioned above.

Additional context / logs: No further logs were found. Both Hardened-Calico and Hardened-Flannel containers appear to be functioning correctly.

manuelbuil commented 6 months ago

Kubelet is complaining because it can't find the calico binary. Which CNI did you choose? Can you check whether the CNI agent pod is running on that node?
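
Something like this should show it (the k8s-app=canal label is an assumption for the Canal chart; adjust to whatever labels your CNI pods carry):

# List the CNI agent pods and the nodes they run on
kubectl -n kube-system get pods -l k8s-app=canal -o wide

# Or list everything scheduled on the affected node
kubectl -n kube-system get pods -o wide --field-selector spec.nodeName=rke2-lab-worker-2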

R00tedSec commented 6 months ago

Sure, the pods are running correctly

NAME                                                   READY   STATUS              RESTARTS      AGE     IP             NODE                    NOMINATED NODE   READINESS GATES
rke2-canal-476ws                                       2/2     Running             0             58m     172.27.1.101   rke2-lab-worker-1   <none>           <none>
rke2-canal-g46sf                                       2/2     Running             0             24h     172.27.1.100   rke2-lab-master-1   <none>           <none>
rke2-canal-nm2jh                                       2/2     Running             0             58m     172.27.1.102   rke2-lab-worker-2   <none>           <none>
rke2-coredns-rke2-coredns-75c8f68666-rz67j             0/1     ContainerCreating   0             52m     <none>         rke2-lab-worker-2   <none>           <none>
rke2-coredns-rke2-coredns-75c8f68666-zxkmt             0/1     ContainerCreating   0             52m     <none>         rke2-lab-worker-1   <none>           <none>
rke2-coredns-rke2-coredns-7bdc89bfd7-889qb             1/1     Running             0             6h2m    10.42.0.102    rke2-lab-master-1   <none>           <none>

I've also checked the logs; in rke2-canal-nm2jh nothing seems to be failing. I can share the full logs, but this is what appears as WARNING:

2024-05-09 14:51:29.611 [WARNING][1] cni-installer/<nil> <nil>: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-05-09 14:51:29.624 [WARNING][1] cni-installer/<nil> <nil>: Failed to remove 10-calico.conflist error=remove /host/etc/cni/net.d/10-calico.conflist: no such file or directory
2024-05-09 14:51:31.819 [WARNING][9] startup/winutils.go 144: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-05-09 14:51:32.971 [WARNING][47] cni-config-monitor/winutils.go 144: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-05-09 14:51:32.976 [WARNING][47] cni-config-monitor/winutils.go 144: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-05-09 14:51:33.075 [WARNING][48] felix/int_dataplane.go 553: Failed to cleanup preexisting XDP state error=cannot find XDP object "/usr/lib/calico/bpf/filter.o"

These are the CNI folders on the worker nodes:

system@rke2-lab-worker-2:/etc# tree /opt/cni/
/opt/cni/
└── bin
    ├── bandwidth
    ├── bridge
    ├── calico
    ├── calico-ipam
    ├── dhcp
    ├── dummy
    ├── firewall
    ├── flannel
    ├── host-device
    ├── host-local
    ├── ipvlan
    ├── loopback
    ├── macvlan
    ├── portmap
    ├── ptp
    ├── sbr
    ├── static
    ├── tap
    ├── tuning
    ├── vlan
    └── vrf

2 directories, 21 files
system@rke2-lab-worker-2:/etc# tree /etc/cni/
/etc/cni/
└── net.d
    ├── 10-canal.conflist
    └── calico-kubeconfig

2 directories, 2 files
R00tedSec commented 6 months ago

Events

  Type     Reason                  Age                      From               Message
  ----     ------                  ----                     ----               -------
  Warning  FailedScheduling        14m                      default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules..
  Warning  FailedScheduling        9m22s (x6 over 14m)      default-scheduler  0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 3 node(s) didn't match pod anti-affinity rules..
  Normal   Scheduled               9m3s                     default-scheduler  Successfully assigned kube-system/rke2-coredns-rke2-coredns-75c8f68666-gbp7m to rke2-soc-lab-worker-2
  Warning  FailedCreatePodSandBox  8m16s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3403f13338157694d8bb11315b5418d059da1261acf117a6f1b969ae38968022": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  7m30s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "806b36f566ef24e206c8136ea871f713230d28da94fc6c221e8f576806fc9738": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  6m44s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "19d02e29bbc4022c4cf9c28e68cf69695f1b2e2411a9ba696a438e38740f8f9e": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  5m58s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "841d8216b7a514a1285098f2ee9fbb1326ef3456c858fd3a8149601daa55a1f2": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  5m12s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "1c3f063f83e0364a54f1003b2e4243d412b383928efd5377a0900eee4f9bf9c1": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  4m26s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "b1a2e3303d0a9d22b6e31e00832cc1ef5be7e5eeca89018cd1274b19b9634aee": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  3m40s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "266b2da113156e25ae01b2d6659e5d95e4148c8b3d50c4c6d5d21cb238f70f99": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  2m54s                    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "a99eecf2e365b73fc866396d3cf10c79e9c18265987339e89e4d9f5cb23ed43f": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  2m8s                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4c7a219cbacc562c1563b7cb5a40a777f3bf06d082bfb16ef2ba2761487157f8": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
  Warning  FailedCreatePodSandBox  <invalid> (x6 over 82s)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "25a4dd055148ae6e0580e8f37506dc6415b69ef1d5f7af13ac72d4965ea670a6": plugin type="calico" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost
manuelbuil commented 6 months ago

Sorry, I read the issue too quickly. It can indeed find the calico binary, but calico is throwing the error "unexpected error when reading response body. Please retry. Original error: http2: client connection lost".

brandond commented 6 months ago

Make sure that you've disabled any host firewalls (firewalld/ufw) or other endpoint protection products. It sounds like something is blocking the connection between calico and the apiserver.
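
A quick way to check on each node (service names assumed; adjust for whatever is actually installed):

# Check whether common host firewalls are active
systemctl is-active firewalld 2>/dev/null
systemctl is-active ufw 2>/dev/null
ufw status 2>/dev/null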

R00tedSec commented 6 months ago

There are no host firewalls or other forms of endpoint protection on the host. However, I discovered a related issue involving the RHEL cloud provider and NetworkManager. Given that I'm using the Debian cloud-init image with NetworkManager, could something related to this be causing interference?

I've already applied the proposed workaround, and it still doesn't work.

NetworkManager Known Issue
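
For reference, the workaround from that page essentially tells NetworkManager to ignore the Calico/Flannel interfaces, roughly like this (check the current RKE2 docs for the exact contents):

# /etc/NetworkManager/conf.d/rke2-canal.conf
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:flannel*

# then reload NetworkManager
systemctl reload NetworkManager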

brandond commented 6 months ago

Check containerd.log to see if there are any additional error messages. Also, can you confirm that kube-proxy is running on this node?

R00tedSec commented 6 months ago

Kube-proxy is up and running

kubectl get pods -A | grep kube-proxy 

kube-system                       kube-proxy-rke-lab-master-1                        1/1     Running             0             36m
kube-system                       kube-proxy-rke-lab-worker-1                        1/1     Running             0             34m
kube-system                       kube-proxy-rke-lab-worker-2                        1/1     Running             0             34m

There isn't much more info in containerd.log:

cat /var/lib/rancher/rke2/agent/containerd/containerd.log  | grep rke2-coredns-rke2-coredns-84b9cb946c-69rb2

time="2024-05-13T08:17:17.696520957Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,}"
time="2024-05-13T08:18:02.854158772Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,} failed, error" error="failed to setup network for sandbox \"bace4a8a62b1bbcce5cae41696ced3f5d0ad952f5a5da5e3606c6fb1552dbf49\": plugin type=\"calico\" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost"
time="2024-05-13T08:18:03.174299321Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,}"
time="2024-05-13T08:18:48.326710775Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,} failed, error" error="failed to setup network for sandbox \"09daf9489e7ddfb880efd83c67cd75649c017e21b309c957fcc6092568928c3c\": plugin type=\"calico\" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost"
time="2024-05-13T08:18:49.281318025Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,}"
time="2024-05-13T08:19:34.449113980Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,} failed, error" error="failed to setup network for sandbox \"e559cd966f4f22174dc54bdef6839ac13a33d50deb3a5fc881debdcc2467a653\": plugin type=\"calico\" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost"
time="2024-05-13T08:19:35.373561400Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,}"
time="2024-05-13T08:20:20.537206475Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,} failed, error" error="failed to setup network for sandbox \"8592e36dcc717f193f5636a8d2aa02ee0dd358c0d4cf49c3c3a7ed65a18475f9\": plugin type=\"calico\" failed (add): unexpected error when reading response body. Please retry. Original error: http2: client connection lost"
time="2024-05-13T08:20:21.467084624Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:rke2-coredns-rke2-coredns-84b9cb946c-69rb2,Uid:70559555-eed4-437f-a8c9-91629a5fe412,Namespace:kube-system,Attempt:0,}"
R00tedSec commented 6 months ago

kube-proxy also indicates a communication failure:

kubectl logs kube-proxy-rke-lab-worker-2 -n kube-system
I0513 07:52:58.875361       1 node.go:141] Successfully retrieved node IP: 172.27.1.102
I0513 07:52:58.893030       1 server.go:632] "kube-proxy running in dual-stack mode" primary ipFamily="IPv4"
I0513 07:52:58.893814       1 server_others.go:152] "Using iptables Proxier"
I0513 07:52:58.893832       1 server_others.go:421] "Detect-local-mode set to ClusterCIDR, but no cluster CIDR for family" ipFamily="IPv6"
I0513 07:52:58.893836       1 server_others.go:438] "Defaulting to no-op detect-local"
I0513 07:52:58.893854       1 proxier.go:250] "Setting route_localnet=1 to allow node-ports on localhost; to change this either disable iptables.localhostNodePorts (--iptables-localhost-nodeports) or set nodePortAddresses (--nodeport-addresses) to filter loopback addresses"
I0513 07:52:58.893991       1 server.go:846] "Version info" version="v1.28.9+rke2r1"
I0513 07:52:58.894000       1 server.go:848] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0513 07:52:58.894455       1 config.go:97] "Starting endpoint slice config controller"
I0513 07:52:58.894467       1 config.go:315] "Starting node config controller"
I0513 07:52:58.894474       1 shared_informer.go:311] Waiting for caches to sync for node config
I0513 07:52:58.894467       1 shared_informer.go:311] Waiting for caches to sync for endpoint slice config
I0513 07:52:58.894577       1 config.go:188] "Starting service config controller"
I0513 07:52:58.894583       1 shared_informer.go:311] Waiting for caches to sync for service config
I0513 07:52:58.994719       1 shared_informer.go:318] Caches are synced for endpoint slice config
I0513 07:52:58.994727       1 shared_informer.go:318] Caches are synced for service config
I0513 07:52:58.994753       1 shared_informer.go:318] Caches are synced for node config
W0513 07:54:59.982396       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 07:54:59.982440       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 07:54:59.982477       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:06:31.655352       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:06:31.655396       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:06:31.655412       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:13:41.050945       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:13:41.050944       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:13:41.050968       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:15:29.352391       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Node ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:15:29.352399       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.EndpointSlice ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
W0513 08:15:29.352397       1 reflector.go:458] k8s.io/client-go/informers/factory.go:150: watch of *v1.Service ended with: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
brandond commented 5 months ago

Something on your node is blocking communication. Please investigate what that might be.

burlyunixguy commented 5 months ago

I just went through something similar, and it turned out to be corrupted VXLAN packets (bad UDP checksums). In my case it was flannel, but I've seen other folks have the same issue with Calico. The symptoms pointed to the OS firewall or kube-proxy, but in the end those were actually fine. It was related to VMs running on VMware.

Check out these discussions:
https://github.com/projectcalico/calico/issues/3145 https://github.com/flannel-io/flannel/blob/master/Documentation/troubleshooting.md

The flannel troubleshooting doc describes how to make the change persistent via udev rules.
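
For Canal, the overlay interface is flannel.1 and flannel's default VXLAN port is 8472, so the workaround from that doc boils down to something like this (a sketch only; verify the interface name and ethtool path on your nodes):

# Confirm the corrupted packets first (run on a worker while a pod is being created)
tcpdump -i any -nn -vvv udp port 8472 2>/dev/null | grep -i 'bad udp cksum'

# One-off fix: disable TX checksum offload on the VXLAN interface
ethtool -K flannel.1 tx-checksum-ip-generic off

# Persist it across reboots with a udev rule, e.g. /etc/udev/rules.d/90-flannel.rules:
ACTION=="add", SUBSYSTEM=="net", KERNEL=="flannel.1", RUN+="/usr/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off"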

R00tedSec commented 5 months ago

Thanks for the info, @burlyunixguy. That seems to be the problem: this cluster runs in VMs deployed on an SDN VXLAN in Proxmox. This morning, though, I was able to spin up a cluster with SNAT enabled in Proxmox, and everything started okay.

I'm not sure if it's exactly related to the discussion you mentioned.

R00tedSec commented 5 months ago

It turns out the problem was the MTU setting of the VXLAN. You can check out more details here: MTU Considerations for VXLAN.

Sorry for the misunderstanding!
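
For anyone else hitting this: when the VMs themselves already sit on a VXLAN (as with Proxmox SDN), the outer encapsulation consumes roughly 50 bytes, so the Canal/Calico VXLAN inside the VMs needs a correspondingly smaller MTU. A quick sanity check between two nodes (the interface names and the 1400-byte size below are just examples):

# Compare the MTU of the node uplink with the overlay interface
ip link show eth0 | grep -o 'mtu [0-9]*'
ip link show flannel.1 | grep -o 'mtu [0-9]*'

# Ping another node with "don't fragment" set; if this fails while smaller
# payloads succeed, the overlay MTU is set too high for the underlay
ping -M do -s 1400 -c 3 172.27.1.100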