projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Network won't initialize on fresh install #8694

Closed igolka97 closed 2 months ago

igolka97 commented 3 months ago

When the calico-node pod starts on a fresh install, the node gets stuck in the NetworkReady=false state.

Expected Behavior

When the calico-node pod starts, the node becomes Ready.

Current Behavior

I initialized a new single-node Kubernetes cluster using kubeadm and installed the tigera-operator as well as Calico by following this guide.

When the calico-node pod started, the node was stuck in the NetworkReady=false state.

Possible Solution

After several attempts to find a solution, I restarted the containerd service, and then everything started working.
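For reference, this is the restart that unblocked the node (containerd is managed by systemd on this host):

root@node-1:~# systemctl restart containerd
root@node-1:~# kubectl get nodes -w   # the node goes Ready shortly after the restart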

I got exactly the same behavior when I completely reset the node and initialized it again, and I get the same result every time.

Context

I'm trying to start up a cluster on on-premises infrastructure with automation tools.

Your Environment

root@node-1:~# kubectl get pod -A
NAMESPACE         NAME                                       READY   STATUS              RESTARTS      AGE
calico-system     calico-kube-controllers-5fd7f74c8d-8smqc   0/1     Pending             0             77m
calico-system     calico-node-vqkr5                          1/1     Running             0             77m
calico-system     calico-typha-5bb76c895c-d58sd              1/1     Running             0             77m
calico-system     csi-node-driver-rwhn6                      0/2     ContainerCreating   0             77m
kube-system       coredns-76f75df574-khvnm                   0/1     Pending             0             93m
kube-system       coredns-76f75df574-twgsw                   0/1     Pending             0             93m
kube-system       etcd-node-1                                1/1     Running             41            94m
kube-system       kube-apiserver-node-1                      1/1     Running             39            94m
kube-system       kube-controller-manager-node-1             1/1     Running             47            94m
kube-system       kube-proxy-xdx5l                           1/1     Running             0             93m
kube-system       kube-scheduler-node-1                      1/1     Running             46            94m
tigera-operator   tigera-operator-6bfc79cb9c-mgz58           1/1     Running             0             77m

Please let me know if I need to provide any additional info. By the way, I tried to reproduce this situation in minikube with Kubernetes 1.28, and everything works fine there.

cyclinder commented 2 months ago

Could you describe the events of calico-kube-controllers-5fd7f74c8d-8smqc?

igolka97 commented 2 months ago

Thanks for your reply. The output is below:

root@node-1:~# kubectl describe pod -n calico-system calico-kube-controllers-5fd7f74c8d-8smqc
Name:                 calico-kube-controllers-5fd7f74c8d-8smqc
Namespace:            calico-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      calico-kube-controllers
Node:                 <none>
Labels:               app.kubernetes.io/name=calico-kube-controllers
                      k8s-app=calico-kube-controllers
                      pod-template-hash=5fd7f74c8d
Annotations:          hash.operator.tigera.io/system: fdde45054a8ae4f629960ce37570929502e59449
                      tigera-operator.hash.operator.tigera.io/tigera-ca-private: 29444b4059d0cf3605da1bc4d3d0d5ee97cbbbce
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/calico-kube-controllers-5fd7f74c8d
Containers:
  calico-kube-controllers:
    Image:           docker.io/calico/kube-controllers:v3.27.3
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Liveness:        exec [/usr/bin/check-status -l] delay=10s timeout=10s period=60s #success=1 #failure=6
    Readiness:       exec [/usr/bin/check-status -r] delay=0s timeout=10s period=30s #success=1 #failure=3
    Environment:
      KUBE_CONTROLLERS_CONFIG_NAME:  default
      DATASTORE_TYPE:                kubernetes
      ENABLED_CONTROLLERS:           node
      FIPS_MODE_ENABLED:             false
      KUBERNETES_SERVICE_HOST:       10.96.0.1
      KUBERNETES_SERVICE_PORT:       443
      CA_CRT_PATH:                   /etc/pki/tls/certs/tigera-ca-bundle.crt
    Mounts:
      /etc/pki/tls/cert.pem from tigera-ca-bundle (ro,path="ca-bundle.crt")
      /etc/pki/tls/certs from tigera-ca-bundle (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nhslt (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  tigera-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tigera-ca-bundle
    Optional:  false
  kube-api-access-nhslt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 CriticalAddonsOnly op=Exists
                             node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                      From               Message
  ----     ------            ----                     ----               -------
  Warning  FailedScheduling  3m3s (x1121 over 3d21h)  default-scheduler  0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
cyclinder commented 2 months ago

It looks like none of your k8s nodes are ready. Could you run the following commands to collect more info?

kubectl describe nodes 
journalctl -u kubelet 
igolka97 commented 2 months ago

As I already wrote, I am trying to set up a single-node cluster, and yes, it stays in the NetworkReady=false state until containerd is restarted.

root@node-1:~# kubectl describe nodes 
Name:               node-1
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=node-1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.0.29/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.244.84.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 05 Apr 2024 00:03:14 +0300
Taints:             node-role.kubernetes.io/control-plane:NoSchedule
                    node.kubernetes.io/not-ready:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  node-1
  AcquireTime:     <unset>
  RenewTime:       Wed, 10 Apr 2024 02:37:11 +0300
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 05 Apr 2024 00:20:17 +0300   Fri, 05 Apr 2024 00:20:17 +0300   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 10 Apr 2024 02:35:10 +0300   Fri, 05 Apr 2024 00:03:13 +0300   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 10 Apr 2024 02:35:10 +0300   Fri, 05 Apr 2024 00:03:13 +0300   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 10 Apr 2024 02:35:10 +0300   Fri, 05 Apr 2024 00:03:13 +0300   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                False   Wed, 10 Apr 2024 02:35:10 +0300   Fri, 05 Apr 2024 00:03:13 +0300   KubeletNotReady              container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
Addresses:
  InternalIP:  192.168.0.29
  Hostname:    node-1
Capacity:
  cpu:                8
  ephemeral-storage:  103107780Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12243472Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  95024129891
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12141072Ki
  pods:               110
System Info:
  Machine ID:                 50d84afc3dd943f4b7fa5e195a474836
  System UUID:                3d922ae8-4f97-4da8-b53a-9645d40f7423
  Boot ID:                    620b6978-3e90-4ce4-aa38-99740a5efb06
  Kernel Version:             5.15.0-101-generic
  OS Image:                   Ubuntu 22.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.28
  Kubelet Version:            v1.29.3
  Kube-Proxy Version:         v1.29.3
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
Non-terminated Pods:          (9 in total)
  Namespace                   Name                                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                ------------  ----------  ---------------  -------------  ---
  calico-system               calico-node-vqkr5                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d2h
  calico-system               calico-typha-5bb76c895c-d58sd       0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d2h
  calico-system               csi-node-driver-rwhn6               0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d2h
  kube-system                 etcd-node-1                         100m (1%)     0 (0%)      100Mi (0%)       0 (0%)         5d2h
  kube-system                 kube-apiserver-node-1               250m (3%)     0 (0%)      0 (0%)           0 (0%)         5d2h
  kube-system                 kube-controller-manager-node-1      200m (2%)     0 (0%)      0 (0%)           0 (0%)         5d2h
  kube-system                 kube-proxy-xdx5l                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d2h
  kube-system                 kube-scheduler-node-1               100m (1%)     0 (0%)      0 (0%)           0 (0%)         5d2h
  tigera-operator             tigera-operator-6bfc79cb9c-mgz58    0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d2h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                650m (8%)   0 (0%)
  memory             100Mi (0%)  0 (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events:              <none>

Last rows; the rest are the same:

journalctl -u kubelet
Apr 10 02:42:07 node-1 kubelet[288664]: E0410 02:42:07.320368  288664 kubelet.go:2892] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
Apr 10 02:42:07 node-1 kubelet[288664]: E0410 02:42:07.402872  288664 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="calico-system/csi-node-driver-rwhn6" podUID="f4698a19-4b45-4116-af82-210094037ee2"
Apr 10 02:42:09 node-1 kubelet[288664]: E0410 02:42:09.403291  288664 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="calico-system/csi-node-driver-rwhn6" podUID="f4698a19-4b45-4116-af82-210094037ee2"
Apr 10 02:42:11 node-1 kubelet[288664]: E0410 02:42:11.403114  288664 pod_workers.go:1298] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="calico-system/csi-node-driver-rwhn6" podUID="f4698a19-4b45-4116-af82-210094037ee2"
cyclinder commented 2 months ago

I think this issue is not related to Calico; it looks more like a containerd issue. containerd can't find the CNI dir, so you should restart containerd.

igolka97 commented 2 months ago

@cyclinder how did you figure that out? I will try to dig deeper into this.

cyclinder commented 2 months ago

I've hit this issue before, and I think it's not related to Calico. I found that containerd's logs report "No CNI conf file found" even though Calico is already writing its CNI files normally to /etc/cni/net.d. After I restarted containerd, everything was fine, so I suspect that containerd can't dynamically discover files in the CNI directory, but that's just a guess.
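If you want to double-check on your side, the symptom should be visible with something like this (assuming the default containerd and CNI paths):

journalctl -u containerd --no-pager | grep -i cni
ls -l /etc/cni/net.d/

The first should show the "No CNI conf file found" errors, and the second should show that Calico's conflist (typically 10-calico.conflist) is actually there.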

igolka97 commented 2 months ago

In my last attempt, I didn't delete the net.d folder after kubeadm reset. I also restarted containerd just in case before initializing the new cluster.

My conclusion is that if the net.d folder is deleted and recreated, the containerd process loses track of it until containerd is restarted.
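A rough way to verify this on a throwaway node (default paths, root shell; not something to run on a cluster you care about):

root@node-1:~# systemctl restart containerd                        # start from a clean state
root@node-1:~# cp -r /etc/cni/net.d /root/net.d.bak                # keep a copy of the CNI config
root@node-1:~# rm -rf /etc/cni/net.d && mkdir /etc/cni/net.d       # simulate the reset recreating the directory
root@node-1:~# cp /root/net.d.bak/* /etc/cni/net.d/                # put the config back into the new directory
root@node-1:~# journalctl -u containerd --no-pager | grep -i cni   # see whether containerd notices the restored config

If containerd keeps logging "No CNI conf file found" until it is restarted, that would match what I observed.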

I wanted to understand this issue in order to get a better picture of what happens behind the scenes in the k8s system. I hope my conclusion is correct.

Thank you in any case; I would be grateful for any comments. Perhaps this observation can help someone else.