projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico pods crashing on EKS AL2023 (SELinux issue) #9321

Open Galphaa opened 2 days ago

Galphaa commented 2 days ago

I was switching my EKS nodes from Amazon Linux 2 to Amazon Linux 2023, and after migrating to the new AMI all of my Calico pods started crashing. Kubernetes version: 1.28 (EKS, AL2023); Calico: 3.26.4, installed via manifest.

Steps to Reproduce (for bugs)

Calico pod logs

Controller describe:

```
Containers:
  calico-kube-controllers:
    Port:
    Host Port:
    State:          Running
      Started:      Wed, 09 Oct 2024 18:09:13 +0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 09 Oct 2024 18:02:34 +0400
      Finished:     Wed, 09 Oct 2024 18:04:04 +0400
    Ready:          True
    Restart Count:  38
    Liveness:       exec [/usr/bin/check-status -l] delay=10s timeout=10s period=10s #success=1 #failure=6
    Readiness:      exec [/usr/bin/check-status -r] delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ENABLED_CONTROLLERS:  node
      DATASTORE_TYPE:       kubernetes
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-wh7sc (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-wh7sc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       BestEffort
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     CriticalAddonsOnly op=Exists
                 node-role.kubernetes.io/control-plane:NoSchedule
                 node-role.kubernetes.io/master:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                      From     Message
  ----     ------   ----                     ----     -------
  Normal   Killing  39m (x32 over 3h14m)     kubelet  Stopping container calico-kube-controllers
  Normal   Pulled   34m (x32 over 3h14m)     kubelet  Container image "ge.ecr.ge-west-1.amazonaws.com/calico-kube-controllers:v3.26.4" already present on machine
  Warning  BackOff  4m47s (x851 over 3h14m)  kubelet  Back-off restarting failed container calico-kube-controllers in pod calico-kube-controllers-8685c56787-4nfrm_kube-system(5260d52e-a763-4d3c-bb40-1a191b0e24d3)
```

calico-node pod describe:

```
    Host Port:
    Command:
      /opt/cni/bin/calico-ipam -upgrade
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 09 Oct 2024 18:04:07 +0400
      Finished:     Wed, 09 Oct 2024 18:04:07 +0400
    Ready:          True
    Restart Count:  7
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      KUBERNETES_NODE_NAME:       (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:  <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      FELIX_AWSSRCDSTCHECK:       Disable
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/lib/cni/networks from host-local-net-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zktff (ro)
  install-cni:
    Container ID:   containerd://01054a69f401c0fcc052477d2ac09b6b22c7c6ec2d9d9d246a6916c82dc6b453
    Port:
    Host Port:
    Command:
      /opt/cni/bin/install
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 09 Oct 2024 18:04:08 +0400
      Finished:     Wed, 09 Oct 2024 18:04:19 +0400
    Ready:          True
    Restart Count:  0
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      CNI_CONF_NAME:         10-calico.conflist
      CNI_NETWORK_CONFIG:    <set to the key 'cni_network_config' of config map 'calico-config'>  Optional: false
      KUBERNETES_NODE_NAME:  (v1:spec.nodeName)
      CNI_MTU:               <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      SLEEP:                 false
      FELIX_AWSSRCDSTCHECK:  Disable
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zktff (ro)
  mount-bpffs:
    Container ID:   containerd://0b459d688ee89be30210e9cfa3f1d5efafc8cb7d8c0fa02c99cdb4073f9a1f55
    Host Port:
    Command:
      calico-node -init -best-effort
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 09 Oct 2024 18:04:19 +0400
      Finished:     Wed, 09 Oct 2024 18:04:19 +0400
    Ready:          True
    Restart Count:  0
    Environment:
      FELIX_AWSSRCDSTCHECK:  Disable
    Mounts:
      /nodeproc from nodeproc (ro)
      /sys/fs from sys-fs (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zktff (ro)
Containers:
  calico-node:
    Host Port:
    State:          Running
      Started:      Wed, 09 Oct 2024 18:09:10 +0400
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 09 Oct 2024 17:59:10 +0400
      Finished:     Wed, 09 Oct 2024 18:04:06 +0400
    Ready:          True
    Restart Count:  39
    Requests:
      cpu:  250m
    Liveness:   exec [/bin/calico-node -felix-live] delay=10s timeout=10s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/calico-node -felix-ready] delay=0s timeout=10s period=10s #success=1 #failure=3
    Environment Variables from:
      kubernetes-services-endpoint  ConfigMap  Optional: true
    Environment:
      DATASTORE_TYPE:                     kubernetes
      WAIT_FOR_DATASTORE:                 true
      NODENAME:                           (v1:spec.nodeName)
      CALICO_NETWORKING_BACKEND:          <set to the key 'calico_backend' of config map 'calico-config'>  Optional: false
      CLUSTER_TYPE:                       k8s,bgp
      IP:                                 autodetect
      CALICO_IPV4POOL_IPIP:               Never
      CALICO_IPV4POOL_VXLAN:              CrossSubnet
      CALICO_IPV6POOL_VXLAN:              CrossSubnet
      FELIX_IPINIPMTU:                    <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_VXLANMTU:                     <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      FELIX_WIREGUARDMTU:                 <set to the key 'veth_mtu' of config map 'calico-config'>  Optional: false
      CALICO_DISABLE_FILE_LOGGING:        true
      FELIX_DEFAULTENDPOINTTOHOSTACTION:  ACCEPT
      FELIX_IPV6SUPPORT:                  false
      FELIX_HEALTHENABLED:                true
      FELIX_AWSSRCDSTCHECK:               Disable
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /sys/fs/bpf from bpffs (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (ro)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zktff (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  sys-fs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/
    HostPathType:  DirectoryOrCreate
  bpffs:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/bpf
    HostPathType:  Directory
  nodeproc:
    Type:          HostPath (bare host directory volume)
    Path:          /proc
    HostPathType:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:
  cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:
  host-local-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/cni/networks
    HostPathType:
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  kube-api-access-zktff:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     :NoSchedule op=Exists
                 :NoExecute op=Exists
                 CriticalAddonsOnly op=Exists
                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason          Age                      From     Message
  ----     ------          ----                     ----     -------
  Normal   SandboxChanged  49m (x29 over 3h15m)     kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal   Killing         6m50s (x39 over 3h15m)   kubelet  Stopping container calico-node
  Warning  BackOff         5m14s (x644 over 3h14m)  kubelet  Back-off restarting failed container calico-node in pod calico-node-4js87_kube-system(2c7fcb86-8d13-48ab-9f65-f78435955023)
```

calico-node pod full log:

```
2024-10-09 10:58:37.181 [INFO][77] felix/int_dataplane.go 1836: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"kube-system/calico-kube-controllers-8685c56787-4nfrm" endpoint_id:"eth0" > endpoint:<state:"active" name:"calia607cfeb82d" profile_ids:"kns.kube-system" profile_ids:"ksa.kube-system.calico-kube-controllers" ipv4_nets:"192.168.141.160/32" >
2024-10-09 10:58:37.181 [INFO][77] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.181 [INFO][77] felix/table.go 508: Queueing update of chain. chainName="cali-tw-calia607cfeb82d" ipVersion=0x4 table="filter"
2024-10-09 10:58:37.181 [INFO][77] felix/table.go 508: Queueing update of chain. chainName="cali-fw-calia607cfeb82d" ipVersion=0x4 table="filter"
2024-10-09 10:58:37.181 [INFO][77] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.182 [INFO][77] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="calia607cfeb82d"
2024-10-09 10:58:37.182 [INFO][77] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.182 [INFO][77] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.193 [INFO][77] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-10-09 10:58:37.193 [INFO][77] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"} status="up"
2024-10-09 10:58:37.193 [INFO][77] felix/summary.go 100: Summarising 26 dataplane reconciliation loops over 1m3s: avg=15ms longest=259ms (resync-filter-v4,resync-ipsets-v4,resync-mangle-v4,resync-nat-v4,resync-raw-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-routes-v4,resync-rules-v4,update-filter-v4,update-ipsets-4,update-mangle-v4,update-nat-v4,update-raw-v4)
2024-10-09 10:58:37.230 [INFO][77] felix/calc_graph.go 467: Local endpoint updated id=WorkloadEndpoint(node=ip-10-161-0-237.eu-west-1.compute.internal, orchestrator=k8s, workload=kube-system/calico-kube-controllers-8685c56787-4nfrm, name=eth0)
2024-10-09 10:58:37.230 [INFO][77] felix/int_dataplane.go 1836: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"kube-system/calico-kube-controllers-8685c56787-4nfrm" endpoint_id:"eth0" > endpoint:<state:"active" name:"calia607cfeb82d" profile_ids:"kns.kube-system" profile_ids:"ksa.kube-system.calico-kube-controllers" ipv4_nets:"192.168.141.160/32" >
2024-10-09 10:58:37.230 [INFO][77] felix/endpoint_mgr.go 602: Updating per-endpoint chains. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.230 [INFO][77] felix/table.go 508: Queueing update of chain. chainName="cali-tw-calia607cfeb82d" ipVersion=0x4 table="filter"
2024-10-09 10:58:37.230 [INFO][77] felix/table.go 508: Queueing update of chain. chainName="cali-fw-calia607cfeb82d" ipVersion=0x4 table="filter"
2024-10-09 10:58:37.230 [INFO][77] felix/endpoint_mgr.go 648: Updating endpoint routes. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.230 [INFO][77] felix/endpoint_mgr.go 1215: Applying /proc/sys configuration to interface. ifaceName="calia607cfeb82d"
2024-10-09 10:58:37.230 [INFO][77] felix/endpoint_mgr.go 490: Re-evaluated workload endpoint status adminUp=true failed=false known=true operUp=true status="up" workloadEndpointID=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.230 [INFO][77] felix/status_combiner.go 58: Storing endpoint status update ipVersion=0x4 status="up" workload=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"}
2024-10-09 10:58:37.240 [INFO][77] felix/status_combiner.go 81: Endpoint up for at least one IP version id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"} ipVersion=0x4 status="up"
2024-10-09 10:58:37.240 [INFO][77] felix/status_combiner.go 98: Reporting combined status. id=proto.WorkloadEndpointID{OrchestratorId:"k8s", WorkloadId:"kube-system/calico-kube-controllers-8685c56787-4nfrm", EndpointId:"eth0"} status="up"
2024-10-09 10:59:34.180 [INFO][84] monitor-addresses/autodetection_methods.go 103: Using autodetected IPv4 address on interface ens5: 10.161.0.237/26
2024-10-09 10:59:40.306 [INFO][77] felix/summary.go 100: Summarising 12 dataplane reconciliation loops over 1m3.1s: avg=5ms longest=19ms ()
```

Context

We had a Slack conversation about this: https://calicousers.slack.com/archives/CPEPQE8CS/p1728467954077839

Your Environment

lwr20 commented 1 day ago

As discussed in Slack, the crashes went away once SELinux was disabled on the nodes.
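For anyone hitting the same thing: a quick way to confirm the diagnosis is to check the node's SELinux mode (AL2023 ships with SELinux enabled, unlike AL2, which is why the AMI migration exposes this). A minimal sketch, run on the node itself via SSH/SSM (or from the cluster with `kubectl debug node/<node> -it --image=busybox -- chroot /host sh`):

```shell
#!/bin/sh
# Report the node's SELinux mode: Enforcing / Permissive / Disabled.
if command -v getenforce >/dev/null 2>&1; then
  mode=$(getenforce)
elif [ -r /sys/fs/selinux/enforce ]; then
  # selinuxfs is mounted but getenforce isn't installed: 1 = Enforcing, 0 = Permissive
  if [ "$(cat /sys/fs/selinux/enforce)" = "1" ]; then mode=Enforcing; else mode=Permissive; fi
else
  mode=Disabled   # no active SELinux on this kernel
fi
echo "SELinux mode: $mode"
# To relax enforcement immediately (lost on reboot): sudo setenforce 0
```

If this prints `Enforcing` on an affected node and `setenforce 0` stops the crash loop, you've reproduced the workaround described above.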

Here are the SELinux RPM instructions from the Calico Enterprise docs: https://docs.tigera.io/calico-enterprise/latest/getting-started/install-on-clusters/requirements
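If you need the workaround to survive reboots (`setenforce` does not), the mode is set in `/etc/selinux/config` on the node host. A sketch of a permissive setup, assuming the stock AL2023 file layout:

```
# /etc/selinux/config (host file on the node; takes effect after reboot)
# "permissive" logs denials without blocking; "disabled" turns SELinux off entirely.
SELINUX=permissive
SELINUXTYPE=targeted
```

Setting `permissive` rather than `disabled` keeps AVC denial logging, which is useful if you later want to write a proper policy instead of turning SELinux off.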