projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

calico-node fails to run on Kubernetes 1.29 (calico-typha error) #8453

Closed - rrsela closed this issue 3 months ago

rrsela commented 7 months ago

Expected Behavior

calico-node should run after a fresh tigera-operator on-prem deployment on Kubernetes 1.29

Current Behavior

calico-node fails to run due to a readiness probe error: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused

I can't run kubectl logs, but looking through crictl logs on the container I see the following:

2024-01-29 10:50:22.296 [INFO][199362] tunnel-ip-allocator/discovery.go 179: Creating Kubernetes client for Typha discovery...
2024-01-29 10:50:22.296 [INFO][199362] tunnel-ip-allocator/discovery.go 195: (Re)discovering Typha endpoints using the Kubernetes API...
2024-01-29 10:50:22.303 [ERROR][199362] tunnel-ip-allocator/discovery.go 235: Didn't find any ready Typha instances.
2024-01-29 10:50:22.303 [FATAL][199362] tunnel-ip-allocator/startsyncerclient.go 49: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port

Inside the calico-typha logs I see the warnings below after the readiness health check:

2024-01-29 08:26:17.901 [INFO][8] health.go 336: Overall health status changed: live=true ready=true
+----------------------------+---------+----------------+-----------------+--------+
|         COMPONENT          | TIMEOUT |    LIVENESS    |    READINESS    | DETAIL |
+----------------------------+---------+----------------+-----------------+--------+
| SyncServer                 | 20s     | reporting live | -               |        |
| bgp-cache                  | 20s     | reporting live | reporting ready |        |
| felix-cache                | 20s     | reporting live | reporting ready |        |
| node-status-cache          | 20s     | reporting live | reporting ready |        |
| tunnel-ip-allocation-cache | 20s     | reporting live | reporting ready |        |
+----------------------------+---------+----------------+-----------------+--------+
2024-01-29 08:26:48.559 [WARNING][8] rebalance.go 54: Failed to get number of Typhas error=<nil> numTyphas=0 thread="k8s-poll"
2024-01-29 08:27:18.599 [WARNING][8] rebalance.go 54: Failed to get number of Typhas error=<nil> numTyphas=0 thread="k8s-poll"

Normally I would expect to see something like sync_server.go 430: Accepted from 192.168.3.156:47314 port=5473 instead.
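
For context, the tunnel-ip-allocator fatal above ("Kubernetes service missing IP or port") means calico-node's Typha discovery found no ready endpoint behind the calico-typha Service. A quick way to check that from the cluster (commands added here for illustration, not part of the original report):

# The Service should exist and have at least one ready endpoint;
# with no ready Typha endpoints, calico-node's discovery fails as above.
kubectl get service -n calico-system calico-typha
kubectl get endpoints -n calico-system calico-typha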

Possible Solution

N/A

Steps to Reproduce (for bugs)

  1. Deploy on-prem EKS-D 1.29-3 (basically kubeadm-based Kubernetes 1.29.0)
  2. Deploy the tigera-operator
  3. Use the installation spec below to deploy Calico

Context

Trying to deploy Calico on an on-prem, kubeadm-based deployment of EKS-D 1.29 using the tigera-operator, per the documentation.

The same configuration works fine on EKS-D 1.28, and also when adding a 1.29 worker node to an existing 1.28 control plane which is already running Calico.

Worth mentioning: while the official documentation does not state that Calico supports Kubernetes 1.29, people seem to be using it in practice, and the documentation even references an issue with the nftables alpha feature of 1.29 (off by default).

Your Environment

MichalFupso commented 6 months ago

Hi @rrsela, do you have multiple network interfaces? You could configure IP auto-detection: https://docs.tigera.io/calico/latest/networking/ipam/ip-autodetection
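
For reference, this is roughly what that looks like in the operator's Installation resource (a sketch only - the autodetection method shown is an assumption; pick whichever fits the environment):

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      kubernetes: NodeInternalIP   # or, e.g., interface: ens3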

rrsela commented 6 months ago

Hi @MichalFupso - it doesn't seem to help; I tried kubernetes-internal-ip, but in any case I only have a single interface:

root@k8s-master-1:/home/ubuntu# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc fq_codel state UP group default qlen 1000
    link/ether fa:16:3e:5b:d9:16 brd ff:ff:ff:ff:ff:ff
    altname enp0s3
    inet 192.168.3.145/23 metric 100 brd 192.168.3.255 scope global dynamic ens3
       valid_lft 3455sec preferred_lft 3455sec
    inet6 fe80::f816:3eff:fe5b:d916/64 scope link
       valid_lft forever preferred_lft forever
caseydavenport commented 6 months ago

This sounds like Typha isn't reporting ready, and so calico/node isn't able to find a valid instance to connect to. Do your typha pods report ready in the k8s API?

rrsela commented 6 months ago

@caseydavenport Typha reports as ready (probably because the health check passes):

root@k8s-master-1:/home/ubuntu# kubectl get pods -n calico-system -o wide
NAME                                       READY   STATUS    RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES
calico-kube-controllers-6ff5954966-zh929   0/1     Pending   0          10m   <none>         <none>         <none>           <none>
calico-node-5f9g2                          0/1     Running   0          10m   <none>         k8s-master-1   <none>           <none>
calico-node-6ncbj                          0/1     Running   0          10m   <none>         k8s-worker-1   <none>           <none>
calico-typha-75b899457d-cj2vz              1/1     Running   0          10m   <none>         k8s-worker-1   <none>           <none>
csi-node-driver-bgptp                      2/2     Running   0          10m   10.244.230.1   k8s-worker-1   <none>           <none>
csi-node-driver-zbl62                      2/2     Running   0          10m   10.244.196.1   k8s-master-1   <none>           <none>

root@k8s-master-1:/home/ubuntu# kubectl get events -n calico-system --field-selector involvedObject.name=calico-typha-75b899457d-cj2vz
LAST SEEN   TYPE     REASON      OBJECT                              MESSAGE
10m         Normal   Scheduled   pod/calico-typha-75b899457d-cj2vz   Successfully assigned calico-system/calico-typha-75b899457d-cj2vz to k8s-worker-1
10m         Normal   Pulling     pod/calico-typha-75b899457d-cj2vz   Pulling image "docker.io/calico/typha:v3.27.0"
10m         Normal   Pulled      pod/calico-typha-75b899457d-cj2vz   Successfully pulled image "docker.io/calico/typha:v3.27.0" in 8.84s (8.84s including waiting)
10m         Normal   Created     pod/calico-typha-75b899457d-cj2vz   Created container calico-typha
10m         Normal   Started     pod/calico-typha-75b899457d-cj2vz   Started container calico-typha
matthewdupre commented 6 months ago

What does kubectl get endpoints say for the Typha service? Felix queries the service and then picks a Typha to connect to based on that.

rrsela commented 6 months ago

@matthewdupre it has none:

kubectl get endpoints -n calico-system
NAME           ENDPOINTS   AGE
calico-typha   <none>      10m
matthewdupre commented 6 months ago

@rrsela Yeah, that would cause this. Looking more carefully, I see kubectl is reporting no IP for Typha (and kube-controllers doesn't even have a NODE?!).

Normally you'd see the node IP appearing for Typha (and calico-node), and then everything else would probably work normally. The question now is "how is Typha running without Kubernetes knowing its IP?"
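
Worth noting that Typha runs with host networking, so the pod IP kubectl shows for it is just the node's own address - if the Node object has no addresses, the Typha pod shows <none>. A way to confirm that (illustrative command; the k8s-app=calico-typha label selector is assumed):

kubectl get pods -n calico-system -l k8s-app=calico-typha \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\t"}{.status.hostIP}{"\t"}{.status.podIP}{"\n"}{end}'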

rrsela commented 6 months ago

I tried to run Typha in debug mode but couldn't manage it in operator mode... It made me think, though: could this be a tigera-operator issue? All I can see inside its log is repeated info & error messages about installation reconciliation:

{"level":"info","ts":"2024-03-07T06:34:57Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:34:57Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:34:58Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"periodic-5m0s-reconcile-event"}
{"level":"info","ts":"2024-03-07T06:34:58Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"tigera-operator","Request.Name":"tigera-ca-private"}
{"level":"info","ts":"2024-03-07T06:34:58Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-system"}
{"level":"info","ts":"2024-03-07T06:34:59Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:34:59Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"active-operator"}
{"level":"info","ts":"2024-03-07T06:34:59Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"tigera-operator","Request.Name":"node-certs"}
{"level":"info","ts":"2024-03-07T06:35:00Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"tigera-operator","Request.Name":"typha-certs"}
{"level":"info","ts":"2024-03-07T06:35:00Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-typha"}
{"level":"info","ts":"2024-03-07T06:35:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-node"}
{"level":"info","ts":"2024-03-07T06:35:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-node"}
{"level":"info","ts":"2024-03-07T06:35:01Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-cni-plugin"}
{"level":"info","ts":"2024-03-07T06:35:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-cni-plugin"}
{"level":"info","ts":"2024-03-07T06:35:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"csi-node-driver"}
{"level":"info","ts":"2024-03-07T06:35:02Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"calico-system","Request.Name":"calico-kube-controllers"}
{"level":"info","ts":"2024-03-07T06:35:03Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico-kube-controllers"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"calico"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_windows","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_apiserver","msg":"Reconciling APIServer","Request.Namespace":"","Request.Name":"default"}
{"level":"error","ts":"2024-03-07T06:35:27Z","logger":"controller_apiserver","msg":"Waiting for Installation to be ready","Request.Namespace":"","Request.Name":"default","reason":"ResourceNotReady","stacktrace":"github.com/tigera/operator/pkg/controller/status.(*statusManager).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/status/status.go:356\ngithub.com/tigera/operator/pkg/controller/apiserver.(*ReconcileAPIServer).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/apiserver/apiserver_controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}
{"level":"info","ts":"2024-03-07T06:35:27Z","logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Namespace":"","Request.Name":"default"}
caseydavenport commented 6 months ago

and kube-controllers doesn't even have a NODE?!

This makes sense - there are no nodes that Kubernetes deems acceptable for scheduling pods yet, so it hasn't been assigned one (note the "Pending" state).

The question now is "how is Typha running without Kubernetes knowing its IP?"

Yeah, this is strange, and suggests perhaps a problem with kubelet (which reports that IP address) or with the Node objects themselves - maybe check to see if kubectl get nodes lists Node IPs for your nodes? Bit of a long shot, considering the csi daemonset is showing a pod IP:

csi-node-driver-bgptp 2/2 Running 0 10m 10.244.230.1 k8s-worker-1

kubelet logs might explain more about why that pod has no IP...

rrsela commented 6 months ago

Actually, kubectl get nodes -owide does not show any IPs:

root@k8s-master-1:/home/ubuntu# kubectl get nodes -owide
NAME           STATUS   ROLES           AGE     VERSION               INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8s-master-1   Ready    control-plane   6d12h   v1.29.0-eks-a5ec690   <none>        <none>        Ubuntu 22.04.3 LTS   5.15.0-92-generic   containerd://1.6.27
k8s-worker-1   Ready    <none>          6d12h   v1.29.0-eks-a5ec690   <none>        <none>        Ubuntu 22.04.3 LTS   5.15.0-92-generic   containerd://1.6.27

Although I do see the node's IP as a calico annotation:

root@k8s-master-1:/home/ubuntu# kubectl get node k8s-master-1 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"csi.tigera.io":"k8s-master-1"}'
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    projectcalico.org/IPv4Address: 192.168.2.57/23
    projectcalico.org/IPv4IPIPTunnelAddr: 10.244.196.0
    volumes.kubernetes.io/controller-managed-attach-detach: "true"

On a 1.28 node, by contrast, I would get the node status address attributes like these:

status:
  addresses:
  - address: XXX.XXX.XXX.XXX
    type: InternalIP
  - address: YYY
    type: InternalDNS
  - address: ZZZ
    type: Hostname

On the 1.29 node I don't see it at all...
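
A compact way to dump just the reported addresses per node (command added for illustration, not from the original thread):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses}{"\n"}{end}'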

BTW, nothing much in the kubelet log regarding Typha itself:

Mar 11 19:17:14 k8s-master-1 kubelet[3912]: I0311 19:17:14.064587    3912 topology_manager.go:215] "Topology Admit Handler" podUID="5fa6488f-3de6-45ea-84a8-1f86c525934a" podNamespace="calico-system" podName="calico-typha-6f4997b5f5-xm5sq"
Mar 11 19:17:14 k8s-master-1 kubelet[3912]: E0311 19:17:14.065560    3912 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="d6542f81-a511-4f6e-b4df-815443d49ce2" containerName="calico-typha"
Mar 11 19:17:14 k8s-master-1 kubelet[3912]: I0311 19:17:14.066887    3912 memory_manager.go:354] "RemoveStaleState removing state" podUID="d6542f81-a511-4f6e-b4df-815443d49ce2" containerName="calico-typha"
Mar 11 19:17:14 k8s-master-1 kubelet[3912]: I0311 19:17:14.190784    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"typha-certs\" (UniqueName: \"kubernetes.io/secret/5fa6488f-3de6-45ea-84a8-1f86c525934a-typha-certs\") pod \"calico-typha-6f4997b5f5-xm5sq\" (UID: \"5fa6488f-3de6-45ea-84a8-1f86c525934a\") " pod="calico-system/calico-typha-6f4997b5f5-xm5sq"
Mar 11 19:17:14 k8s-master-1 kubelet[3912]: I0311 19:17:14.190909    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-7wrc7\" (UniqueName: \"kubernetes.io/projected/5fa6488f-3de6-45ea-84a8-1f86c525934a-kube-api-access-7wrc7\") pod \"calico-typha-6f4997b5f5-xm5sq\" (UID: \"5fa6488f-3de6-45ea-84a8-1f86c525934a\") " pod="calico-system/calico-typha-6f4997b5f5-xm5sq"
Mar 11 19:17:14 k8s-master-1 kubelet[3912]: I0311 19:17:14.191161    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"tigera-ca-bundle\" (UniqueName: \"kubernetes.io/configmap/5fa6488f-3de6-45ea-84a8-1f86c525934a-tigera-ca-bundle\") pod \"calico-typha-6f4997b5f5-xm5sq\" (UID: \"5fa6488f-3de6-45ea-84a8-1f86c525934a\") " pod="calico-system/calico-typha-6f4997b5f5-xm5sq"
Mar 11 19:17:15 k8s-master-1 kubelet[3912]: I0311 19:17:15.929907    3912 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="calico-system/calico-typha-6f4997b5f5-xm5sq" podStartSLOduration=1.9298010909999999 podStartE2EDuration="1.929801091s" podCreationTimestamp="2024-03-11 19:17:14 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-03-11 19:17:15.879945351 +0000 UTC m=+563260.200073104" watchObservedRunningTime="2024-03-11 19:17:15.929801091 +0000 UTC m=+563260.249928874"

Regarding the csi-node-driver, the kubelet log shows this for csi-node-driver-8d89t (plus some other pods' entries):

Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.532167    3912 topology_manager.go:215] "Topology Admit Handler" podUID="b970f017-1c3f-4e90-b4c9-e4c03461202a" podNamespace="calico-system" podName="csi-node-driver-8d89t"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: E0311 20:06:16.532432    3912 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="0fc5905a-377c-4ff0-ab3d-5f276c081646" containerName="calico-csi"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: E0311 20:06:16.532843    3912 cpu_manager.go:395] "RemoveStaleState: removing container" podUID="0fc5905a-377c-4ff0-ab3d-5f276c081646" containerName="csi-node-driver-registrar"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.533097    3912 memory_manager.go:354] "RemoveStaleState removing state" podUID="0fc5905a-377c-4ff0-ab3d-5f276c081646" containerName="calico-csi"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.533291    3912 memory_manager.go:354] "RemoveStaleState removing state" podUID="0fc5905a-377c-4ff0-ab3d-5f276c081646" containerName="csi-node-driver-registrar"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.564026    3912 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-kubelet-dir\") pod \"0fc5905a-377c-4ff0-ab3d-5f276c081646\" (UID: \"0fc5905a-377c-4ff0-ab3d-5f276c081646\") "
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.564176    3912 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"socket-dir\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-socket-dir\") pod \"0fc5905a-377c-4ff0-ab3d-5f276c081646\" (UID: \"0fc5905a-377c-4ff0-ab3d-5f276c081646\") "
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.564711    3912 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"kube-api-access-9vwts\" (UniqueName: \"kubernetes.io/projected/0fc5905a-377c-4ff0-ab3d-5f276c081646-kube-api-access-9vwts\") pod \"0fc5905a-377c-4ff0-ab3d-5f276c081646\" (UID: \"0fc5905a-377c-4ff0-ab3d-5f276c081646\") "
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.564935    3912 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"varrun\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-varrun\") pod \"0fc5905a-377c-4ff0-ab3d-5f276c081646\" (UID: \"0fc5905a-377c-4ff0-ab3d-5f276c081646\") "
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.565110    3912 reconciler_common.go:172] "operationExecutor.UnmountVolume started for volume \"registration-dir\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-registration-dir\") pod \"0fc5905a-377c-4ff0-ab3d-5f276c081646\" (UID: \"0fc5905a-377c-4ff0-ab3d-5f276c081646\") "
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.565455    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"socket-dir\" (UniqueName: \"kubernetes.io/host-path/b970f017-1c3f-4e90-b4c9-e4c03461202a-socket-dir\") pod \"csi-node-driver-8d89t\" (UID: \"b970f017-1c3f-4e90-b4c9-e4c03461202a\") " pod="calico-system/csi-node-driver-8d89t"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.565698    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-zxhvs\" (UniqueName: \"kubernetes.io/projected/b970f017-1c3f-4e90-b4c9-e4c03461202a-kube-api-access-zxhvs\") pod \"csi-node-driver-8d89t\" (UID: \"b970f017-1c3f-4e90-b4c9-e4c03461202a\") " pod="calico-system/csi-node-driver-8d89t"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.565805    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"registration-dir\" (UniqueName: \"kubernetes.io/host-path/b970f017-1c3f-4e90-b4c9-e4c03461202a-registration-dir\") pod \"csi-node-driver-8d89t\" (UID: \"b970f017-1c3f-4e90-b4c9-e4c03461202a\") " pod="calico-system/csi-node-driver-8d89t"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.565989    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/b970f017-1c3f-4e90-b4c9-e4c03461202a-kubelet-dir\") pod \"csi-node-driver-8d89t\" (UID: \"b970f017-1c3f-4e90-b4c9-e4c03461202a\") " pod="calico-system/csi-node-driver-8d89t"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.566117    3912 reconciler_common.go:258] "operationExecutor.VerifyControllerAttachedVolume started for volume \"varrun\" (UniqueName: \"kubernetes.io/host-path/b970f017-1c3f-4e90-b4c9-e4c03461202a-varrun\") pod \"csi-node-driver-8d89t\" (UID: \"b970f017-1c3f-4e90-b4c9-e4c03461202a\") " pod="calico-system/csi-node-driver-8d89t"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.570280    3912 operation_generator.go:881] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-varrun" (OuterVolumeSpecName: "varrun") pod "0fc5905a-377c-4ff0-ab3d-5f276c081646" (UID: "0fc5905a-377c-4ff0-ab3d-5f276c081646"). InnerVolumeSpecName "varrun". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.570385    3912 operation_generator.go:881] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-socket-dir" (OuterVolumeSpecName: "socket-dir") pod "0fc5905a-377c-4ff0-ab3d-5f276c081646" (UID: "0fc5905a-377c-4ff0-ab3d-5f276c081646"). InnerVolumeSpecName "socket-dir". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.570506    3912 operation_generator.go:881] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-registration-dir" (OuterVolumeSpecName: "registration-dir") pod "0fc5905a-377c-4ff0-ab3d-5f276c081646" (UID: "0fc5905a-377c-4ff0-ab3d-5f276c081646"). InnerVolumeSpecName "registration-dir". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.570543    3912 operation_generator.go:881] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-kubelet-dir" (OuterVolumeSpecName: "kubelet-dir") pod "0fc5905a-377c-4ff0-ab3d-5f276c081646" (UID: "0fc5905a-377c-4ff0-ab3d-5f276c081646"). InnerVolumeSpecName "kubelet-dir". PluginName "kubernetes.io/host-path", VolumeGidValue ""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.581460    3912 operation_generator.go:881] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/0fc5905a-377c-4ff0-ab3d-5f276c081646-kube-api-access-9vwts" (OuterVolumeSpecName: "kube-api-access-9vwts") pod "0fc5905a-377c-4ff0-ab3d-5f276c081646" (UID: "0fc5905a-377c-4ff0-ab3d-5f276c081646"). InnerVolumeSpecName "kube-api-access-9vwts". PluginName "kubernetes.io/projected", VolumeGidValue ""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.666793    3912 reconciler_common.go:300] "Volume detached for volume \"varrun\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-varrun\") on node \"k8s-master-1\" DevicePath \"\""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.667802    3912 reconciler_common.go:300] "Volume detached for volume \"registration-dir\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-registration-dir\") on node \"k8s-master-1\" DevicePath \"\""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.668084    3912 reconciler_common.go:300] "Volume detached for volume \"kubelet-dir\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-kubelet-dir\") on node \"k8s-master-1\" DevicePath \"\""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.668325    3912 reconciler_common.go:300] "Volume detached for volume \"socket-dir\" (UniqueName: \"kubernetes.io/host-path/0fc5905a-377c-4ff0-ab3d-5f276c081646-socket-dir\") on node \"k8s-master-1\" DevicePath \"\""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.668535    3912 reconciler_common.go:300] "Volume detached for volume \"kube-api-access-9vwts\" (UniqueName: \"kubernetes.io/projected/0fc5905a-377c-4ff0-ab3d-5f276c081646-kube-api-access-9vwts\") on node \"k8s-master-1\" DevicePath \"\""
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.713516    3912 scope.go:117] "RemoveContainer" containerID="607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.740843    3912 scope.go:117] "RemoveContainer" containerID="6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.786135    3912 scope.go:117] "RemoveContainer" containerID="607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: E0311 20:06:16.787267    3912 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078\": not found" containerID="607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.787557    3912 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={"Type":"containerd","ID":"607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078"} err="failed to get container status \"607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078\": rpc error: code = NotFound desc = an error occurred when try to find container \"607585df38cd64b705edf942421dcb503c00b357922a00fc17c1cafac508d078\": not found"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.787768    3912 scope.go:117] "RemoveContainer" containerID="6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: E0311 20:06:16.788610    3912 remote_runtime.go:432] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = an error occurred when try to find container \"6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d\": not found" containerID="6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.788812    3912 pod_container_deletor.go:53] "DeleteContainer returned error" containerID={"Type":"containerd","ID":"6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d"} err="failed to get container status \"6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d\": rpc error: code = NotFound desc = an error occurred when try to find container \"6aab53426f053a92f4f81f292a7a7893b7d4dd3a0d1cbb1ae54bd9341b8bdc3d\": not found"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.888487    3912 scope.go:117] "RemoveContainer" containerID="ef18c2a80239c33c420cd3a0722fe22415b549c81417557e063db77393783341"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.888950    3912 scope.go:117] "RemoveContainer" containerID="cb197c6506be238e85c35e2bad186824ab3b7c781ab9a4d4c5d9997e232cca89"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: I0311 20:06:16.889131    3912 scope.go:117] "RemoveContainer" containerID="6e6f3d5072a5fffd8a61e72e20ddfe41fc4aa4a6da2c174cff5735d5c25d16fa"
Mar 11 20:06:16 k8s-master-1 kubelet[3912]: E0311 20:06:16.889883    3912 pod_workers.go:1298] "Error syncing pod, skipping" err="[failed to \"StartContainer\" for \"ebs-plugin\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=ebs-plugin pod=ebs-csi-node-zlrmm_kube-system(d95b1805-89de-4aee-9ccd-e504b705f301)\", failed to \"StartContainer\" for \"node-driver-registrar\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=node-driver-registrar pod=ebs-csi-node-zlrmm_kube-system(d95b1805-89de-4aee-9ccd-e504b705f301)\", failed to \"StartContainer\" for \"liveness-probe\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=liveness-probe pod=ebs-csi-node-zlrmm_kube-system(d95b1805-89de-4aee-9ccd-e504b705f301)\"]" pod="kube-system/ebs-csi-node-zlrmm" podUID="d95b1805-89de-4aee-9ccd-e504b705f301"
Mar 11 20:06:17 k8s-master-1 kubelet[3912]: I0311 20:06:17.898510    3912 kubelet_volumes.go:161] "Cleaned up orphaned pod volumes dir" podUID="0fc5905a-377c-4ff0-ab3d-5f276c081646" path="/var/lib/kubelet/pods/0fc5905a-377c-4ff0-ab3d-5f276c081646/volumes"
Mar 11 20:06:18 k8s-master-1 kubelet[3912]: I0311 20:06:18.973398    3912 csi_plugin.go:99] kubernetes.io/csi: Trying to validate a new CSI Driver with name: csi.tigera.io endpoint: /var/lib/kubelet/plugins/csi.tigera.io/csi.sock versions: 1.0.0
Mar 11 20:06:18 k8s-master-1 kubelet[3912]: I0311 20:06:18.974020    3912 csi_plugin.go:112] kubernetes.io/csi: Register new plugin with name: csi.tigera.io at endpoint: /var/lib/kubelet/plugins/csi.tigera.io/csi.sock
Mar 11 20:06:20 k8s-master-1 kubelet[3912]: I0311 20:06:20.888882    3912 scope.go:117] "RemoveContainer" containerID="e7adc3aee6692dbbf58833da569acd08408091bc898e8ecd7c8fa3d3b673b57c"
Mar 11 20:06:20 k8s-master-1 kubelet[3912]: E0311 20:06:20.889372    3912 pod_workers.go:1298] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"aws-cloud-controller-manager\" with CrashLoopBackOff: \"back-off 5m0s restarting failed container=aws-cloud-controller-manager pod=aws-cloud-controller-manager-4tsfg_kube-system(6af95188-e1cc-476c-a33b-9046efd953c9)\"" pod="kube-system/aws-cloud-controller-manager-4tsfg" podUID="6af95188-e1cc-476c-a33b-9046efd953c9"
Mar 11 20:06:20 k8s-master-1 kubelet[3912]: I0311 20:06:20.919357    3912 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="calico-system/csi-node-driver-8d89t" podStartSLOduration=4.919308302 podStartE2EDuration="4.919308302s" podCreationTimestamp="2024-03-11 20:06:16 +0000 UTC" firstStartedPulling="0001-01-01 00:00:00 +0000 UTC" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2024-03-11 20:06:18.745763747 +0000 UTC m=+566203.065891522" watchObservedRunningTime="2024-03-11 20:06:20.919308302 +0000 UTC m=+566205.239436033"
caseydavenport commented 6 months ago

@rrsela I am not entirely sure why that is, but it seems likely that the missing node IPs are causing this issue with Calico. I'd recommend investigating why those IPs stopped showing up in your environment.

rrsela commented 6 months ago

@caseydavenport the thing is, the same setup works with Flannel or Cilium, but not with Calico...

For example with Flannel:

root@demo-master-1:/# kubectl get nodes -owide
NAME            STATUS   ROLES           AGE     VERSION               INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
demo-master-1   Ready    control-plane   4m41s   v1.29.0-eks-a5ec690   192.168.2.216   <none>        Ubuntu 22.04.3 LTS   5.15.0-92-generic   containerd://1.6.27
demo-worker-1   Ready    <none>          4m22s   v1.29.0-eks-a5ec690   192.168.2.65    <none>        Ubuntu 22.04.3 LTS   5.15.0-92-generic   containerd://1.6.27

root@demo-master-1:/# kubectl get node demo-master-1 -oyaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-2a80c38632a34a80abcb2a4f55b275d3"}'
    flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"42:42:55:8f:b6:41"}'
    flannel.alpha.coreos.com/backend-type: vxlan
    flannel.alpha.coreos.com/kube-subnet-manager: "true"
    flannel.alpha.coreos.com/public-ip: 192.168.2.216
    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
...
status:
  addresses:
  - address: 192.168.2.216
    type: InternalIP
  - address: demo-master-1.symphony.local
    type: InternalDNS
  - address: demo-master-1.symphony.local
    type: Hostname
...
caseydavenport commented 5 months ago

@rrsela do your nodes show Internal IPs prior to installing a CNI plugin?

I'm a little bit stumped about why this would have anything to do with Calico, since Calico doesn't actually modify the IP addresses of the nodes.

rrsela commented 5 months ago

@caseydavenport I validated that the nodes do not have internal IPs prior to the CNI installation (also checked with Flannel to be sure) - it seems this field is populated after the CNI deployment, via the CCM, which in turn depends on the CNI in order to be deployed.

rrsela commented 5 months ago

Based on the above, I managed to replicate the issue with vanilla Kubernetes, not just EKS-D: as long as the kubelet is started with --cloud-provider=external (so the CCM sets the node IP after the CNI deployment), Typha never gets the node IP (since it runs before the CCM can run) and calico-node fails to use it...

Not sure how/why it works on 1.28, though - maybe some internal cloud-provider code removals?
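
To make the reproduction concrete: in kubeadm terms this is the standard external-cloud-provider setup, where the kubelet gets --cloud-provider=external and node addresses are left for the out-of-tree CCM to fill in. A minimal sketch of the relevant part of the kubeadm config (illustrative only, not the exact config used here):

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    cloud-provider: external   # on 1.29, node .status.addresses stay empty until the CCM sets them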

caseydavenport commented 5 months ago

Interesting - does the external cloud provider CCM pod run with host networking or with pod networking?

rrsela commented 5 months ago

@caseydavenport it runs with pod networking - at least the AWS cloud controller manager does.
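
(A quick way to confirm that from the API, using the CCM pod name that appears in the kubelet logs above - illustrative command only:)

kubectl get pod -n kube-system aws-cloud-controller-manager-4tsfg -o jsonpath='{.spec.hostNetwork}{"\n"}'
# empty or "false" means pod networking; "true" means host networking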

BTW, I found the 1.29 change regarding the default node-IP behaviour - until this PR was merged, nodes always had IPs even prior to the cloud-provider (CCM) deployment...

caseydavenport commented 5 months ago

Ok, so it sounds like because of that change nodes are not getting temporary IP addresses, which prevents calico/node from talking to the apiserver via Typha.

Since calico/node can't talk to Typha, pods never get networking and thus the external CCM isn't able to start, and we are in a deadlock situation.

These are the options that jump out at me, none of which are super appealing or necessarily a quick fix:

tomastigera commented 3 months ago

@rrsela I am closing this now, but feel free to reopen if you have any new information.