projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico >= v3.19.0 loop-crashes on physical worker nodes #4911

Closed: ledroide closed this issue 3 years ago

ledroide commented 3 years ago

Calico >= v3.19.0 loop-crashes on physical worker nodes only.

Effects

Networking inside Kubernetes (pods, services, etc.) does not work anymore after upgrading calico to v3.19.0 or higher.

Symptoms

Summary:

Versions we have tried

Before calico-node is updated

With calico-node:v3.18.4, all pods are Running and Ready.

calicoctl node status is fine everywhere:

[root@frd3kq-k8s01g ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 10.150.236.2  | node-to-node mesh | up    | 10:10:41 | Established |
| 10.150.236.3  | node-to-node mesh | up    | 10:10:40 | Established |
| 10.150.236.4  | node-to-node mesh | up    | 10:10:41 | Established |
| 10.150.236.11 | node-to-node mesh | up    | 10:10:40 | Established |
| 10.150.236.12 | node-to-node mesh | up    | 10:10:39 | Established |
| 10.150.236.13 | node-to-node mesh | up    | 10:10:39 | Established |
+---------------+-------------------+-------+----------+-------------+

Reproduce the issue

Then, to reproduce the issue, we upgrade to any version >= v3.19.0:

kubectl set image daemonset/calico-node -n kube-system calico-node=quay.io/calico/node:v3.19.2

At first, the status is Running, but half of the worker nodes remain NotReady, so the kubelet restarts them. Then other servers get a NotReady status and are restarted in turn.

$ kubectl get pod -o wide -l k8s-app=calico-node -n kube-system --context dev
NAME                READY   STATUS             RESTARTS   AGE   IP              NODE               NOMINATED NODE   READINESS GATES
calico-node-2tvfw   1/1     Running            0          47m   10.150.236.11   kube-dev-master1   <none>           <none>
calico-node-8k7zc   0/1     CrashLoopBackOff   15         47m   10.150.236.3    frd3kq-k8s03       <none>           <none>
calico-node-92grd   0/1     CrashLoopBackOff   17         47m   10.150.236.1    frd3kq-k8s01g      <none>           <none>
calico-node-f84mm   0/1     Running            15         47m   10.150.236.4    frd3kq-k8s04       <none>           <none>
calico-node-t59s9   1/1     Running            0          47m   10.150.236.13   kube-dev-master3   <none>           <none>
calico-node-vgzng   0/1     Running            14         47m   10.150.236.2    frd3kq-k8s02g      <none>           <none>
calico-node-wj46m   1/1     Running            0          34m   10.150.236.12   kube-dev-master2   <none>           <none>

Then they crash again and again, the restart count increases, and networking for Kubernetes pods and services no longer works.

On worker nodes

A random node:

[root@frd3kq-k8s01g ~]# calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+----------+--------------------------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |              INFO              |
+---------------+-------------------+-------+----------+--------------------------------+
| 10.150.236.2  | node-to-node mesh | start | 10:11:39 | Passive Socket: Connection     |
|               |                   |       |          | closed                         |
| 10.150.236.3  | node-to-node mesh | start | 10:11:39 | OpenSent Socket: Connection    |
|               |                   |       |          | closed                         |
| 10.150.236.4  | node-to-node mesh | up    | 10:10:40 | Established                    |
| 10.150.236.11 | node-to-node mesh | start | 10:11:39 | Active Socket: Connection      |
|               |                   |       |          | refused                        |
| 10.150.236.12 | node-to-node mesh | wait  | 10:10:39 | Established                    |
| 10.150.236.13 | node-to-node mesh | up    | 10:10:39 | Established                    |
+---------------+-------------------+-------+----------+--------------------------------+

Another random node:

[root@frd3kq-k8s03 ~]# calicoctl node status
Calico process is running.
None of the BGP backend processes (BIRD or GoBGP) are running.

Yet the BGP port answers from and to every other node, e.g.:

[root@frd3kq-k8s01g ~]# nc -nzvw1 10.150.236.12 179
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.150.236.12:179.
Ncat: 0 bytes sent, 0 bytes received in 0.01 seconds.

Logs

We could not find any error message in calico pods, even when they are restarted by the kubelet.

However, when using calico-node >= 3.19.0, we can see these logs in dmesg (here calico-node:v3.19.2):

[Mon Aug 30 10:45:16 2021] IPv6: ADDRCONF(NETDEV_UP): cali7f1303b47ac: link is not ready
[Mon Aug 30 10:45:16 2021] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[Mon Aug 30 10:45:16 2021] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Aug 30 10:45:16 2021] IPv6: ADDRCONF(NETDEV_CHANGE): cali7f1303b47ac: link becomes ready
[Mon Aug 30 10:45:19 2021] IPv6: ADDRCONF(NETDEV_UP): cali5f470a664c7: link is not ready
[Mon Aug 30 10:45:19 2021] IPv6: ADDRCONF(NETDEV_CHANGE): cali5f470a664c7: link becomes ready
[Mon Aug 30 10:45:23 2021] IPv6: ADDRCONF(NETDEV_UP): cali6268b43f182: link is not ready
[Mon Aug 30 10:45:23 2021] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[Mon Aug 30 10:45:23 2021] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Mon Aug 30 10:45:23 2021] IPv6: ADDRCONF(NETDEV_CHANGE): cali6268b43f182: link becomes ready
[Mon Aug 30 11:51:37 2021] mlx5_core 0000:3b:00.0: mlx5_fw_tracer_handle_traces:708:(pid 68385): FWTracer: Events were lost
[Mon Aug 30 12:08:42 2021] IPv6: ADDRCONF(NETDEV_UP): eno3: link is not ready
[Mon Aug 30 12:08:42 2021] IPv6: ADDRCONF(NETDEV_UP): eno4: link is not ready
[Mon Aug 30 12:08:42 2021] IPv6: ADDRCONF(NETDEV_UP): ens1f0: link is not ready
[Mon Aug 30 12:08:42 2021] IPv6: ADDRCONF(NETDEV_UP): ens1f1: link is not ready

However, we get similar logs when everything runs fine with calico-node:v3.18.4, and they can also be found on servers that have no relation to this cluster. That said, they do not look relevant to this issue.

[Thu Sep  2 15:27:17 2021] IPv6: ADDRCONF(NETDEV_UP): cali3f230d9d963: link is not ready
[Thu Sep  2 15:27:17 2021] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[Thu Sep  2 15:27:17 2021] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Thu Sep  2 15:27:17 2021] IPv6: ADDRCONF(NETDEV_CHANGE): cali3f230d9d963: link becomes ready
[Thu Sep  2 15:27:17 2021] IPv6: ADDRCONF(NETDEV_UP): calid9e5cd0c443: link is not ready
[Thu Sep  2 15:27:17 2021] IPv6: ADDRCONF(NETDEV_CHANGE): calid9e5cd0c443: link becomes ready
[Thu Sep  2 15:27:18 2021] IPv6: ADDRCONF(NETDEV_UP): cali703f57c3a2d: link is not ready
[Thu Sep  2 15:27:18 2021] IPv6: ADDRCONF(NETDEV_CHANGE): cali703f57c3a2d: link becomes ready

What we have also tried

Disable IPv6 and restart NetworkManager - all nodes (no effect)

[root@frd3kq-k8s03 ~]# echo 1 > /proc/sys/net/ipv6/conf/default/disable_ipv6
[root@frd3kq-k8s03 ~]# echo 1 > /proc/sys/net/ipv6/conf/all/disable_ipv6
[root@frd3kq-k8s03 ~]# systemctl restart NetworkManager
[root@frd3kq-k8s03 ~]# systemctl restart kubelet
[root@frd3kq-k8s03 ~]# cat /proc/sys/net/ipv6/conf/default/disable_ipv6 /proc/sys/net/ipv6/conf/all/disable_ipv6
1
1
[root@frd3kq-k8s03 ~]# systemctl restart kubelet
[root@frd3kq-k8s03 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet Server
  Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Active: active (running) since Fri 2021-08-27 13:06:19 CEST; 33s ago
(...)

Disable IPv6 and reboot - All nodes (no effect)

[root@frd3kq-k8s01g ~]# echo -e "\n#Calico fix attempt \nnet.ipv6.conf.all.disable_ipv6=1\nnet.ipv6.conf.default.disable_ipv6=1" >> /etc/sysctl.d/99-sysctl.conf
[root@frd3kq-k8s01g ~]# reboot

Environment

caseydavenport commented 3 years ago

@ledroide it's possible that you're not seeing error logs because calico/node has restarted.

Could you try using the -p argument to kubectl logs to receive the previous container's log output?

e.g.,

kubectl logs -p -n kube-system <calico-node-pod-x>

Hopefully that gives us a clue.

ledroide commented 3 years ago

Thanks @caseydavenport. We are used to the kubectl logs feature. I personally use stern for debugging on the command line, which allows monitoring logs from multiple pods simultaneously: stern --container-state terminated --tail 50 calico-node outputs the last 50 lines of containers that have crashed.

Also, we use Promtail + Loki + Grafana to access logs from containers that do not run anymore.

As requested, you'll find two sections in this gist:

  1. v3.18.5 running fine + logs
  2. v3.19.2 failing + logs from previous (crashed) container
BongoEADGC6 commented 3 years ago

I have a similar issue as above. v3.19.2 fails for Ubuntu 20.04 VMs running in the same VLAN on Proxmox. However, v3.18.5 is perfectly fine.

caseydavenport commented 3 years ago

@ledroide I just noticed this part from your issue:

kubectl set image daemonset/calico-node -n kube-system calico-node=quay.io/calico/node:v3.19.2

In general this is bad practice: we can't guarantee that the same manifests that work with version X will work with version X+1. Our recommendation for upgrades is to review and apply the updated manifests for the new release, as they may contain changes required for the new version.

It's not obvious to me from the logs what is going wrong, but it feels likely that this step could be related. Are you seeing this on fresh installs if you use one of the manifests from https://docs.projectcalico.org/archive/v3.19/getting-started/kubernetes/ ?
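
For reference, a fresh manifest-based install of that release would be something like the command below; the exact manifest path is an assumption based on the usual docs layout for that archive:

# Hypothetical fresh install using the archived v3.19 manifest (path assumed)
kubectl apply -f https://docs.projectcalico.org/archive/v3.19/manifests/calico.yaml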

On a related note, the tigera-operator simplifies a lot of this upgrade, making it as simple as replacing the image, so if you are able I'd recommend taking a look at that installer.
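
As a rough sketch (the manifest URLs are assumptions based on the archived docs for that release), an operator-based install boils down to two steps:

# Install the tigera-operator and its CRDs (URL assumed)
kubectl create -f https://docs.projectcalico.org/archive/v3.19/manifests/tigera-operator.yaml
# Create the Installation custom resource that tells the operator to deploy Calico (URL assumed)
kubectl create -f https://docs.projectcalico.org/archive/v3.19/manifests/custom-resources.yaml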

@BongoEADGC6 any chance you're doing a similar workflow (starting with one version and updating the image to a new version in place)?

BongoEADGC6 commented 3 years ago

@caseydavenport Good evening. Yes, that is correct. Kubespray will typically update these image tags as part of the cluster upgrade process. What would be the best course of action for them to allow upgrades to occur? Update the tag versions and then reapply the CRDs?

ledroide commented 3 years ago

@caseydavenport: At first, it was Kubespray that updated the calico-node manifests to v3.19.2. Then the problem arose, and we searched for hours before trying to change the image tag.

In order to dispel any doubt about the Kubespray manifests being involved in this issue, I have checked the Kubespray sources: there is no change, apart from the calico-node DaemonSet image tag, that could have an effect on worker node communication. The Kubespray implementation for Calico looks fine, it has always worked, and there is no reason for this implementation to be wrong for calico-node v3.19.2.

caseydavenport commented 3 years ago

None of the BGP backend processes (BIRD or GoBGP) are running.

This error in the original post suggests that the BGP agent is failing to run. Could you check the output of the following command in one of your failing calico/node pods?

kubectl exec -ti <calico-node-name> -- ls /etc/calico/confd/config

I have not yet been able to reproduce this issue myself (it just seems to work for me), so I am guessing it is something particular to Kubespray or your environment that I am missing.

caseydavenport commented 3 years ago

What would be the best course of action for them to allow upgrades to occur? Update the tag versions and then reapply the CRDs?

The best course of action long-term is going to be to upgrade to use the operator-based install method, as it handles upgrading configuration as needed for each release.

Barring that, you will need to analyze the diff between the manifests on docs.projectcalico.org/archive/v3.18 vs the ones at docs.projectcalico.org/archive/v3.19 and make the corresponding updates when upgrading.

In the case of v3.18 -> v3.19, it looks like that diff is pretty small, but that won't always be the case.
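
For anyone who wants to reproduce it, a diff like the one below can be generated by fetching the two calico.yaml manifests and comparing them; the paths are assumptions based on the standard docs layout:

# Download both archived manifests and diff them (paths assumed)
curl -sL -o calico-v3.18.yaml https://docs.projectcalico.org/archive/v3.18/manifests/calico.yaml
curl -sL -o calico-v3.19.yaml https://docs.projectcalico.org/archive/v3.19/manifests/calico.yaml
diff calico-v3.18.yaml calico-v3.19.yaml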

513a514,520
>               bpfExtToServiceConnmark:
>                 description: 'BPFExtToServiceConnmark in BPF mode, control a 32bit
>                   mark that is set on connections from an external client to a local
>                   service. This mark allows us to control how packets of that connection
>                   are routed within the host and how is routing intepreted by RPF
>                   check. [Default: 0]'
>                 type: integer
602,603c609,610
<                 description: 'FailsafeInboundHostPorts is a comma-delimited list of
<                   UDP/TCP ports that Felix will allow incoming traffic to host endpoints
---
>                 description: 'FailsafeInboundHostPorts is a list of UDP/TCP ports
>                   and CIDRs that Felix will allow incoming traffic to host endpoints
605,609c612,616
<                   accidentally cutting off a host with incorrect configuration. Each
<                   port should be specified as tcp:<port-number> or udp:<port-number>.
<                   For back-compatibility, if the protocol is not specified, it defaults
<                   to "tcp". To disable all inbound host ports, use the value none.
<                   The default value allows ssh access and DHCP. [Default: tcp:22,
---
>                   accidentally cutting off a host with incorrect configuration. For
>                   back-compatibility, if the protocol is not specified, it defaults
>                   to "tcp". If a CIDR is not specified, it will allow traffic from
>                   all addresses. To disable all inbound host ports, use the value
>                   none. The default value allows ssh access and DHCP. [Default: tcp:22,
612,613c619,620
<                   description: ProtoPort is combination of protocol and port, both
<                     must be specified.
---
>                   description: ProtoPort is combination of protocol, port, and CIDR.
>                     Protocol and port must be specified.
614a622,623
>                     net:
>                       type: string
625,635c634,644
<                 description: 'FailsafeOutboundHostPorts is a comma-delimited list
<                   of UDP/TCP ports that Felix will allow outgoing traffic from host
<                   endpoints to irrespective of the security policy. This is useful
<                   to avoid accidentally cutting off a host with incorrect configuration.
<                   Each port should be specified as tcp:<port-number> or udp:<port-number>.
<                   For back-compatibility, if the protocol is not specified, it defaults
<                   to "tcp". To disable all outbound host ports, use the value none.
<                   The default value opens etcd''s standard ports to ensure that Felix
<                   does not get cut off from etcd as well as allowing DHCP and DNS.
<                   [Default: tcp:179, tcp:2379, tcp:2380, tcp:6443, tcp:6666, tcp:6667,
<                   udp:53, udp:67]'
---
>                 description: 'FailsafeOutboundHostPorts is a list of UDP/TCP ports
>                   and CIDRs that Felix will allow outgoing traffic from host endpoints
>                   to irrespective of the security policy. This is useful to avoid
>                   accidentally cutting off a host with incorrect configuration. For
>                   back-compatibility, if the protocol is not specified, it defaults
>                   to "tcp". If a CIDR is not specified, it will allow traffic from
>                   all addresses. To disable all outbound host ports, use the value
>                   none. The default value opens etcd''s standard ports to ensure that
>                   Felix does not get cut off from etcd as well as allowing DHCP and
>                   DNS. [Default: tcp:179, tcp:2379, tcp:2380, tcp:6443, tcp:6666,
>                   tcp:6667, udp:53, udp:67]'
637,638c646,647
<                   description: ProtoPort is combination of protocol and port, both
<                     must be specified.
---
>                   description: ProtoPort is combination of protocol, port, and CIDR.
>                     Protocol and port must be specified.
639a649,650
>                     net:
>                       type: string
3536c3547
<           image: docker.io/calico/cni:v3.18.5
---
>           image: docker.io/calico/cni:v3.19.3
3563c3574
<           image: docker.io/calico/cni:v3.18.5
---
>           image: docker.io/calico/cni:v3.19.3
3604c3615
<           image: docker.io/calico/pod2daemon-flexvol:v3.18.5
---
>           image: docker.io/calico/pod2daemon-flexvol:v3.19.3
3615c3626
<           image: docker.io/calico/node:v3.18.5
---
>           image: docker.io/calico/node:v3.19.3
3683,3685d3693
<             # Set Felix logging to "info"
<             - name: FELIX_LOGSEVERITYSCREEN
<               value: "info"
3825c3833
<           image: docker.io/calico/kube-controllers:v3.18.5
---
>           image: docker.io/calico/kube-controllers:v3.19.3
3831a3840,3847
>           livenessProbe:
>             exec:
>               command:
>               - /usr/bin/check-status
>               - -l
>             periodSeconds: 10
>             initialDelaySeconds: 10
>             failureThreshold: 6
3836a3853
>             periodSeconds: 10
caseydavenport commented 3 years ago

As requested, you'll find two sections in this gist:

@ledroide I suspect the reason I see nothing interesting there is because the error happens earlier in the log. It sounds likely to me that for some reason BIRD isn't starting, and then the liveness check is eventually failing and Kubernetes restarts the pod.
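
One way to confirm that hypothesis is to look at the pod events, since probe failures are recorded there; for example (the pod name is a placeholder):

# Probe failures appear in the Events section as "Liveness probe failed" /
# "Readiness probe failed" messages
kubectl -n kube-system describe pod <calico-node-pod-x>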

ledroide commented 3 years ago

@caseydavenport: you had the right intuition. The origin of the issue is the time taken by the command executed as the liveness and readiness probes in the DaemonSet/calico-node. Apparently, it takes more processing time since calico-node:v3.19.0.

The secondary issue concerns the DaemonSet template from the Kubespray implementation:

          livenessProbe:
            exec:
              command:
              - /bin/calico-node
              - -felix-live
{% if calico_network_backend|default("bird") == "bird" %}
              - -bird-live
{% endif %}
            periodSeconds: 10
            initialDelaySeconds: 10
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
              - /bin/calico-node
{% if calico_network_backend|default("bird") == "bird" %}
              - -bird-ready
{% endif %}
              - -felix-ready
            periodSeconds: 10
            failureThreshold: 6

You will notice that there is no timeoutSeconds defined, so it takes the default value, which is 1.

$ kubectl explain pod.spec.containers.livenessProbe.timeoutSeconds
KIND:     Pod
VERSION:  v1
FIELD:    timeoutSeconds <integer>
DESCRIPTION:
     Number of seconds after which the probe times out. Defaults to 1 second.
     Minimum value is 1. More info:
     https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes

New setting: timeoutSeconds added to the liveness and readiness probes.

This setting runs fine, as expected.
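
For illustration only (the exact values we applied are not shown above; the value 10 and the container index 0 are assumptions), explicit probe timeouts can also be set in place with a JSON patch:

# Add timeoutSeconds to both probes of the calico-node container (index 0 assumed)
kubectl -n kube-system patch daemonset calico-node --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 10},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 10}
]'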

However, I have tried v3.20.0 and 3.20.1, and they remain in NotReady status. I have tried higher readinessProbe.timeoutSeconds values, but they did not reach Ready status anyway.

I have also checked Calico's implementation in this repository, in the doc getting-started/kubernetes/hardway/install-node.md.

It does not set the timeoutSeconds values either, which means Kubernetes will also default to 1 second. Maybe that explains why Kubespray cherry-picked the same settings from your repository:

          livenessProbe:
            httpGet:
              path: /liveness
              port: 9099
              host: localhost
            periodSeconds: 10
            initialDelaySeconds: 10
            failureThreshold: 6
          readinessProbe:
            exec:
              command:
              - /bin/calico-node
              - -bird-ready
              - -felix-ready
            periodSeconds: 10

I'm opening issue #7979 in Kubespray.

caseydavenport commented 3 years ago

Thanks for following up with the great explanation @ledroide

However, I have tried v3.20.0 and 3.20.1, and they remain in NotReady status.

This bit is still concerning to me and probably needs more investigation. Will check to see if there is a manifest diff missing.

ledroide commented 3 years ago

This issue can be closed, since the root cause is identified and my case with the timeoutSeconds values is solved by the approval of the Kubespray PR.

kabakaev commented 3 years ago

However, I have tried v3.20.0 and 3.20.1, and they remain in NotReady status.

This bit is still concerning to me and probably needs more investigation. Will check to see if there is a manifest diff missing.

I had the same issue when upgrading from v3.19.3 to v3.20.2. Replacing the image tag leads to a permanent NotReady state with logs full of:

2021-11-09 15:51:50.231 [WARNING][83] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-11-09 15:51:50.231 [WARNING][83] felix/health.go 188: Reporter is not ready. name="async_calc_graph"
2021-11-09 15:51:50.231 [WARNING][83] felix/health.go 154: Health: not ready

In my case, the reason was an outdated calico-node ClusterRole. Applying the ClusterRoles from the v3.20 manifests immediately resolved the issue.

Namely, this rule was missing:

  # EndpointSlices are used for Service-based network policy rule
  # enforcement.
  - apiGroups: ["discovery.k8s.io"]
    resources:
      - endpointslices
    verbs:
      - watch
      - list
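
A quick way to check whether the calico-node service account already has this permission (the service account name and namespace are assumed from the standard manifest install):

# Returns "yes" when the ClusterRole grants list access to EndpointSlices
kubectl auth can-i list endpointslices.discovery.k8s.io \
  --as=system:serviceaccount:kube-system:calico-node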

Is it a bug in Felix? It probably should not fail if RBAC does not allow reading EndpointSlices, or at least it should print something helpful in the log.