Closed: ledroide closed this issue 3 years ago.
@ledroide it's possible that you're not seeing error logs because calico/node has restarted.
Could you try using the -p
argument to kubectl logs to receive the previous container's log output?
e.g.,
kubectl logs -p -n kube-system <calico-node-pod-x>
hopefully that gives us a clue.
Thanks @caseydavenport. We are familiar with the kubectl logs feature. I personally use stern for debugging on the command line, which allows monitoring logs from multiple pods simultaneously: stern --container-state terminated --tail 50 calico-node
outputs the last 50 lines of containers that have crashed.
Also, we use Promtail + Loki + Grafana to access logs from containers that are no longer running.
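For instance, a hypothetical LogQL query for this setup, assuming Promtail's usual namespace and pod labels, would look like:
{namespace="kube-system", pod=~"calico-node-.*"}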
As you requested, you'll find two sections in this gist:
I have a similar issue to the one above. v3.19.2 fails for Ubuntu 20.04 VMs running in the same VLAN on Proxmox. However, v3.18.5 is perfectly fine.
@ledroide I just noticed this part from your issue:
kubectl set image daemonset/calico-node -n kube-system calico-node=quay.io/calico/node:v3.19.2
In general this is a bad practice - we can't guarantee that the same manifests that work with version X will work with version X+1. Our recommendation for upgrade is to review and apply the updated manifests for the new release, as they may have required changes for the new version.
It's not obvious to me from the logs what is going wrong, but it feels likely that this step could be related. Are you seeing this on fresh installs if you use one of the manifests from https://docs.projectcalico.org/archive/v3.19/getting-started/kubernetes/ ?
On a related note, the tigera-operator simplifies a lot of this upgrade so that it becomes as simple as replacing the image, so if you are able I'd recommend taking a look at that installer.
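For reference, a rough sketch of the operator-based install flow; the manifest URLs here are assumptions based on the docs layout of that era, so check the install guide for the release you actually target:
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml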
@BongoEADGC6 any chance you're doing a similar workflow (starting with one version and updating the image to a new version in-place)?
@caseydavenport Good evening. Yes, that is correct. Kubespray will typically update these image tags as part of the cluster upgrade process. What would be the best course of action to allow these upgrades to occur? Update the tag versions and then reapply the CRDs?
@caseydavenport: At first, it was Kubespray that updated the calico-node manifests to v3.19.2. Then the problem arose, and we searched for hours before trying to change the image tag.
To dispel any doubt about the Kubespray manifests' involvement in this issue, I have checked the Kubespray sources: there is no change, except the calico-node DaemonSet image tag, that could affect worker node communication. The Kubespray implementation for Calico looks fine, it has always worked, and there is no reason for this implementation to be wrong for calico-node v3.19.2.
None of the BGP backend processes (BIRD or GoBGP) are running.
This error in the original post suggests that the BGP agent is failing to run. Could you check the output of the following command in one of your failing calico/node pods?
kubectl exec -ti <calico-node-name> -- ls /etc/calico/confd/config
I have not yet been able to reproduce this issue myself; it just seems to work for me, so I'm guessing it is something particular about Kubespray or your environment that I am missing.
What would be the best course of action to allow these upgrades to occur? Update the tag versions and then reapply the CRDs?
The best course of action long-term is going to be to upgrade to use the operator-based install method, as it handles upgrading configuration as needed for each release.
Barring that, you will need to analyze the diff between the manifests on docs.projectcalico.org/archive/v3.18 vs the ones at docs.projectcalico.org/archive/v3.19 and make the corresponding updates when upgrading.
In the case of v3.18 -> v3.19, it looks like that diff is pretty small, but that won't always be the case.
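As an illustration, a diff like the one below can be generated along these lines (the archive manifest URLs and the calico.yaml variant are assumptions; use the manifest you actually deploy):
curl -s https://docs.projectcalico.org/archive/v3.18/manifests/calico.yaml -o calico-v3.18.yaml
curl -s https://docs.projectcalico.org/archive/v3.19/manifests/calico.yaml -o calico-v3.19.yaml
diff calico-v3.18.yaml calico-v3.19.yaml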
513a514,520
> bpfExtToServiceConnmark:
> description: 'BPFExtToServiceConnmark in BPF mode, control a 32bit
> mark that is set on connections from an external client to a local
> service. This mark allows us to control how packets of that connection
> are routed within the host and how is routing intepreted by RPF
> check. [Default: 0]'
> type: integer
602,603c609,610
< description: 'FailsafeInboundHostPorts is a comma-delimited list of
< UDP/TCP ports that Felix will allow incoming traffic to host endpoints
---
> description: 'FailsafeInboundHostPorts is a list of UDP/TCP ports
> and CIDRs that Felix will allow incoming traffic to host endpoints
605,609c612,616
< accidentally cutting off a host with incorrect configuration. Each
< port should be specified as tcp:<port-number> or udp:<port-number>.
< For back-compatibility, if the protocol is not specified, it defaults
< to "tcp". To disable all inbound host ports, use the value none.
< The default value allows ssh access and DHCP. [Default: tcp:22,
---
> accidentally cutting off a host with incorrect configuration. For
> back-compatibility, if the protocol is not specified, it defaults
> to "tcp". If a CIDR is not specified, it will allow traffic from
> all addresses. To disable all inbound host ports, use the value
> none. The default value allows ssh access and DHCP. [Default: tcp:22,
612,613c619,620
< description: ProtoPort is combination of protocol and port, both
< must be specified.
---
> description: ProtoPort is combination of protocol, port, and CIDR.
> Protocol and port must be specified.
614a622,623
> net:
> type: string
625,635c634,644
< description: 'FailsafeOutboundHostPorts is a comma-delimited list
< of UDP/TCP ports that Felix will allow outgoing traffic from host
< endpoints to irrespective of the security policy. This is useful
< to avoid accidentally cutting off a host with incorrect configuration.
< Each port should be specified as tcp:<port-number> or udp:<port-number>.
< For back-compatibility, if the protocol is not specified, it defaults
< to "tcp". To disable all outbound host ports, use the value none.
< The default value opens etcd''s standard ports to ensure that Felix
< does not get cut off from etcd as well as allowing DHCP and DNS.
< [Default: tcp:179, tcp:2379, tcp:2380, tcp:6443, tcp:6666, tcp:6667,
< udp:53, udp:67]'
---
> description: 'FailsafeOutboundHostPorts is a list of UDP/TCP ports
> and CIDRs that Felix will allow outgoing traffic from host endpoints
> to irrespective of the security policy. This is useful to avoid
> accidentally cutting off a host with incorrect configuration. For
> back-compatibility, if the protocol is not specified, it defaults
> to "tcp". If a CIDR is not specified, it will allow traffic from
> all addresses. To disable all outbound host ports, use the value
> none. The default value opens etcd''s standard ports to ensure that
> Felix does not get cut off from etcd as well as allowing DHCP and
> DNS. [Default: tcp:179, tcp:2379, tcp:2380, tcp:6443, tcp:6666,
> tcp:6667, udp:53, udp:67]'
637,638c646,647
< description: ProtoPort is combination of protocol and port, both
< must be specified.
---
> description: ProtoPort is combination of protocol, port, and CIDR.
> Protocol and port must be specified.
639a649,650
> net:
> type: string
3536c3547
< image: docker.io/calico/cni:v3.18.5
---
> image: docker.io/calico/cni:v3.19.3
3563c3574
< image: docker.io/calico/cni:v3.18.5
---
> image: docker.io/calico/cni:v3.19.3
3604c3615
< image: docker.io/calico/pod2daemon-flexvol:v3.18.5
---
> image: docker.io/calico/pod2daemon-flexvol:v3.19.3
3615c3626
< image: docker.io/calico/node:v3.18.5
---
> image: docker.io/calico/node:v3.19.3
3683,3685d3693
< # Set Felix logging to "info"
< - name: FELIX_LOGSEVERITYSCREEN
< value: "info"
3825c3833
< image: docker.io/calico/kube-controllers:v3.18.5
---
> image: docker.io/calico/kube-controllers:v3.19.3
3831a3840,3847
> livenessProbe:
> exec:
> command:
> - /usr/bin/check-status
> - -l
> periodSeconds: 10
> initialDelaySeconds: 10
> failureThreshold: 6
3836a3853
> periodSeconds: 10
As you requested, you'll find two sections in this gist:
@ledroide I suspect the reason I see nothing interesting there is because the error happens earlier in the log. It sounds likely to me that for some reason BIRD isn't starting, and then the liveness check is eventually failing and Kubernetes restarts the pod.
@caseydavenport: you had the right intuition. The origin of the issue is the time taken by the command executed by the liveness and readiness probes in the calico-node DaemonSet. Apparently, it takes more processing time since calico-node:v3.19.0.
The secondary issue concerns the DaemonSet template from the Kubespray implementation:
livenessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-live
{% if calico_network_backend|default("bird") == "bird" %}
      - -bird-live
{% endif %}
  periodSeconds: 10
  initialDelaySeconds: 10
  failureThreshold: 6
readinessProbe:
  exec:
    command:
      - /bin/calico-node
{% if calico_network_backend|default("bird") == "bird" %}
      - -bird-ready
{% endif %}
      - -felix-ready
  periodSeconds: 10
  failureThreshold: 6
You will notice that no timeoutSeconds is defined, so it takes the default value, which is 1 second.
$ kubectl explain pod.spec.containers.livenessProbe.timeoutSeconds
KIND: Pod
VERSION: v1
FIELD: timeoutSeconds <integer>
DESCRIPTION:
Number of seconds after which the probe times out. Defaults to 1 second.
Minimum value is 1. More info:
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle#container-probes
New setting: after adding timeoutSeconds to the liveness and readiness probes, quay.io/calico/node:v3.19.2 runs fine, as expected.
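For illustration, a minimal sketch of the adjusted probes; the timeoutSeconds value of 10 is an assumption, the point being simply to raise it above the 1-second default:
livenessProbe:
  exec:
    command:
      - /bin/calico-node
      - -felix-live
      - -bird-live
  periodSeconds: 10
  initialDelaySeconds: 10
  timeoutSeconds: 10  # assumed value; gives the probe command room to finish
  failureThreshold: 6
readinessProbe:
  exec:
    command:
      - /bin/calico-node
      - -bird-ready
      - -felix-ready
  periodSeconds: 10
  timeoutSeconds: 10  # assumed value
  failureThreshold: 6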
However, I have tried v3.20.0 and v3.20.1, and they remain in NotReady status. I have tried higher readinessProbe.timeoutSeconds values, but they did not reach Ready status either.
I have also checked Calico's implementation in this repository. The doc getting-started/kubernetes/hardway/install-node.md does not set the timeoutSeconds values either, which means Kubernetes will also default to 1 second. Maybe that explains why Kubespray cherry-picked the same settings from your repository:
livenessProbe:
  httpGet:
    path: /liveness
    port: 9099
    host: localhost
  periodSeconds: 10
  initialDelaySeconds: 10
  failureThreshold: 6
readinessProbe:
  exec:
    command:
      - /bin/calico-node
      - -bird-ready
      - -felix-ready
  periodSeconds: 10
I'm opening issue #7979 in Kubespray.
Thanks for following up with the great explanation @ledroide
However, I have tried v3.20.0 and 3.20.1, and they remain in NotReady status.
This bit is still concerning to me and probably needs more investigation. Will check to see if there is a manifest diff missing.
This issue can be closed, since the root cause is identified and my case with the timeoutSeconds values is solved by the Kubespray PR approval.
However, I have tried v3.20.0 and 3.20.1, and they remain in NotReady status.
This bit is still concerning to me and probably needs more investigation. Will check to see if there is a manifest diff missing.
I had the same issue when upgrading from v3.19.3 to v3.20.2. Replacing the image tag leads to a permanent NotReady state with logs full of:
2021-11-09 15:51:50.231 [WARNING][83] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-11-09 15:51:50.231 [WARNING][83] felix/health.go 188: Reporter is not ready. name="async_calc_graph"
2021-11-09 15:51:50.231 [WARNING][83] felix/health.go 154: Health: not ready
In my case, the reason was the outdated ClusterRole calico-node. Applying the ClusterRoles from the v3.20 manifests immediately resolved the issue.
Namely, this rule was missing:
# EndpointSlices are used for Service-based network policy rule
# enforcement.
- apiGroups: ["discovery.k8s.io"]
  resources:
    - endpointslices
  verbs:
    - watch
    - list
Is it a bug in felix? Probably it should not fail if RBAC does not allow reading EndpointSlices. Or, maybe, it should print something helpful in the log.
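For anyone hitting this, a quick way to check whether the calico-node ServiceAccount can read EndpointSlices (assuming the default kube-system/calico-node ServiceAccount name) is:
kubectl auth can-i list endpointslices.discovery.k8s.io --as=system:serviceaccount:kube-system:calico-node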
Calico >= v3.19.0 crash-loops on physical worker nodes only.
Effects
Networking inside Kubernetes (pods, services, etc.) no longer works after upgrading Calico to v3.19.0 or higher.
Symptoms
Summary :
Version we have tried
Before calico-node is updated
With calico-node:v3.18.4, all pods are Running and Ready.
calicoctl node status is fine everywhere :
Reproduce the issue
Then, to reproduce the issue, we upgrade to any version >= v3.19.0
At first, the status is Running, but the calico-node pods on half of the worker nodes remain NotReady, and the kubelet restarts them. Then the pods on the other servers get a NotReady status and are restarted in turn.
They keep crashing again and again, the restart count increases, and the network for Kubernetes pods and services no longer works.
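A quick way to watch this happening, assuming the standard k8s-app=calico-node label on the DaemonSet pods:
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide --watch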
On worker nodes
A random node:
Another random node:
Whereas the BGP port answers from and to every other node, for example:
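(The actual outputs are in the gist; purely as an illustration, this kind of check can be done with netcat, where the address below is a placeholder:)
nc -zv <other-node-ip> 179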
Logs
We could not find any error message in calico pods, even when they are restarted by the kubelet.
However, when using calico-node >= 3.19.0, we can see these logs in dmesg (here calico-node:v3.19.2):
But we get similar logs when everything runs fine with calico-node:v3.18.4, and they can also be found on servers that have no relation to this cluster. That said, they do not look relevant to this issue.
What we have also tried
Disable IPv6 and restart NetworkManager - all nodes (no effect)
Disable IPv6 and reboot - all nodes (no effect)
Environment