Closed pagarwal-tibco closed 3 years ago
I am still facing this issue. Can someone please help?
@pagarwal-tibco Did you install Calico v3.18 on your kind cluster? What is the network backend, VXLAN or BGP?
@neiljerram Could you help?
@song-jiang Calico backend is "bird".
Here is the yaml file used for deploying calico. Please note that CRDs are deployed separately. calico-all.yaml.zip
@pagarwal-tibco I think we will need more logs to understand this. Could you try changing
- name: FELIX_LOGSEVERITYSCREEN
  value: "info"
to
- name: FELIX_LOGSEVERITYSCREEN
  value: "debug"
and then redeploy, and attach one of the node logs here?
Also wondering about your KIND version and config. Here's a config sample from our own testing:
${KIND} create cluster --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "192.168.128.0/17"
nodes:
# the control plane node
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
Is yours also like that?
Our testing is using https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64. Could you try with that version - just in case something important has changed since then in KIND master?
We are using KIND versions 0.9 and 0.10 because we need Kubernetes versions 1.19 and 1.20. We are using the following KIND config:
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: '192.168.0.0/16'
  serviceSubnet: '192.168.240.0/20'
  apiServerPort: 6443
nodes:
- role: control-plane
- role: worker
Calico node debug logs are here: calico.log
@pagarwal-tibco Thanks for the log. It indicates that the Felix component does become live after a few seconds, so perhaps the liveness problem is in another component. Can you check what kubectl describe says for a calico-node pod when it is not becoming live? There should be a message that gives a bit more detail about the problem.
@neiljerram The calico-node pod keeps toggling between ready and not ready.
I see the following events for calico-node:
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 36s (x17 over 9m24s) kubelet Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
@neiljerram I see the same problem with Calico 3.19.
Warning Unhealthy 13m (x14 over 23m) kubelet Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
Warning Unhealthy 3m6s (x24 over 18m) kubelet (combined from similar events): Readiness probe failed: 2021-05-25 06:36:25.848 [INFO][6210] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503
Please let me know if you need any more information.
@neiljerram I am still facing this issue. Any pointers please?
Any updates on this issue?
I have seen these symptoms in a system that was starved of CPU. It might be worth trying this on a machine with more CPU?
Are the pod and service CIDRs overlapping? Can you try removing the serviceSubnet?
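For anyone who wants to check the overlap question quickly, here is a minimal sketch using Python's standard ipaddress module. The podSubnet and serviceSubnet values are the ones from the KIND config posted above; the 10.96.0.0/16 range is only an illustrative non-overlapping value and does not come from this thread.

```python
import ipaddress

# Values taken from the KIND config posted earlier in this thread.
pod_subnet = ipaddress.ip_network("192.168.0.0/16")
service_subnet = ipaddress.ip_network("192.168.240.0/20")

# Illustrative assumption: a service range outside the pod range.
example_service_subnet = ipaddress.ip_network("10.96.0.0/16")


def cidrs_overlap(a, b):
    """Return True if the two networks share at least one address."""
    return a.overlaps(b)


if __name__ == "__main__":
    print("posted config overlaps:", cidrs_overlap(pod_subnet, service_subnet))
    print("example config overlaps:", cidrs_overlap(pod_subnet, example_service_subnet))
```

With the posted values the service range sits entirely inside the pod range, which is why removing or changing serviceSubnet was worth trying, even though it turned out not to be the cause here.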
@lmm I tried removing serviceSubnet and also using a non-overlapping value, but I still see the same issue.
@lwr20 I have a good machine, and the top command shows that the CPU is idle as well.
@pagarwal-tibco are you using a Linux host? I cannot repro what you're seeing and we use kind quite a bit in our automated tests. Perhaps there is something on your host that is interfering with Calico.
If you're using a Mac, there is this kind issue that might be worth looking into: https://github.com/kubernetes-sigs/kind/issues/2308
@caseydavenport why did you close the ticket?
@pierluigilenoci I presume because the OP did not respond since 13th July?
A month is not that long. Maybe he caught COVID or is on vacation. Let's try to nudge him...
@pagarwal-tibco knock knock!
A month is plenty long - we usually close tickets without a response in 2-3 weeks. We can always re-open if the OP returns.
Sorry for the late reply, I was away. I upgraded Docker for Mac to 3.6.0 and I can confirm that it works now. So it seems that the issue was caused by Docker for Mac.
Thanks for all the help.
Thanks @pagarwal-tibco !
I see the same problem on a k8s node (Ubuntu 18.04.5 LTS, kernel 5.4.0-60-generic):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 11m (x8338 over 10d) kubelet (combined from similar events): Readiness probe failed: 2021-10-09 06:49:07.655 [INFO][27506] confd/health.go 180: Number of node(s) with BGP peering established = 76
calico/node is not ready: felix is not ready: readiness probe reporting 503
Warning Unhealthy 5m17s (x6281 over 20d) kubelet Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
@ciiiii Please open a new issue, and describe
Seems that nobody cares about this issue...
I've been struggling with this issue for the past few days and managed to fix it by editing a ClusterRole resource. I have an RKE-based cluster (version 1.21.10), and after I upgraded the Calico-related images to 3.21.5, the health check issue appeared. Make sure you have the proper ClusterRole manifest, as follows (copied from the Calico website):
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # The CNI plugin needs to get pods, nodes, and namespaces.
  - apiGroups: [""]
    resources:
      - pods
      - nodes
      - namespaces
    verbs:
      - get
  # EndpointSlices are used for Service-based network policy rule
  # enforcement.
  - apiGroups: ["discovery.k8s.io"]
    resources:
      - endpointslices
    verbs:
      - watch
      - list
  - apiGroups: [""]
    resources:
      - endpoints
      - services
    verbs:
      # Used to discover service IPs for advertisement.
      - watch
      - list
      # Used to discover Typhas.
      - get
  # Pod CIDR auto-detection on kubeadm needs access to config maps.
  - apiGroups: [""]
    resources:
      - configmaps
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - nodes/status
    verbs:
      # Needed for clearing NodeNetworkUnavailable flag.
      - patch
      # Calico stores some configuration information in node annotations.
      - update
  # Watch for changes to Kubernetes NetworkPolicies.
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
    verbs:
      - watch
      - list
  # Used by Calico for policy information.
  - apiGroups: [""]
    resources:
      - pods
      - namespaces
      - serviceaccounts
    verbs:
      - list
      - watch
  # The CNI plugin patches pods/status.
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - patch
  # Calico monitors various CRDs for config.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - felixconfigurations
      - bgppeers
      - globalbgpconfigs
      - bgpconfigurations
      - ippools
      - ipamblocks
      - globalnetworkpolicies
      - globalnetworksets
      - networkpolicies
      - networksets
      - clusterinformations
      - hostendpoints
      - blockaffinities
      - caliconodestatuses
    verbs:
      - get
      - list
      - watch
  # Calico must create and update some CRDs on startup.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
      - felixconfigurations
      - clusterinformations
    verbs:
      - create
      - update
  # Calico stores some configuration information on the node.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  # These permissions are required for Calico CNI to perform IPAM allocations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ipamconfigs
    verbs:
      - get
  # Block affinities must also be watchable by confd for route aggregation.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
    verbs:
      - watch
Hopefully it helps.
Upgrading the Calico version resolved my problem, see https://github.com/kubesphere/kubekey/issues/1282
We ran into a similar issue and were able to resolve it by setting CPU requests for calico-node Pods. https://github.com/projectcalico/calico/issues/3420#issuecomment-1468897178
I checked my log and found that I had forgotten to install ipset, which Calico requires. (I was in a rush, though.)
So I just installed it, and then the problem disappeared. 😂
Maybe I'm being dumb here, but I'm posting this in case anyone runs into the same situation.
Getting the following error for the calico-node pod:
Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
Steps to Reproduce (for bugs)
I am deploying the Calico CNI in a 2-node Kubernetes KIND (https://github.com/kubernetes-sigs/kind) cluster. I keep seeing liveness probe failures, with the following logs:
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:53.213 [INFO][53] felix/health.go 196: Overall health status changed newStatus=&health.HealthReport{Live:false, Ready:false}
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:00.455 [INFO][56] monitor-addresses/startup.go 768: Using autodetected IPv4 address on interface eth0: 10.245.2.131/25
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:04.557 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 154: Health: not ready
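When a node log is long, it can help to tally which Felix health reporter is timing out. The following is a rough sketch, not part of Calico: the regular expression and the summarize helper are my own assumptions about the log layout, based only on the lines quoted above.

```python
import re
from collections import Counter

# Assumed layout, inferred from the lines quoted in this issue:
# <date> <time> [LEVEL][pid] <source file> <line>: <message> [name="<reporter>"]
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) '
    r'\[(?P<level>\w+)\]\[\d+\] '
    r'(?P<source>\S+ \d+): '
    r'(?P<message>.*?)'
    r'(?: name="(?P<name>[^"]*)")?$'
)


def summarize(lines):
    """Count felix/health.go messages, keyed by (message, reporter name)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("source").startswith("felix/health.go"):
            counts[(match.group("message"), match.group("name"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"',
        '2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"',
        '2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 165: Health: not live',
        '2021-05-12 08:54:00.455 [INFO][56] monitor-addresses/startup.go 768: Using autodetected IPv4 address on interface eth0: 10.245.2.131/25',
    ]
    for (message, name), count in summarize(sample).items():
        print(count, message, name or "")
```

In the logs above, every timed-out report names the int_dataplane reporter, which suggests the dataplane loop is the part that stops reporting in time.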
Your Environment
Can someone please help?