projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503 #4605

Closed pagarwal-tibco closed 3 years ago

pagarwal-tibco commented 3 years ago

I am getting the following error for the calico-node pod:

Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

Steps to Reproduce (for bugs)

I am deploying the Calico CNI in a 2-node Kubernetes kind (https://github.com/kubernetes-sigs/kind) cluster. I keep seeing liveness probe failures, with the following logs:

2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:53.213 [INFO][53] felix/health.go 196: Overall health status changed newStatus=&health.HealthReport{Live:false, Ready:false}
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:53:54.565 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:00.455 [INFO][56] monitor-addresses/startup.go 768: Using autodetected IPv4 address on interface eth0: 10.245.2.131/25
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:03.223 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:04.557 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 154: Health: not ready
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:13.187 [WARNING][53] felix/health.go 165: Health: not live
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 66: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 55: Report timed out name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:14.537 [WARNING][53] felix/health.go 154: Health: not ready
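
When a log like this gets long, it can help to tally which health reporter is failing and how often. The following is only a minimal sketch, assuming the log line format shown above; the function name `tally_unhealthy_reporters` is made up for illustration and is not part of Calico.

```python
import re
from collections import Counter

# Matches one Felix health warning such as:
#   ... felix/health.go 184: Reporter is not live. name="int_dataplane"
PATTERN = re.compile(r'Reporter is not (live|ready)\. name="([^"]+)"')


def tally_unhealthy_reporters(log_text):
    """Count (reporter name, failed check) pairs found anywhere in log_text."""
    # Collapse all whitespace first, so an entry that was wrapped across
    # two lines when pasted still matches the pattern.
    flattened = " ".join(log_text.split())
    return Counter((name, state) for state, name in PATTERN.findall(flattened))


# Three entries taken from the log above; the last one is wrapped on purpose.
sample = '''
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 184: Reporter is not live. name="int_dataplane"
2021-05-12 08:53:53.213 [WARNING][53] felix/health.go 188: Reporter is not ready. name="int_dataplane"
2021-05-12 08:54:04.558 [WARNING][53] felix/health.go 184: Reporter is not live.
name="int_dataplane"
'''

if __name__ == "__main__":
    for (name, state), count in tally_unhealthy_reporters(sample).items():
        print(f"{name}: not {state} x{count}")
```

In the log above, every such warning names the same reporter, `int_dataplane`.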

Your Environment

Can someone please help?

pagarwal-tibco commented 3 years ago

I am still facing this issue. Can someone please help?

song-jiang commented 3 years ago

@pagarwal-tibco Did you install Calico v3.18 on your kind cluster? What is the network backend, vxlan or BGP?

@neiljerram Could you help?

pagarwal-tibco commented 3 years ago

@song-jiang Calico backend is "bird".

Here is the yaml file used for deploying calico. Please note that CRDs are deployed separately. calico-all.yaml.zip

neiljerram commented 3 years ago

@pagarwal-tibco I think we will need more logs to understand this. Could you try changing

            - name: FELIX_LOGSEVERITYSCREEN
              value: "info"

to

            - name: FELIX_LOGSEVERITYSCREEN
              value: "debug"

and then redeploy, and attach one of the node logs here?

Also wondering about your KIND version and config. Here's a config sample from our own testing:

    ${KIND} create cluster --config - <<EOF
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: "192.168.128.0/17"
nodes:
# the control plane node
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF

Is yours also like that?

Our testing is using https://github.com/kubernetes-sigs/kind/releases/download/v0.8.1/kind-linux-amd64. Could you try with that version - just in case something important has changed since then in KIND master?

pagarwal-tibco commented 3 years ago

We are using KIND versions 0.9 and 0.10, as we need Kubernetes versions 1.19 and 1.20. We are using the following KIND config:

apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
  podSubnet: '192.168.0.0/16'
  serviceSubnet: '192.168.240.0/20'
  apiServerPort: 6443
nodes:
- role: control-plane
- role: worker

The calico-node debug logs are attached here: calico.log

neiljerram commented 3 years ago

@pagarwal-tibco Thanks for the log. It indicates that the Felix component does become live after a few seconds. So perhaps the liveness problem is in another component. Can you check what kubectl describe says for a calico-node pod when it is not becoming live? There should be a message that gives a bit more detail about the problem.

pagarwal-tibco commented 3 years ago

@neiljerram The calico-node pod keeps toggling between ready and not ready.

I see the following events for calico-node:

calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  36s (x17 over 9m24s)  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

pagarwal-tibco commented 3 years ago

@neiljerram I see the same problem with Calico 3.19.

  Warning  Unhealthy  13m (x14 over 23m)   kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503
  Warning  Unhealthy  3m6s (x24 over 18m)  kubelet  (combined from similar events): Readiness probe failed: 2021-05-25 06:36:25.848 [INFO][6210] confd/health.go 180: Number of node(s) with BGP peering established = 1
calico/node is not ready: felix is not ready: readiness probe reporting 503

Please let me know if you need any more information.

pagarwal-tibco commented 3 years ago

@neiljerram I am still facing this issue. Any pointers please?

pagarwal-tibco commented 3 years ago

Any updates on this issue?

lwr20 commented 3 years ago

I have seen these symptoms in a system that was starved of CPU. It might be worth trying this on a machine with more CPU?

lmm commented 3 years ago

Are the pod and service cidrs overlapping? Can you try removing the serviceSubnet?
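
One quick way to answer the overlap question is Python's standard `ipaddress` module. This is a minimal sketch: the first pair of CIDRs is taken from the KIND config posted earlier in this thread, while `10.96.0.0/16` is only an illustrative non-overlapping range, not a recommendation.

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR ranges share at least one address."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)


if __name__ == "__main__":
    # podSubnet and serviceSubnet from the config posted above
    print(cidrs_overlap("192.168.0.0/16", "192.168.240.0/20"))  # True
    # Same podSubnet against an illustrative, unrelated service range
    print(cidrs_overlap("192.168.0.0/16", "10.96.0.0/16"))  # False
```

With the posted values the service range sits entirely inside the pod range, so the two do overlap.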

pagarwal-tibco commented 3 years ago

@lmm I tried removing serviceSubnet, and I also tried a non-overlapping value, but I still see the same issue.

@lwr20 I have a good machine, and the top command shows that the CPU is idle as well.

lmm commented 3 years ago

@pagarwal-tibco are you using a Linux host? I cannot repro what you're seeing and we use kind quite a bit in our automated tests. Perhaps there is something on your host that is interfering with Calico.

If you're using a Mac, there is this kind issue that might be worth looking into: https://github.com/kubernetes-sigs/kind/issues/2308

pierluigilenoci commented 3 years ago

@caseydavenport why did you close the ticket?

neiljerram commented 3 years ago

@pierluigilenoci I presume because the OP did not respond since 13th July?

pierluigilenoci commented 3 years ago

A month is not that long. Maybe he caught COVID or is on vacation. Let's try to nudge him...

@pagarwal-tibco knock knock!

caseydavenport commented 3 years ago

A month is plenty long - we usually close tickets without a response in 2-3 weeks. We can always re-open if the OP returns.

pagarwal-tibco commented 3 years ago

Sorry for the late reply; I was away. I upgraded Docker for Mac to 3.6.0 and I can confirm that it works now. So it seems that the issue was caused by Docker for Mac.

Thanks for all the help.

neiljerram commented 3 years ago

Thanks @pagarwal-tibco !

ciiiii commented 3 years ago

I see the same problem on a k8s node (Ubuntu 18.04.5 LTS / 5.4.0-60-generic):

Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Warning  Unhealthy  11m (x8338 over 10d)  kubelet  (combined from similar events): Readiness probe failed: 2021-10-09 06:49:07.655 [INFO][27506] confd/health.go 180: Number of node(s) with BGP peering established = 76
calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  5m17s (x6281 over 20d)  kubelet  Liveness probe failed: calico/node is not ready: Felix is not live: liveness probe reporting 503

neiljerram commented 3 years ago

@ciiiii Please open a new issue, and describe

fcolista commented 2 years ago

Seems that nobody cares about this issue...

gondaz commented 2 years ago

I've been struggling with this issue for the past few days and managed to fix it by editing a ClusterRole resource. I have an RKE-based cluster (version 1.21.10), and the health check issue cropped up after I upgraded the Calico images to 3.21.5. Make sure you have the proper ClusterRole manifest, as follows (copied from the Calico website):

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: calico-node
rules:
  # The CNI plugin needs to get pods, nodes, and namespaces.
  - apiGroups: [""]
    resources:
      - pods
      - nodes
      - namespaces
    verbs:
      - get
  # EndpointSlices are used for Service-based network policy rule
  # enforcement.
  - apiGroups: ["discovery.k8s.io"]
    resources:
      - endpointslices
    verbs:
      - watch
      - list
  - apiGroups: [""]
    resources:
      - endpoints
      - services
    verbs:
      # Used to discover service IPs for advertisement.
      - watch
      - list
      # Used to discover Typhas.
      - get
  # Pod CIDR auto-detection on kubeadm needs access to config maps.
  - apiGroups: [""]
    resources:
      - configmaps
    verbs:
      - get
  - apiGroups: [""]
    resources:
      - nodes/status
    verbs:
      # Needed for clearing NodeNetworkUnavailable flag.
      - patch
      # Calico stores some configuration information in node annotations.
      - update
  # Watch for changes to Kubernetes NetworkPolicies.
  - apiGroups: ["networking.k8s.io"]
    resources:
      - networkpolicies
    verbs:
      - watch
      - list
  # Used by Calico for policy information.
  - apiGroups: [""]
    resources:
      - pods
      - namespaces
      - serviceaccounts
    verbs:
      - list
      - watch
  # The CNI plugin patches pods/status.
  - apiGroups: [""]
    resources:
      - pods/status
    verbs:
      - patch
  # Calico monitors various CRDs for config.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - globalfelixconfigs
      - felixconfigurations
      - bgppeers
      - globalbgpconfigs
      - bgpconfigurations
      - ippools
      - ipamblocks
      - globalnetworkpolicies
      - globalnetworksets
      - networkpolicies
      - networksets
      - clusterinformations
      - hostendpoints
      - blockaffinities
      - caliconodestatuses
    verbs:
      - get
      - list
      - watch
  # Calico must create and update some CRDs on startup.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ippools
      - felixconfigurations
      - clusterinformations
    verbs:
      - create
      - update
  # Calico stores some configuration information on the node.
  - apiGroups: [""]
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  # These permissions are required for Calico CNI to perform IPAM allocations.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
      - ipamblocks
      - ipamhandles
    verbs:
      - get
      - list
      - create
      - update
      - delete
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - ipamconfigs
    verbs:
      - get
  # Block affinities must also be watchable by confd for route aggregation.
  - apiGroups: ["crd.projectcalico.org"]
    resources:
      - blockaffinities
    verbs:
      - watch

Hopefully it helps.
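
To compare an existing role against a manifest like the one above, it can be handy to check whether a particular permission is granted at all. The sketch below is illustrative only: it assumes the `rules` list has already been loaded into plain Python dicts (for example from `kubectl get clusterrole calico-node -o json`), the helper name `grants` is made up, and it handles only the `*` wildcard, not every detail of Kubernetes RBAC matching.

```python
def grants(rules, api_group, resource, verb):
    """Return True if any rule allows `verb` on `resource` in `api_group`."""
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        verbs = rule.get("verbs", [])
        if ((api_group in groups or "*" in groups)
                and (resource in resources or "*" in resources)
                and (verb in verbs or "*" in verbs)):
            return True
    return False


# Two rules excerpted from the manifest above, written as Python dicts.
rules = [
    {
        "apiGroups": ["discovery.k8s.io"],
        "resources": ["endpointslices"],
        "verbs": ["watch", "list"],
    },
    {
        "apiGroups": ["crd.projectcalico.org"],
        "resources": ["blockaffinities"],
        "verbs": ["watch"],
    },
]

if __name__ == "__main__":
    print(grants(rules, "discovery.k8s.io", "endpointslices", "watch"))
    print(grants(rules, "crd.projectcalico.org", "caliconodestatuses", "get"))
```

Running the same check against both the deployed role and the reference manifest shows which permissions differ between them.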

willzhang commented 2 years ago

Upgrading the Calico version resolved my problems; see https://github.com/kubesphere/kubekey/issues/1282

rajaie-sg commented 1 year ago

We ran into a similar issue and were able to resolve it by setting CPU requests for calico-node Pods. https://github.com/projectcalico/calico/issues/3420#issuecomment-1468897178

LiAuTraver commented 6 days ago

I checked my logs and found that I had forgotten to install ipset, which Calico requires. (I was in a rush, though.) So I installed it and the problem disappeared. 😂 Maybe I'm being dumb here, but I'm posting this in case anyone runs into the same situation.
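
A quick pre-flight check for a missing binary such as `ipset` can be sketched with Python's standard `shutil.which`. This is a minimal sketch: the helper name `missing_tools` is made up, it only inspects the `PATH` of wherever the script runs, and `ipset` is the only tool named in this thread.

```python
import shutil


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["ipset"])
    if missing:
        print("Missing required tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```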