projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

Calico-kube-controllers CrashLoopBackOff #7103

Open xiaomizhg opened 1 year ago

xiaomizhg commented 1 year ago

Calico-kube-controllers Crash

Environment

Problem Description

This is my first deployment of Calico, using the manifest at https://docs.projectcalico.org/manifests/calico.yaml. Key YAML config:

When the deployment completed, the calico-kube-controllers pod showed a crash (CrashLoopBackOff) status.

(screenshot: pod status)

kubectl describe po calico-kube-controllers-767f55c9d8-ddj6n -n kube-system

(screenshot: kubectl describe output for the pod)

Then I checked the logs of the calico-kube-controllers-767f55c9d8-ddj6n pod:

2022-12-21 15:14:41.721 [INFO][1] watchersyncer.go 130: Sending status update Status=in-sync
2022-12-21 15:14:41.721 [INFO][1] syncer.go 86: Node controller syncer status updated: in-sync
2022-12-21 15:14:41.722 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2022-12-21 15:14:41.722 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2022-12-21 15:14:41.730 [INFO][1] hostendpoints.go 173: successfully synced all hostendpoints
I1221 15:14:41.818558       1 shared_informer.go:262] Caches are synced for nodes
I1221 15:14:41.818607       1 shared_informer.go:255] Waiting for caches to sync for pods
I1221 15:14:41.818630       1 shared_informer.go:262] Caches are synced for pods
2022-12-21 15:14:41.818 [INFO][1] ipam.go 253: Will run periodic IPAM sync every 7m30s
2022-12-21 15:14:41.819 [INFO][1] ipam.go 331: Syncer is InSync, kicking sync channel status=in-sync

I checked the node status with calicoctl node status (screenshot: node status).

The node status looks normal, but the calico-kube-controllers pod's readinessProbe keeps failing. I don't know what the problem is. Can someone help me?
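To see exactly which probe is failing, one option (names assume the default kube-system deployment and the pod above) is to dump the probe definition and the pod's recent events:

# Show the readinessProbe defined on the deployment and events for the failing pod.
kubectl -n kube-system get deployment calico-kube-controllers \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
kubectl -n kube-system get events \
  --field-selector involvedObject.name=calico-kube-controllers-767f55c9d8-ddj6n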

caseydavenport commented 1 year ago

Looks like a bug was somehow introduced into our status reporting logic. As a workaround, you can remove the readinessProbe from calico-kube-controllers, but obviously that's not a long-term solution. We'll need to figure out why that location isn't writable.
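One way to apply that temporary workaround (a sketch, assuming the deployment lives in kube-system as in this report) is a JSON patch that drops the probe from the first container:

# Temporary workaround only: removes the readinessProbe from the deployment.
kubectl -n kube-system patch deployment calico-kube-controllers --type=json \
  -p='[{"op":"remove","path":"/spec/template/spec/containers/0/readinessProbe"}]'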

The offending log is:

Failed to write readiness file: open /status/status.json: permission denied

For the future, it's better to use plaintext for issues rather than images - it makes them more easily searchable.

cyclinder commented 1 year ago

https://github.com/projectcalico/calico/blob/4c638885a88b2b598382b9d757b99ef90a135db2/kube-controllers/pkg/status/status.go#L136

The file is written with mode 0644, which seems right.
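Note that the 0644 mode only governs the file itself; creating /status/status.json also requires that the container's user can write to the /status directory (or to an existing status.json owned by another user), so permission denied can still occur when the pod runs as non-root. One way to check the effective user and mounts without needing a shell inside the image (pod name taken from the original report):

# Show pod- and container-level securityContext and any volume mounted at /status.
kubectl -n kube-system get pod calico-kube-controllers-767f55c9d8-ddj6n \
  -o jsonpath='{.spec.securityContext}{"\n"}{.spec.containers[0].securityContext}{"\n"}{.spec.containers[0].volumeMounts}{"\n"}'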

pullusers commented 1 year ago

You can delete the calico-kube-controllers pod; after it is recreated, it should return to normal.
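For example, assuming the k8s-app label used by the standard manifest:

# Delete the pod and let the deployment recreate it.
kubectl -n kube-system delete pod -l k8s-app=calico-kube-controllers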

caseydavenport commented 1 year ago

@xiaomizhg did you make any progress on this one?

631068264 commented 1 year ago

You can delete the calico-kube-controllers pod; after it is recreated, it should return to normal.

@pullusers No, that doesn't work for me.

Same issue on k8s 1.23.6 with rancher/mirrored-calico-kube-controllers:v3.22.0:

2023-01-20 07:07:09.901 [INFO][1] main.go 97: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0120 07:07:09.903717       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2023-01-20 07:07:09.904 [INFO][1] main.go 118: Ensuring Calico datastore is initialized
2023-01-20 07:07:09.926 [INFO][1] main.go 159: Getting initial config snapshot from datastore
2023-01-20 07:07:09.943 [INFO][1] main.go 162: Got initial config snapshot
2023-01-20 07:07:09.944 [INFO][1] watchersyncer.go 89: Start called
2023-01-20 07:07:09.944 [INFO][1] main.go 179: Starting status report routine
2023-01-20 07:07:09.944 [INFO][1] main.go 188: Starting Prometheus metrics server on port 9094
2023-01-20 07:07:09.944 [INFO][1] watchersyncer.go 127: Sending status update Status=wait-for-ready
2023-01-20 07:07:09.944 [INFO][1] main.go 463: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc0009e3cb0), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc000172460), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc0009e3c98), objectType:(*v1.Pod)(0xc000101000), resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x2e2b5b0), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil)}
2023-01-20 07:07:09.944 [INFO][1] main.go 463: Starting informer informer=&cache.sharedIndexInformer{indexer:(*cache.cache)(0xc0009e3cf8), controller:cache.Controller(nil), processor:(*cache.sharedProcessor)(0xc0001724d0), cacheMutationDetector:cache.dummyMutationDetector{}, listerWatcher:(*cache.ListWatch)(0xc0009e3ce0), objectType:(*v1.Node)(0xc00031cc00), resyncCheckPeriod:0, defaultEventHandlerResyncPeriod:0, clock:(*clock.RealClock)(0x2e2b5b0), started:false, stopped:false, startedLock:sync.Mutex{state:0, sema:0x0}, blockDeltas:sync.Mutex{state:0, sema:0x0}, watchErrorHandler:(cache.WatchErrorHandler)(nil)}
2023-01-20 07:07:09.944 [INFO][1] syncer.go 78: Node controller syncer status updated: wait-for-ready
2023-01-20 07:07:09.944 [INFO][1] main.go 469: Starting controller ControllerType="Node"
2023-01-20 07:07:09.944 [INFO][1] controller.go 189: Starting Node controller
2023-01-20 07:07:09.944 [INFO][1] watchersyncer.go 147: Starting main event processing loop
2023-01-20 07:07:09.944 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/ipam/v2/assignment/"
2023-01-20 07:07:09.944 [INFO][1] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-01-20 07:07:09.944 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.944 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.944 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.944 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.944 [INFO][1] resources.go 350: Main client watcher loop
2023-01-20 07:07:09.946 [INFO][1] watchercache.go 273: Sending synced update ListRoot="/calico/ipam/v2/assignment/"
2023-01-20 07:07:09.947 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-01-20 07:07:09.947 [INFO][1] watchersyncer.go 127: Sending status update Status=resync
2023-01-20 07:07:09.947 [INFO][1] syncer.go 78: Node controller syncer status updated: resync
2023-01-20 07:07:09.947 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.947 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.949 [ERROR][1] status.go 138: Failed to write readiness file: open /status/status.json: permission denied
2023-01-20 07:07:09.949 [WARNING][1] status.go 66: Failed to write status error=open /status/status.json: permission denied
2023-01-20 07:07:09.949 [INFO][1] watchercache.go 273: Sending synced update ListRoot="/calico/resources/v3/projectcalico.org/nodes"
2023-01-20 07:07:09.949 [INFO][1] watchersyncer.go 209: Received InSync event from one of the watcher caches
2023-01-20 07:07:09.949 [INFO][1] watchersyncer.go 221: All watchers have sync'd data - sending data and final sync
2023-01-20 07:07:09.949 [INFO][1] watchersyncer.go 127: Sending status update Status=in-sync
2023-01-20 07:07:09.949 [INFO][1] syncer.go 78: Node controller syncer status updated: in-sync
2023-01-20 07:07:09.958 [INFO][1] hostendpoints.go 177: successfully synced all hostendpoints
2023-01-20 07:07:10.044 [INFO][1] ipam.go 202: Will run periodic IPAM sync every 7m30s
2023-01-20 07:07:10.044 [INFO][1] ipam.go 280: Syncer is InSync, kicking sync channel status=in-sync
kubectl get pods -n kube-system
NAME                                      READY   STATUS    RESTARTS          AGE
calico-kube-controllers-fc7fcb565-llgwd   0/1     Running   424 (5m55s ago)   25h
canal-ccjxw                               2/2     Running   0                 25h
coredns-58d67995c7-2kt8t                  1/1     Running   0                 25h
coredns-autoscaler-7d5478875b-6r2b2       1/1     Running   0                 25h
kube-eventer-66fcb6c6c6-hsv6n             1/1     Running   0                 25h
kube-flannel-5pcd2                        2/2     Running   0                 25h
metrics-server-5c4895ffbd-kggmc           1/1     Running   0                 25h
tiansiyuan commented 1 year ago

You can delete the calico-kube-controllers pod; after it is recreated, it should return to normal.

This works for me.

caduceus4 commented 1 year ago

I had the same problem, and deleting the pod so k8s recreated it did not fix it. It turns out that calico-kube-controllers must run with enough privilege to write /status/status.json.

My installation (Rancher 2.7.3, k8s 1.25.7 on Rocky 9, with calico-kube-controllers v3.25.0) uses Gatekeeper 'Assign' resources as part of a mutating webhook that sets things like runAsUser: 65534, runAsGroup: 65534, privileged: false, allowPrivilegeEscalation: false, and others in a securityContext if they do not already exist in a podSpec. If I exempt the calico-system namespace from that mutating webhook, calico-kube-controllers can write the file just fine. I have not had to exempt the k8s namespaces such as kube-system from the webhook, which suggests the workloads running there explicitly set these fields (but I have not looked in detail).

It would be nice, especially for system-level components such as Calico, if the chart explicitly set the privileges it needs via a securityContext. https://github.com/projectcalico/calico/blob/master/charts/calico/templates/calico-kube-controllers.yaml does not set any securityContext; if it did, I would have caught what I needed to change a lot sooner. With the advent of Pod Security Standards, along with Kyverno and Gatekeeper, it would be good for system-level components such as Calico to explicitly set the privileges they need, rather than relying on defaults, to make integration easier.
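For illustration only (the UID/GID values and the emptyDir volume below are assumptions for the sketch, not the project's published defaults), an explicitly declared securityContext plus a writable volume for /status might look like this, which would also make the required privileges visible to admission controllers:

# Sketch of a calico-kube-controllers Deployment fragment; values are assumed.
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 999   # assumed non-root UID
        fsGroup: 999     # makes the emptyDir below group-writable
      containers:
        - name: calico-kube-controllers
          volumeMounts:
            - name: status
              mountPath: /status
      volumes:
        - name: status
          emptyDir: {}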

tongvt commented 1 year ago

I got the same issue. Is there any update on it?