Calico 3.29.0 leading to kubernetes controller manager GC failures #9527

Open adkafka opened 5 hours ago

adkafka commented 5 hours ago

After updating our EKS 1.30 clusters from Calico 3.28.0 (installed without the Tigera operator) to 3.29.0, we later ran into an odd issue on those clusters. Interestingly, the errors below did not start when we performed the Calico update; they began only after the first restart of the control plane (performed automatically by Amazon/EKS).

The impact we noticed came from the failed GC, in two forms:

  1. Namespace quota usage increased monotonically and quickly hit our limits, which prevented pods from being scheduled in the namespaces that had quotas.
  2. Pods whose containers had completed (e.g. Job pods) were left hanging around, causing excess resource usage on the cluster and also preventing new pods from being scheduled.
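A quick way to spot both symptoms from the command line (illustrative only; these are generic kubectl queries, not specific to Calico):

# Completed pods that GC should have cleaned up but that are still present
kubectl get pods -A --field-selector=status.phase==Succeeded
# Quota objects whose "used" numbers keep climbing even though workloads are stable
kubectl get resourcequota -A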

We were able to fix this by rolling back the Calico change: switching back to 3.28.0, deleting the tier CRD, and then (on some clusters at least) restarting the control plane. For most of our clusters, the issue seemed to go away immediately after deleting the tier CRD: kubectl delete crd tiers.crd.projectcalico.org.
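For reference, the rollback amounted to something like the following. This is only a sketch, assuming a plain manifest install (no Tigera operator); the exact manifest to re-apply depends on how Calico was installed in the first place.

# Re-apply the 3.28.0 manifests (adjust to match the original install method)
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.28.0/manifests/calico.yaml
# Remove the tier CRD that ships with 3.29
kubectl delete crd tiers.crd.projectcalico.org
# Confirm it is gone
kubectl get crd tiers.crd.projectcalico.org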

Here are some relevant logs from the issue. I've added my own comments in <these blocks>.

<Calico update is performed days in advance and appears functional. Looking at the logs, though, we do see these errors before the control plane was restarted and the cluster degraded.>
...
W1114 07:59:20.299032      11 reflector.go:547] k8s.io/client-go/metadata/metadatainformer/informer.go:138: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
E1114 07:59:20.299061      11 reflector.go:150] k8s.io/client-go/metadata/metadatainformer/informer.go:138: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
W1114 07:59:53.415144      11 reflector.go:547] k8s.io/client-go/metadata/metadatainformer/informer.go:138: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
E1114 07:59:53.415178      11 reflector.go:150] k8s.io/client-go/metadata/metadatainformer/informer.go:138: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
W1114 07:59:54.171081      11 reflector.go:547] k8s.io/client-go/metadata/metadatainformer/informer.go:138: failed to list *v1.PartialObjectMetadata: networkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
E1114 07:59:54.171110      11 reflector.go:150] k8s.io/client-go/metadata/metadatainformer/informer.go:138: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: networkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
...
<Control plane restart begins>
...
I1114 08:55:52.174295      11 gc_controller.go:78] Starting apiserver lease garbage collector
I1114 08:56:02.486070      11 gc_controller.go:78] Starting apiserver lease garbage collector
I1114 09:01:44.306680      11 gc_controller.go:78] Starting apiserver lease garbage collector
I1114 09:02:39.919677      11 gc_controller.go:91] Shutting down apiserver lease garbage collector
I1114 09:03:06.557262      11 gc_controller.go:91] Shutting down apiserver lease garbage collector
I1114 09:03:23.481456      11 gc_controller.go:101] "Starting GC controller" logger="pod-garbage-collector-controller"
...
<First log about quota issues, which was the most impactful issue for us>
...
E1114 09:03:53.603391      11 resource_quota_controller.go:492] timed out waiting for quota monitor sync
...
<We start seeing GC issues>
...
E1114 09:03:53.661098      11 shared_informer.go:316] unable to sync caches for garbage collector
E1114 09:03:53.661137      11 garbagecollector.go:268] timed out waiting for dependency graph builder sync during GC sync (attempt 1)
E1114 09:04:23.767288      11 shared_informer.go:316] unable to sync caches for garbage collector
E1114 09:04:23.767308      11 garbagecollector.go:268] timed out waiting for dependency graph builder sync during GC sync (attempt 2)
E1114 09:04:53.867825      11 shared_informer.go:316] unable to sync caches for garbage collector
E1114 09:04:53.867846      11 garbagecollector.go:268] timed out waiting for dependency graph builder sync during GC sync (attempt 3)
...
<These GC errors continue until we complete reverting to 3.28.0>
...
I1114 09:08:11.234790      11 gc_controller.go:91] Shutting down apiserver lease garbage collector
I1114 09:08:25.854920      11 gc_controller.go:101] "Starting GC controller" logger="pod-garbage-collector-controller"
...

The namespace quota issues can be partially explained by https://github.com/kubernetes/kubernetes/issues/98071 and the failed GCs.
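One way to see the divergence is to compare what the quota controller has recorded as used against what actually exists in the namespace (a rough sketch; <namespace> is a placeholder):

# What the resource quota controller thinks is consumed
kubectl describe resourcequota -n <namespace>
# What is actually present in that namespace
kubectl get pods -n <namespace>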

The Operation on Calico tiered policy is forbidden errors seem particularly interesting. Through AWS support, we were able to trace this authorization error to what appears to be a missing permission or a broken webhook:

Additionally, upon reviewing logs in the authorizer, the request was denied for the following reason:

EKS Access Policy: no rules authorize user "system:kube-controller-manager" with groups ["system:authenticated"] to "get" resource "tiers.projectcalico.org" named "adminnetworkpolicy" cluster-wide

This issue resembles scenarios where list and watch calls are blocked due to a broken webhook.
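One way to confirm that the denial is tied to the controller-manager identity (rather than a general API outage) is to replay the failing LIST while impersonating that user. A sketch, assuming your own credentials are allowed to impersonate:

# Replay the LIST issued by the metadata informer, as the controller manager's identity
kubectl get globalnetworkpolicies.projectcalico.org \
  --as=system:kube-controller-manager \
  --as-group=system:authenticated \
  -v=6   # verbose output shows the HTTP status; a 403 matches the errors above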

The calico-apiserver logs show a similar error:

I1114 08:59:53.150547       1 httplog.go:132] "HTTP" verb="LIST" URI="/apis/projectcalico.org/v3/globalnetworkpolicies?resourceVersion=2329233791" latency="8.014323ms" userAgent="kube-controller-manager/v1.30.5 (linux/amd64) kubernetes/194b08f/metadata-informers" audit-ID="c09811c4-7bf4-4f04-8e51-647bc6bf9d6c" srcIP="redacted" resp=403

This led me to this recent issue: https://github.com/projectcalico/calico/issues/9481. Could that be the root cause of our GC problems? Could a 403 on the webhook lead to these garbage collection errors and the subsequent cluster degradation?
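If the missing "get" on tiers really is the trigger, one possible stopgap (untested on our side, and not a substitute for whatever fix comes out of #9481) would be to grant the controller manager read access to tiers. A minimal sketch, with hypothetical object names:

kubectl apply -f - <<'EOF'
# Hypothetical workaround: let kube-controller-manager read Calico tiers so that
# tiered-policy LIST/WATCH calls from its metadata informers are not rejected.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-controller-manager-calico-tiers   # hypothetical name
rules:
  - apiGroups: ["projectcalico.org"]
    resources: ["tiers"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-controller-manager-calico-tiers   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-controller-manager-calico-tiers
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: system:kube-controller-manager
EOF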

We've also seen similar errors in previous Calico versions:

Expected Behavior

Upgrading to 3.29 should not cause cluster-wide issues.

Current Behavior

Upgrading to 3.29 (after a control plane restart) causes GC failures, which in turn lead to resource quota and pod cleanup issues.

Possible Solution

Resolve the regression in 3.29 that we are hitting here.

Steps to Reproduce (for bugs)

  1. Create EKS 1.30 cluster using Calico 3.28.0
  2. Upgrade Calico to 3.29.0, including the new CRDs
  3. Wait for a control plane restart from Amazon (or open up a support ticket to ask them to perform one)
  4. Witness similar error messages in the control plane logs (a log-query sketch follows this list)
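For step 4, if control plane logging is enabled on the cluster, the GC sync failures can be pulled straight out of CloudWatch. A sketch using the AWS CLI, with <cluster-name> as a placeholder:

# Search the kube-controller-manager log streams for the GC sync failures
aws logs filter-log-events \
  --log-group-name "/aws/eks/<cluster-name>/cluster" \
  --log-stream-name-prefix kube-controller-manager \
  --filter-pattern '"timed out waiting for dependency graph builder sync"'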

Context

Our goal with upgrading to 3.29.0 is to unblock our upgrade to EKS 1.31. We want to stay up-to-date with the latest versions to avoid unanticipated issues.

Your Environment