After updating our EKS 1.30 clusters (Calico installed without the Tigera operator) from 3.28.0 to 3.29.0, we later ran into an odd issue. Interestingly, the errors below did not start when we performed the Calico update; they began after the first restart of the control plane (performed automatically by Amazon / EKS).
The impact came from the failed garbage collection (GC). We hit two problems:
1) Namespace quota usage increased monotonically and quickly hit our limits, which prevented pods from being scheduled in the namespaces that had quotas
2) Pods (e.g. Job pods) whose containers had completed were left hanging around. This caused excess resource usage on our cluster and prevented new pods from being scheduled. (A quick way to check for both symptoms is sketched right after this list.)
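Both symptoms are easy to check for with plain kubectl; the commands below are illustrative rather than exactly what we ran:

```
# Quota "used" keeps climbing toward "hard" even though the matching pods are gone
kubectl get resourcequota --all-namespaces

# Completed pods (e.g. finished Job pods) that were never cleaned up
kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded
```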
We were able to fix this by rolling back the Calico change: switching back to 3.28.0, deleting the `tier` CRD, and then (on some clusters at least) restarting the control plane. For most of our clusters, the issue seemed to go away immediately after deleting the `tier` CRD: `kubectl delete crd tiers.crd.projectcalico.org`.
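For reference, the rollback amounted to roughly the following; the manifest file name is a placeholder, since the exact manifests depend on how Calico is deployed:

```
# Re-apply the Calico 3.28.0 manifests (placeholder path for a non-operator install)
kubectl apply -f calico-3.28.0.yaml

# Drop the tier CRD that 3.29.0 introduced
kubectl delete crd tiers.crd.projectcalico.org

# On some clusters we additionally needed a control plane restart
# (on EKS, requested via an AWS support ticket)
```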
Here are some relevant logs from the issue. I've added my own comments in <these blocks>.
<Calico update is performed days in advance and appears functional. Looking at the logs, though, we do see these errors even before the control plane was restarted and the cluster degraded.>
...
W1114 07:59:20.299032 11 reflector.go:547] k8s.io/client-go/metadata/metadatainformer/informer.go:138: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
E1114 07:59:20.299061 11 reflector.go:150] k8s.io/client-go/metadata/metadatainformer/informer.go:138: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
W1114 07:59:53.415144 11 reflector.go:547] k8s.io/client-go/metadata/metadatainformer/informer.go:138: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
E1114 07:59:53.415178 11 reflector.go:150] k8s.io/client-go/metadata/metadatainformer/informer.go:138: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: globalnetworkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
W1114 07:59:54.171081 11 reflector.go:547] k8s.io/client-go/metadata/metadatainformer/informer.go:138: failed to list *v1.PartialObjectMetadata: networkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
E1114 07:59:54.171110 11 reflector.go:150] k8s.io/client-go/metadata/metadatainformer/informer.go:138: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: networkpolicies.projectcalico.org is forbidden: Operation on Calico tiered policy is forbidden
...
<Control plane restart begins>
...
I1114 08:55:52.174295 11 gc_controller.go:78] Starting apiserver lease garbage collector
I1114 08:56:02.486070 11 gc_controller.go:78] Starting apiserver lease garbage collector
I1114 09:01:44.306680 11 gc_controller.go:78] Starting apiserver lease garbage collector
I1114 09:02:39.919677 11 gc_controller.go:91] Shutting down apiserver lease garbage collector
I1114 09:03:06.557262 11 gc_controller.go:91] Shutting down apiserver lease garbage collector
I1114 09:03:23.481456 11 gc_controller.go:101] "Starting GC controller" logger="pod-garbage-collector-controller"
...
<First log about quota issues, which was the most impactful issue for us>
...
E1114 09:03:53.603391 11 resource_quota_controller.go:492] timed out waiting for quota monitor sync
...
<We start seeing GC issues>
...
E1114 09:03:53.661098 11 shared_informer.go:316] unable to sync caches for garbage collector
E1114 09:03:53.661137 11 garbagecollector.go:268] timed out waiting for dependency graph builder sync during GC sync (attempt 1)
E1114 09:04:23.767288 11 shared_informer.go:316] unable to sync caches for garbage collector
E1114 09:04:23.767308 11 garbagecollector.go:268] timed out waiting for dependency graph builder sync during GC sync (attempt 2)
E1114 09:04:53.867825 11 shared_informer.go:316] unable to sync caches for garbage collector
E1114 09:04:53.867846 11 garbagecollector.go:268] timed out waiting for dependency graph builder sync during GC sync (attempt 3)
...
<These GC errors continue until we complete reverting to 3.28.0>
...
I1114 09:08:11.234790 11 gc_controller.go:91] Shutting down apiserver lease garbage collector
I1114 09:08:25.854920 11 gc_controller.go:101] "Starting GC controller" logger="pod-garbage-collector-controller"
...
The namespace quota issues can be partially explained by https://github.com/kubernetes/kubernetes/issues/98071 combined with the failed GC.
The `Operation on Calico tiered policy is forbidden` errors seem of particular interest. AWS support was able to trace this forbidden error to what appears to be a missing permission or a broken webhook:
Additionally, upon reviewing logs in the authorizer, the request was denied for the following reason:
EKS Access Policy: no rules authorize user "system:kube-controller-manager" with groups ["system:authenticated"] to "get" resource "tiers.projectcalico.org" named "adminnetworkpolicy" cluster-wide
This issue resembles scenarios where list and watch calls are blocked due to a broken webhook.
The calico-apiserver logs show a similar error. This led me to this recent issue: https://github.com/projectcalico/calico/issues/9481. Could that be the root cause of our GC problems? Could a 403 on the webhook lead to these garbage collection errors and the subsequent cluster degradation?
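For what it's worth, if the problem really is only the missing permission in the denial above (and not a deeper aggregated-API / webhook failure), the grant it is asking for would look roughly like this. This is a hypothetical sketch to make the denial concrete, not a verified fix, and on EKS the access-policy layer may still deny the request:

```
# Hypothetical RBAC grant matching the authorizer denial above; not a verified fix
kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-controller-manager-calico-tiers
rules:
- apiGroups: ["projectcalico.org"]
  resources: ["tiers"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-controller-manager-calico-tiers
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-controller-manager-calico-tiers
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:kube-controller-manager
EOF
```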
We've also seen similar errors before in previous Calico versions:
Expected Behavior
Upgrade to 3.29 should not cause cluster-wide issues
Current Behavior
Upgrading to 3.29 causes GC failures after the next control plane restart, which in turn lead to resource quota and pod cleanup issues.
Possible Solution
Resolve the 3.29 regression that we are hitting here.
Steps to Reproduce (for bugs)
1) Create an EKS 1.30 cluster running Calico 3.28.0
2) Upgrade Calico to 3.29.0, including the new CRDs (see the sketch after this list)
3) Wait for a control plane restart from Amazon (or open a support ticket and ask them to perform one)
4) Observe similar error messages in the control plane logs
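For step 2), the upgrade itself was a plain manifest apply (no Tigera operator). The file names below are placeholders, since the exact manifests depend on the install method:

```
# Apply the 3.29.0 CRDs and manifests over the existing 3.28.0 install
# (placeholder file names for a manifest-based, non-operator install)
kubectl apply -f calico-3.29.0-crds.yaml
kubectl apply -f calico-3.29.0.yaml
```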
Context
Our goal with upgrading to 3.29.0 is to unblock our upgrade to EKS 1.31. We want to stay up-to-date with the latest versions to avoid unanticipated issues.
Your Environment
Calico version: 3.29.0
Calico dataplane (iptables, windows etc.): iptables, VXLAN
Orchestrator version (e.g. kubernetes, mesos, rkt): Kubernetes, EKS 1.30