[SURE-6125] Master node failures due to fleet high resource usage

kkaempf commented 1 year ago

Internal reference: SURE-6125

Issue description:

In one of the downstream clusters, the master nodes have been failing one after another, each consuming all available CPU, memory and I/O. The cluster runs OPA Gatekeeper, Fleet and custom operators that create network policies. A single GitRepo handles 68 bundles and creates 565 resources. Scaling the fleet-agent down to zero fixes the issue, and scaling it back up brings the issue back. Of the 206481 requests counted on the API server, 134584 came from system:serviceaccount:cattle-fleet-system:fleet-agent.

Business impact:

CI/CD cannot be used because the cluster fails whenever fleet-agent is enabled.

Troubleshooting steps:

1) One of the 3 downstream master machines starts consuming all CPU and RAM.
2) The volume of I/O increases sharply, up to 1 Gb/s.
3) When this occurs, the rke2-server service fails and restarts many times.
4) If the machine was the etcd leader, this triggers a leader election.
5) With no intervention, after a few hours a second downstream master also consumes all CPU and RAM.
6) The etcd cluster then fails because 2 of the 3 machines are down, and the downstream cluster becomes unavailable in the Rancher UI.

Repro steps:

Scaled the fleet-agent deployment down to 0: no issues for a day.
Scaled fleet-agent back up to 1 replica: it failed within 30 minutes, and even faster later on.

Workaround:

Is a workaround available and implemented? No

Actual behavior:

Downstream cluster fails when fleet agents are enabled

Expected behavior:

Cluster should work flawlessly with fleet-agent enabled

Files, logs, traces:

Count of users that made the request

2568 "system:node:il0prmb00593"
3093 "system:serviceaccount:trident-system:trident-operator"
4424 "system:serviceaccount:calico-apiserver:calico-apiserver"
5372 "system:admin"
5771 "rke2-cloud-controller-manager"
6470 "system:apiserver"
6574 "system:kube-scheduler"
8052 "system:serviceaccount:kube-system:generic-garbage-collector"
8235 "system:kube-controller-manager"
8247 "system:serviceaccount:kube-system:resourcequota-controller"
10215 "system:serviceaccount:cattle-logging-system:rancher-logging"
134496 "system:serviceaccount:cattle-fleet-system:fleet-agent"
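
The issue does not say how these per-user counts were produced. As a minimal sketch, assuming the source is a Kubernetes API server audit log written as one JSON event per line (where each event carries a `user.username` field), a similar tally could be computed as follows. The function names and the sample events are hypothetical and only illustrate the technique.

```python
import json
from collections import Counter
from typing import Iterable


def count_requests_by_user(lines: Iterable[str]) -> Counter:
    """Count audit events per user.username, skipping unusable lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or non-JSON lines
        if not isinstance(event, dict):
            continue
        # Assumption: audit.k8s.io/v1 style events with a nested "user" object.
        username = (event.get("user") or {}).get("username")
        if username:
            counts[username] += 1
    return counts


def share(counts: Counter, username: str) -> float:
    """Fraction of all counted requests made by one user (0.0 if empty)."""
    total = sum(counts.values())
    return counts[username] / total if total else 0.0


# Hypothetical sample events, not taken from the affected cluster.
FLEET_AGENT = "system:serviceaccount:cattle-fleet-system:fleet-agent"
sample_lines = [
    json.dumps({"user": {"username": FLEET_AGENT}, "requestURI": "/api/v1/services"}),
    json.dumps({"user": {"username": FLEET_AGENT}, "requestURI": "/api/v1/namespaces"}),
    json.dumps({"user": {"username": "system:admin"}, "requestURI": "/api/v1/pods"}),
    "not json",
]
sample_counts = count_requests_by_user(sample_lines)
```

Applied to the totals quoted in the issue description, 134584 of 206481 requests is roughly 65% of all API server traffic coming from the fleet-agent service account.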

Request URI from system:serviceaccount:cattle-fleet-system:fleet-agent

1975 "/apis/logging.banzaicloud.io/v1beta1/clusterflows?allowWatchBookmarks=true&resourceVersion=2214745&watch=true"
1976 "/api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=5886&watch=true"
1976 "/api/v1/persistentvolumeclaims?allowWatchBookmarks=true&resourceVersion=96417993&watch=true"
1976 "/api/v1/resourcequotas?allowWatchBookmarks=true&resourceVersion=96367099&watch=true"
1976 "/api/v1/services?allowWatchBookmarks=true&resourceVersion=104025240&watch=true"
1976 "/apis/admissionregistration.k8s.io/v1/mutatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=96363739&watch=true"
1976 "/apis/admissionregistration.k8s.io/v1/validatingwebhookconfigurations?allowWatchBookmarks=true&resourceVersion=104025410&watch=true"
1976 "/apis/apiextensions.k8s.io/v1/customresourcedefinitions?allowWatchBookmarks=true&resourceVersion=2213700&watch=true"
1976 "/apis/apps/v1/statefulsets?allowWatchBookmarks=true&resourceVersion=104006438&watch=true"
1976 "/apis/cis.cattle.io/v1/clusterscanbenchmarks?allowWatchBookmarks=true&resourceVersion=65627187&watch=true"
1976 "/apis/cis.cattle.io/v1/clusterscanprofiles?allowWatchBookmarks=true&resourceVersion=65627201&watch=true"
1976 "/apis/cis.cattle.io/v1/clusterscans?allowWatchBookmarks=true&resourceVersion=82780008&watch=true"
1976 "/apis/config.gatekeeper.sh/v1alpha1/configs?allowWatchBookmarks=true&resourceVersion=19791262&watch=true"
1976 "/apis/logging.banzaicloud.io/v1beta1/clusteroutputs?allowWatchBookmarks=true&resourceVersion=7676714&watch=true"
1976 "/apis/logging.banzaicloud.io/v1beta1/flows?allowWatchBookmarks=true&resourceVersion=2214747&watch=true"
1976 "/apis/logging.banzaicloud.io/v1beta1/loggings?allowWatchBookmarks=true&resourceVersion=65605326&watch=true"
1976 "/apis/monitoring.coreos.com/v1/alertmanagers?allowWatchBookmarks=true&resourceVersion=65610070&watch=true"
1976 "/apis/monitoring.coreos.com/v1/prometheuses?allowWatchBookmarks=true&resourceVersion=96411195&watch=true"
1976 "/apis/monitoring.coreos.com/v1/prometheusrules?allowWatchBookmarks=true&resourceVersion=65610117&watch=true"
1976 "/apis/monitoring.coreos.com/v1/servicemonitors?allowWatchBookmarks=true&resourceVersion=65610122&watch=true"
1976 "/apis/mutations.gatekeeper.sh/v1beta1/assignmetadata?allowWatchBookmarks=true&resourceVersion=103993976&watch=true"
1976 "/apis/networking.k8s.io/v1/ingresses?allowWatchBookmarks=true&resourceVersion=104007495&watch=true"
1976 "/apis/networking.k8s.io/v1/networkpolicies?allowWatchBookmarks=true&resourceVersion=89644199&watch=true"
1976 "/apis/policy/v1/poddisruptionbudgets?allowWatchBookmarks=true&resourceVersion=104006738&watch=true"
1976 "/apis/policy/v1beta1/podsecuritypolicies?allowWatchBookmarks=true&resourceVersion=2215154&watch=true"
1976 "/apis/storage.k8s.io/v1/storageclasses?allowWatchBookmarks=true&resourceVersion=96407007&watch=true"
1976 "/apis/templates.gatekeeper.sh/v1beta1/constrainttemplates?allowWatchBookmarks=true&resourceVersion=103994049&watch=true"
1976 "/apis/trident.netapp.io/v1/tridentbackendconfigs?allowWatchBookmarks=true&resourceVersion=96451682&watch=true"
1976 "/apis/trident.netapp.io/v1/tridentorchestrators?allowWatchBookmarks=true&resourceVersion=104023562&watch=true"
3904 "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-agent-lock"
5856 "/apis/coordination.k8s.io/v1/namespaces/cattle-fleet-system/leases/fleet-agent-lock"
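
The URIs above fall into two groups: watch requests (`watch=true`) that were each issued about 1976 times per resource type, and requests against the `fleet-agent-lock` ConfigMap and Lease. A rough sketch of how such a list could be grouped is shown below. The `classify` and `summarize` helpers are hypothetical names, the "leader-election" label is an interpretation of the `fleet-agent-lock` objects, and the three rows are copied from the listing above.

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import parse_qs, urlsplit


def classify(uri: str) -> str:
    """Label a request URI as a watch, a fleet-agent-lock request, or other."""
    parts = urlsplit(uri)
    query = parse_qs(parts.query)
    if query.get("watch") == ["true"]:
        return "watch"
    if parts.path.endswith("/fleet-agent-lock"):
        return "leader-election"
    return "other"


def summarize(rows: Iterable[Tuple[int, str]]) -> Counter:
    """Sum request counts per class from (count, uri) pairs."""
    totals: Counter = Counter()
    for count, uri in rows:
        totals[classify(uri)] += count
    return totals


# Three rows taken from the listing above.
rows = [
    (1976, "/api/v1/namespaces?allowWatchBookmarks=true&resourceVersion=5886&watch=true"),
    (3904, "/api/v1/namespaces/cattle-fleet-system/configmaps/fleet-agent-lock"),
    (5856, "/apis/coordination.k8s.io/v1/namespaces/cattle-fleet-system/leases/fleet-agent-lock"),
]
totals = summarize(rows)
```

Grouping this way makes it easier to see whether the load comes from many resource types each being re-watched the same number of times, as the near-identical counts in the listing suggest, or from a few hot endpoints.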

Additional notes: Debug logs from the fleet agent are attached.

davidborg-tech commented 1 year ago

@kkaempf was one of the resources managed by the single GitRepo itself another GitRepo CRD, by any chance? Also, was there a large amount of network traffic between the master nodes (specifically between etcd nodes)?

kkaempf commented 1 year ago

> @kkaempf was one of the resources managed by the single GitRepo itself another GitRepo CRD, by any chance? Also, was there a large amount of network traffic between the master nodes (specifically between etcd nodes)?

I didn't encounter this problem. I just copied it here from another system. 🤷🏻‍♂️

kkaempf commented 1 year ago

/cc @moio

manno commented 1 year ago

This is related to https://github.com/rancher/fleet/pull/1485

moio commented 1 year ago

This is related to #1485

On top of that PR, https://github.com/rancher/wrangler/pull/305 and https://github.com/rancher/fleet/pull/1607 help.

Still more is needed to fix fully, and that will be discussed in-person next week.

kkaempf commented 1 year ago

@moio @manno - can we close this issue? If not, what's missing?

moio commented 1 year ago

With #1738 completed (#1809 merged) it is my understanding we got most of the solution for this problem.

I would like to have an answer to this follow-up question and ideally try the solution out with the affected customer (either via a new fleet version or a debug image).

AFAIK customer is still using a debug image with #1609 as the temporary workaround, now superseded by #1809.

manno commented 1 year ago

We implemented the cache and are planning one more fix for retrieving Helm secrets. Please contact @raulcabello next week.
