Closed: kkaempf closed this issue 1 year ago.
@kkaempf was one of the resources managed by the single GitRepo itself another GitRepo CRD, by any chance? Also, was there a large amount of network traffic between the master nodes (specifically between the etcd nodes)?
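For anyone checking the second question: a rough way to look at etcd peer traffic is to read the peer byte counters from the etcd metrics endpoint on a master node. This is only a sketch; the certificate paths below are assumptions for an RKE2-managed etcd and may differ on your installation.

```sh
# Dump etcd peer traffic counters (bytes sent/received to each peer).
# Certificate paths are assumed for RKE2 (/var/lib/rancher/rke2/...); adjust as needed.
curl -s \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert   /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key    /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  https://127.0.0.1:2379/metrics | grep -E 'etcd_network_peer_(sent|received)_bytes_total'
```

Comparing these counters over time (or plotting the same metrics in Prometheus, if scraped) shows whether inter-etcd traffic is unusually high.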
I didn't encounter this problem. I just copied it here from another system. 🤷🏻‍♂️
/cc @moio
This is related to https://github.com/rancher/fleet/pull/1485
On top of that PR, https://github.com/rancher/wrangler/pull/305 and https://github.com/rancher/fleet/pull/1607 help.
Still more is needed to fully fix this; that will be discussed in person next week.
@moio @manno - can we close this issue? If not, what's missing?
With #1738 completed (#1809 merged), it is my understanding that we have most of the solution for this problem.
I would like to have an answer to this follow-up question and ideally try the solution out with the affected customer (either via a new fleet version or a debug image).
AFAIK the customer is still using a debug image with #1609 as a temporary workaround, now superseded by #1809.
We implemented the cache and are planning one more fix to retrieving helm secrets. Please contact @raulcabello next week.
Internal reference: SURE-6125
Issue description:
In one of the downstream clusters, the master nodes have been failing one by one, each consuming all CPU, memory and I/O. The cluster runs OPA Gatekeeper, Fleet and custom operators that create network policies. There is a single GitRepo which handles 68 bundles and creates 565 resources. Scaling the fleet-agent down to zero fixes the issue; enabling it again brings the problem back. Checking the number of requests hitting the API server showed that 134584 of 206481 requests came from system:serviceaccount:cattle-fleet-system:fleet-agent.
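For reference, a per-user request count like the one above can be obtained by grouping the kube-apiserver audit log by username. This is only a sketch: it assumes audit logging is enabled and writing JSON lines, and the log path below is an assumption for an RKE2 control-plane node.

```sh
# Count API requests per user from the kube-apiserver audit log (JSON lines).
# Path is assumed for RKE2; adjust to wherever your audit policy writes the log.
jq -r '.user.username' /var/lib/rancher/rke2/server/logs/audit.log \
  | sort | uniq -c | sort -rn | head -n 20
```

The top entry in such output is what pointed at system:serviceaccount:cattle-fleet-system:fleet-agent here.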
Business impact:
Not able to use CI/CD because the fleet-agent causes issues when enabled.
Troubleshooting steps:
1) One of the 3 downstream master machines starts consuming all CPU and RAM.
2) The volume of I/O increases a lot, up to 1 Gb/s.
3) When this occurs, the rke2-server service fails and restarts many times.
4) If the machine was the etcd leader, this triggers a leader election.
5) With no intervention, after a few hours a second downstream master also consumes all CPU and RAM.
6) The etcd cluster then fails because 2 of the 3 machines are down, and the downstream cluster becomes unavailable in the Rancher UI.
Repro steps:
Scaled the fleet-agent deployment down to 0 replicas: no issues for a day. Scaled it back up to 1 replica: the cluster failed within 30 minutes, and even faster on later attempts. See the scaling sketch below.
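A minimal sketch of the scaling steps, assuming the agent runs as the fleet-agent deployment in the cattle-fleet-system namespace (the namespace named in the service account above):

```sh
# Stop the fleet-agent and confirm the cluster stabilizes
kubectl -n cattle-fleet-system scale deployment fleet-agent --replicas=0

# Re-enable it to reproduce the problem
kubectl -n cattle-fleet-system scale deployment fleet-agent --replicas=1
```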
Workaround:
Is a workaround available and implemented? No
Actual behavior:
Downstream cluster fails when the fleet-agent is enabled
Expected behavior:
Cluster should work flawlessly with fleet-agent enabled
Files, logs, traces:
[Screenshot: count of requests per user]
[Screenshot: request URIs from system:serviceaccount:cattle-fleet-system:fleet-agent]
Additional notes: Debug logs from the fleet-agent are attached