rancher / fleet

Deploy workloads from Git to large fleets of Kubernetes clusters
https://fleet.rancher.io/
Apache License 2.0

[SURE-3711] Long time for changes to propagate from GitOps to cluster state #1069

Open mirzak opened 1 year ago

mirzak commented 1 year ago

Current Behavior

We are experiencing very slow Bundle deployments in one of our largest clusters. We can see that most of the time is spent in WaitApplied, and it takes up to 10-15 minutes for a Bundle to transition back to Ready after a modification.

When we measure the time, we ignore any actions that require fetching container images etc.; e.g. changing replicaCount: 0 takes 10-15 minutes to propagate from the PR merge in our GitOps repo until it is actually applied in the cluster.
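To make the measurement concrete, this is the kind of one-line change we time, shown here as a minimal fleet.yaml sketch (namespace, chart path and values are illustrative, not our actual config):

```yaml
# Illustrative fleet.yaml for a Helm-packaged bundle; the merged PR
# changes nothing except replicaCount.
defaultNamespace: example        # placeholder namespace
helm:
  chart: ./chart                 # placeholder chart path
  values:
    replicaCount: 0              # the one-line change we time from merge to apply
```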

It is not clear at the moment where the time is spent, but we suspect the fleet-agent, given the scale (number of Bundles) we are running at.

We have several smaller clusters that do not experience the same problem.

Expected Behavior

I would expect changes to propagate in less than 60 seconds, assuming they do not require fetching new images etc. Is that a reasonable expectation?

This is measured from the PR merge in the GitOps repo to the cluster state being updated and the Bundle reaching Ready again.

Steps To Reproduce

Not clear. This is most likely related to our environment and scale; see the Environment section below.

Environment

- Architecture: amd64
- Fleet Version: v0.3.11 2958e9b
- Cluster:
  - Provider: Rancher + k3s
  - Options:
      - 450 nodes
      - 1260 Bundles
      - 4400 pods
      - 5697 resources
  - Kubernetes Version: v1.22.12+k3s1

Logs

As we have a very large number of Bundles, I will hold off on pasting output that might not be relevant.

Happy to provide specific logs if they would help debug this issue further.

Anything else?

Based on our observations, we do not seem to be limited by CPU/RAM/network bandwidth on our "master" nodes where the fleet-agent is running; we really have monsters of machines :)

mirzak commented 1 year ago

I would appreciate some feedback from the Fleet team on this issue.

manno commented 1 year ago

Interesting. Are you using webhooks or do you set the pollingInterval for git repos? And do you use a "rolloutStrategy" with partitions?

The status of the affected bundle would be interesting. Bundle resources are created by the gitjob. Then the fleet-controller creates a bundledeployment and a content resource for each bundle/cluster combination. It would be interesting to see whether those are created correctly for the deployments that are stuck in WaitApplied. The bundledeployment references a content resource; the reference should point to the latest one.

Only then can the agent deploy the bundle on a downstream cluster.

The calculation of WaitApplied is quite involved and is called from several places: https://github.com/rancher/fleet/blob/master/pkg/summary/summary.go#L16. Maybe a tuned rollout strategy can help? Are there any failed/modified bundles?
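For reference, the summary that drives those states lives on the bundle's status and looks roughly like this (counts invented; field names taken from the summary package linked above, so they may differ slightly between Fleet versions):

```yaml
# Rough shape of a Bundle status summary; values made up
status:
  summary:
    desiredReady: 10
    ready: 7
    waitApplied: 3
```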

mirzak commented 1 year ago

Are you using webhooks or do you set the pollingInterval for git repos?

We are using pollingInterval, every 5 seconds I believe.
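For reference, our GitRepo objects are configured roughly like this (name, repo URL and paths are placeholders):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: cluster-2                 # placeholder name
  namespace: fleet-default
spec:
  repo: https://example.com/ops/gitops.git   # placeholder URL
  branch: main
  pollingInterval: 5s             # the 5-second polling mentioned above
  paths:
    - bundles                     # placeholder path
```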

And do you use a "rolloutStrategy" with partitions?

No, so defaults should apply.

The status of the affected bundle would be interesting. Bundle resources are created by the gitjob.

The time mentioned (10-15 minutes) is from monitoring the Bundle resource.

Then the fleet-controller creates a a bundledeployment and a content resource for each bundle/cluster combination. It would be interesting to see if those are created correctly for deployments that are stuck in WaitApplied.

Will have a closer look at the BundleDeployment and come back to you.

Maybe a tuned rollout strategy can help?

Do you mean cluster partitioning?
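If so, I assume that would be something like the following fleet.yaml sketch (partition names and selectors made up; we do not set any of this today):

```yaml
# Illustrative rolloutStrategy excerpt for fleet.yaml, not our current config
rolloutStrategy:
  maxUnavailablePartitions: 1     # roll out to one partition at a time
  partitions:
    - name: canary                # hypothetical partition
      maxUnavailable: 10%
      clusterSelector:
        matchLabels:
          env: canary
    - name: rest                  # hypothetical partition
      clusterSelector:
        matchLabels:
          env: production
```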

Let me describe our setup in a bit more detail.

We have one Rancher/Fleet instance managing 2 cluster groups with a total of 10 clusters.

We have 6 GitRepo objects (one for each cluster).

We also have a "generic" GitRepo that applies changes to all clusters.

The k3s clusters:

| Name | Resources | Nodes | Deployments |
| --- | --- | --- | --- |
| Cluster 1 | 425 | 7 | 27 |
| Cluster 2 | 3771 | 436 | 1073 |
| Cluster 3 | 1561 | 83 | 342 |
| Cluster 4 | 1108 | 171 | 220 |
| Cluster 5 | 2183 | 231 | 1096 |
| Cluster 6 | 348 | 8 | 29 |

This means that the same fleet-controller manages all these clusters.

We are only experiencing problems with Cluster 2, which is why I suspect a downstream problem, but we have not been able to pinpoint it.

Are there any failed/modified bundles?

Yes, there typically are, as the cluster is in a "fluid state". We made an attempt to remove any failing/modified bundles to make the cluster "green", but we did not see any impact on deployment times.

kkaempf commented 1 year ago

SURE-3711

manno commented 3 months ago

Let's install a cluster with about a hundred nodes and try to replicate this.

manno commented 5 days ago

Yes, there typically are, as the cluster is in a "fluid state". We made an attempt to remove any failing/modified bundles to make the cluster "green", but we did not see any impact on deployment times.

That's interesting. I would have expected that to reduce the overall number of events. Probably not enough.

In Fleet < 0.10 the agent reconciler, which installs the bundles on clusters, only has 5 workers to accept events. When a gitrepo creates, say, 50 bundles for that agent, it will work on 5 in parallel. (Resources change more than once during an installation, so it's a bit worse.)

Fleet 0.10.3/0.10.4 will increase this number. We plan to make it configurable in the future.
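For context, the knob involved is controller-runtime's per-controller concurrency. Here is a minimal sketch of a reconciler wired with a higher worker count, assuming the standard controller-runtime builder API; it is illustrative, not Fleet's actual source:

```go
// Minimal controller-runtime sketch (illustrative, not Fleet's code):
// MaxConcurrentReconciles caps how many objects a controller reconciles in parallel.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

type reconciler struct{}

func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// One object per event; with N workers, N of these calls run in parallel.
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}). // stand-in for the agent's BundleDeployment type
		WithOptions(controller.Options{
			MaxConcurrentReconciles: 50, // effectively 5 in Fleet < 0.10
		}).
		Complete(&reconciler{}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```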

manno commented 5 days ago

@mirzak In case you're still around for this old issue, can you retry once 2.9.3 (Fleet >=0.10.3) is released?

mirzak commented 5 days ago

I no longer work on that particular project but @tmartensson might be interested.