mirzak opened this issue 1 year ago
Would appreciate some feedback from the fleet team on this issue.
Interesting. Are you using webhooks or do you set the pollingInterval for git repos?
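For reference, polling is configured on the GitRepo resource itself. A minimal sketch, with placeholder names and URL (pollingInterval is a real field of the fleet.cattle.io/v1alpha1 GitRepo spec):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: example-repo        # placeholder name
  namespace: fleet-default  # default Fleet workspace namespace
spec:
  repo: https://example.com/org/gitops-repo  # placeholder URL
  branch: main
  pollingInterval: 15s      # how often Fleet polls the repo for changes
```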
And do you use a "rolloutStrategy" with partitions?
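For context, a rolloutStrategy with partitions is usually declared in the fleet.yaml next to the deployed manifests; a rough sketch with placeholder names and labels (field names follow the fleet.yaml rolloutStrategy options):

```yaml
# fleet.yaml (placed alongside the chart/manifests in the git repo)
defaultNamespace: example-app   # placeholder
rolloutStrategy:
  maxUnavailable: 10%           # limit how many clusters update at once
  partitions:
    - name: canary              # placeholder partition name
      maxUnavailable: 1
      clusterSelector:
        matchLabels:
          env: canary           # placeholder label
```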
The status of the affected bundle would be interesting. Bundle resources are created by the gitjob. Then the fleet-controller creates a bundledeployment and a content resource for each bundle/cluster combination. It would be interesting to see if those are created correctly for deployments that are stuck in WaitApplied. The bundledeployment references a content resource, and the reference should be to the latest one.
Only then the agent would be able to deploy the bundle on a downstream cluster.
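For reference, the chain described above can be inspected from the upstream cluster; a rough sketch assuming the default fleet-default workspace (resource names in angle brackets are placeholders):

```shell
# List bundles and their state in the default workspace
kubectl get bundles -n fleet-default

# List the per-cluster bundledeployments (they live in the cluster namespaces)
kubectl get bundledeployments -A

# Check which content resource a stuck bundledeployment references
kubectl get bundledeployment <name> -n <cluster-namespace> -o yaml
```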
The calculation of WaitApplied is very involved and called from several places: https://github.com/rancher/fleet/blob/master/pkg/summary/summary.go#L16 Maybe a tuned rollout strategy can help? Are there any failed/modified bundles?
> Are you using webhooks or do you set the pollingInterval for git repos?
We are using pollingInterval, every 5 seconds I believe.
> And do you use a "rolloutStrategy" with partitions?
No, so defaults should apply.
> The status of the affected bundle would be interesting. Bundle resources are created by the gitjob.
The mentioned time (10-15 minutes) is when monitoring the Bundle resource.
> Then the fleet-controller creates a bundledeployment and a content resource for each bundle/cluster combination. It would be interesting to see if those are created correctly for deployments that are stuck in WaitApplied.
Will have a closer look at the BundleDeployment and come back to you.
> Maybe a tuned rollout strategy can help?
cluster partitioning?
Let me describe our setup in a bit more detail.
We have one Rancher/Fleet instance managing 2 cluster groups with a total of 10 clusters.
We have 6 GitRepo objects (one for each cluster).
We also have a "generic" GitRepo that applies changes to all clusters.
The k3s clusters:
Name | Resources | Nodes | Deployments |
---|---|---|---|
Cluster 1 | 425 | 7 | 27 |
Cluster 2 | 3771 | 436 | 1073 |
Cluster 3 | 1561 | 83 | 342 |
Cluster 4 | 1108 | 171 | 220 |
Cluster 5 | 2183 | 231 | 1096 |
Cluster 6 | 348 | 8 | 29 |
This means that the same fleet-controller manages all these clusters. We are only experiencing problems with Cluster 2, which is why I would suspect that it is a downstream problem, but we have not been able to pinpoint it.
> Are there any failed/modified bundles?
Yes, there typically are, as the cluster is in a "fluid state". We have attempted to remove any failing/modified bundles to make the "cluster green", but we did not see any impact on deployment times.
SURE-3711
Let's install a cluster with about a hundred nodes and try to replicate this.
> Yes, there typically are, as the cluster is in a "fluid state". We have attempted to remove any failing/modified bundles to make the "cluster green", but we did not see any impact on deployment times.
That's interesting. I would have expected that to reduce the overall number of events. Probably not enough.
In Fleet < 0.10 the agent reconciler, which installs the bundles on clusters, only has 5 workers to process events. When a gitrepo creates, say, 50 bundles for that agent, it will work on 5 in parallel. (Resources change more than once during an installation, so it's a bit worse than that.)
Fleet 0.10.3/0.10.4 will increase this number. We plan to make it configurable in the future.
@mirzak In case you're still around for this old issue, can you retry once 2.9.3 (Fleet >=0.10.3) is released?
I no longer work on that particular project but @tmartensson might be interested.
Is there an existing issue for this?
Current Behavior
We are experiencing very slow Bundle deployments in one of our largest clusters and we can see that most time is spent in WaitApplied; we see up to 10-15 minutes for a Bundle to transition back to Ready after a modification. When we measure time, we ignore any actions that require fetching container images etc., e.g. changing replicaCount: 0 takes 10-15 minutes to propagate from PR merge in our GitOps repo until it is actually applied in the cluster. It is not clear at the moment where the time is wasted, but we suspect it is the fleet-agent in relation to the scale (number of Bundles) we have.
We have several smaller clusters that do not experience the same problem.
Expected Behavior
Would expect changes to propagate in less than 60 seconds, assuming they do not require fetching new images etc. This should be reasonable?
This is from PR merge on GitOps repo, to cluster state update, and bundle in Ready state again.
Steps To Reproduce
Not clear. This is most likely related to our environment and scale. See more in next section.
Environment
Logs
Anything else?
Based on our observations, we do not seem to be limited by CPU/RAM/network bandwidth on our "master" nodes where fleet-agent is running; we really have monsters of machines :)