philomory opened this issue 3 years ago
Looks like we are running into this error 2 years later with Rancher 2.7.1. Having one downed cluster block the whole process is exactly what we were trying to circumvent with Fleet. Any idea or timeline? Currently I have to change a selector to remove the cluster from the group.
This might be related to default values in the rollout strategy. The defaults are documented in the fleet.yaml reference.
Let's test if this still happens on 2.9.1
It seems to still be happening on 2.9.1-rc3.
Adding some notes about how it is observed on this version:
On step 7 the state is either `Not Ready` or `Modified`. Nevertheless, an error message is displayed:

Error log: `Modified(3) [Bundle repo-r-test-bundle]; deployment.apps test-bundle/test modified {"spec":{"template":{"spec":{"containers":[{"image":"paulbouwer/hello-kubernetes:1.10.1","imagePullPolicy":"IfNotPresent","lifecycle":{"preStart":{"exec":{"command":["sleep","2"]}}},"name":"test","resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}]}}}}`
After step 10, the state is `Wait Applied` directly.

After the 'fix' from step 12, the state of the GitRepo is `Wait Applied`, yet a result similar to the one originally described occurs:

Error: `WaitApplied(1) [Bundle repo-r-test-bundle]; deployment.apps test-bundle/test modified {"spec":{"template":{"spec":{"containers":[{"image":"paulbouwer/hello-kubernetes:1.10.1","imagePullPolicy":"IfNotPresent","lifecycle":{"preStart":{"exec":{"command":["sleep","2"]}}},"name":"test","resources":{},"terminationMessagePath":"/dev/termination-log","terminationMessagePolicy":"File"}]}}}}`
When working on this we should check whether it is related to default values in the rollout strategy; the defaults are documented in the fleet.yaml reference.
The rollout strategy does not seem to be the culprit here, as setting `maxUnavailablePartitions` to `100%` bundle-wide does not change anything.
Reproduced this with Fleet standalone (latest `main`).
The issue here is that, since a bundle deployment's status is updated by the agent living in the bundle deployment's target cluster, a bundle deployment targeting a downstream cluster will not have its status updated once that cluster is offline. With status data being propagated from bundle deployments upwards to bundles and GitRepos, then to clusters and cluster groups, this explains why those resources' statuses keep showing an outdated `Modified` state.
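To illustrate those propagation mechanics, here is a minimal sketch in Go, using hypothetical simplified types rather than Fleet's real API: a bundle's displayed state is aggregated from its bundle deployments, so a single stale bundle deployment state keeps everything above it stale.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for Fleet's BundleDeployment; field
// names here are illustrative, not Fleet's real API.
type BundleDeployment struct {
	Cluster string
	State   string // e.g. "Ready", "Modified", "WaitApplied"
}

// aggregate mimics the upward propagation: a bundle (and, transitively, the
// GitRepo, cluster and cluster group) only shows "Ready" when every one of
// its bundle deployments does.
func aggregate(bds []BundleDeployment) string {
	for _, bd := range bds {
		if bd.State != "Ready" {
			return fmt.Sprintf("%s on %s", bd.State, bd.Cluster)
		}
	}
	return "Ready"
}

func main() {
	bds := []BundleDeployment{
		// cluster-a is offline: its agent last wrote "Modified", and since
		// only that agent updates this status, nothing ever overwrites it.
		{Cluster: "cluster-a", State: "Modified"},
		{Cluster: "cluster-b", State: "Ready"},
	}
	fmt.Println(aggregate(bds)) // "Modified on cluster-a"
}
```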
A solution for this could be to watch a Fleet `Cluster` resource's agent `lastSeen` timestamp, which is also updated by the agent from the downstream cluster, and will therefore no longer be updated once the cluster is offline. Fleet would then need to update the statuses of all bundle deployments in that cluster once more than `$threshold` has elapsed since that `lastSeen` timestamp, so that those status updates are then propagated to the other resources. `$threshold` could be a hard-coded value or left configurable, with a sensible default value (e.g. 15 minutes or more).
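As a rough sketch of that idea (again with hypothetical types and names, not Fleet's actual implementation), the check could look like this:

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified stand-ins for Fleet's types.
type Cluster struct {
	Name          string
	AgentLastSeen time.Time
}

type BundleDeployment struct {
	Cluster string
	State   string
}

// markOfflineDeployments overrides the state of bundle deployments whose
// target cluster's agent has been silent for longer than threshold, so
// that a stale agent-written state is no longer propagated upwards as-is.
func markOfflineDeployments(clusters []Cluster, bds []BundleDeployment, threshold time.Duration, now time.Time) {
	offline := map[string]bool{}
	for _, c := range clusters {
		if now.Sub(c.AgentLastSeen) > threshold {
			offline[c.Name] = true
		}
	}
	for i := range bds {
		if offline[bds[i].Cluster] {
			bds[i].State = "Offline" // hypothetical state name
		}
	}
}

func main() {
	now := time.Now()
	clusters := []Cluster{
		{Name: "cluster-a", AgentLastSeen: now.Add(-30 * time.Minute)}, // silent too long
		{Name: "cluster-b", AgentLastSeen: now.Add(-1 * time.Minute)},
	}
	bds := []BundleDeployment{
		{Cluster: "cluster-a", State: "Modified"}, // stale
		{Cluster: "cluster-b", State: "Ready"},
	}
	markOfflineDeployments(clusters, bds, 15*time.Minute, now)
	fmt.Println(bds) // cluster-a's deployment now reads "Offline" instead of a stale "Modified"
}
```

Whether an offline condition should be a dedicated state or an annotation on the existing one is an open design question; the point is that something must override the last agent-written status once `lastSeen` is too old.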
If any clusters are offline/unavailable, the status of Bundles that get deployed to those clusters can get stuck with misleading/confusing error messages.
Steps to reproduce:
Create a git repository containing the following code:
The state gets stuck at `ErrApplied`, with an error message similar to `error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0].lifecycle): unknown field "preStart" in io.k8s.api.core.v1.Lifecycle`.

After the fix from step 12, the error message `error validating "": error validating data: ValidationError(Deployment.spec.template.spec.containers[0].lifecycle): unknown field "preStart" in io.k8s.api.core.v1.Lifecycle` is still shown, even though the actual repository no longer contains any reference to a `preStart` field.

It is worth noting that, if step 10 is skipped - so that the commit in step 12 (which fixes the error) is the first commit to the repo after cluster A goes offline - then in step 12 the BundleDeployment for A will go to a "Wait Applied" state rather than being stuck in the error state.