We're looking into the duplicated deploy. However, having a job in a bundle is problematic. Here is an older blog post that suggests choosing a random name for the job: https://www.suse.com/c/rancher_blog/rancher-fleet-tips-for-kubernetes-jobs-deployment-strategies-in-continuous-delivery-scenarios/
If the job is idempotent, a random name would work. We're also researching whether jobs can be ignored with bundle diffs.
Thanks for the response. In all cases where I've seen this issue, I'm using the following pattern for Job naming:

```yaml
metadata:
  name: whatever-job-name-{{ .Release.Revision }}
```

Here `{{ .Release.Revision }}` is the Helm revision number, which is incremented on each Helm upgrade. I believe what you're suggesting in the blog post, using something like `{{ randAlphaNum 8 | lower }}`, does not make any difference: the Job name is already unique, and you wouldn't be able to re-deploy a Job with the same name anyway.
When Fleet initiates the Helm upgrade twice, the previous Job instance gets deleted in between. I would be happy to use a workaround that keeps the previous Job instances, but for some reason they are automatically cleaned up (I don't have any TTL set; that would also cause issues). In a way this can be achieved with `helm.sh/resource-policy: keep`, but Fleet conflicts with that too: the bundle then complains about orphaned resources.
Currently I don't think bundle diffs support ignoring Jobs. Similar related issues: https://github.com/rancher/fleet/issues/748 and https://github.com/rancher/fleet/issues/2051
My actual use case for running the Jobs is a Helm post-upgrade hook that notifies our Jenkins instance to start running a test set after a successful deployment. I'm also using Jobs to run database migrations in several backend services. However, I reproduced this issue of duplicated Helm upgrades even with a single simple Job, so it doesn't seem related to using Helm hooks.
@mikmatko After an incredible number of different attempts to get jobs to "not be a problem", I have been setting a Helm hook to remove jobs to prevent collisions between deployments, e.g.:

```yaml
---
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    helm.sh/hook: post-install,post-upgrade
    helm.sh/hook-delete-policy: hook-succeeded,before-hook-creation
```
I added https://github.com/rancher/fleet/issues/2051#issuecomment-2382450978 to the 2.10 milestone, to implement "ignore resources".
Spent a while debugging the duplicated upgrade issue on current HEAD (41d3f52f137fceef766997fc41956ea343a0e0dd).
Here is a horrible workaround that adds a small jitter to `BundleDeploymentReconciler` before it fetches the latest BundleDeployment from the cluster:
```diff
diff --git a/internal/cmd/agent/controller/bundledeployment_controller.go b/internal/cmd/agent/controller/bundledeployment_controller.go
index 4d516227..e6ce4ee1 100644
--- a/internal/cmd/agent/controller/bundledeployment_controller.go
+++ b/internal/cmd/agent/controller/bundledeployment_controller.go
@@ -10,6 +10,7 @@ import (
 	"github.com/rancher/fleet/internal/cmd/agent/deployer/driftdetect"
 	"github.com/rancher/fleet/internal/cmd/agent/deployer/monitor"
 	fleetv1 "github.com/rancher/fleet/pkg/apis/fleet.cattle.io/v1alpha1"
+	"golang.org/x/exp/rand"
 	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"k8s.io/apimachinery/pkg/runtime"
@@ -99,6 +100,10 @@ func (r *BundleDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Req
 	ctx = log.IntoContext(ctx, logger)
 	key := req.String()
+	// add small jitter to avoid duplicated deployments
+	rand.Seed(uint64(time.Now().UnixNano()))
+	time.Sleep(time.Duration(rand.Intn(5)+2) * time.Second)
+
 	// get latest BundleDeployment from cluster
 	bd := &fleetv1.BundleDeployment{}
 	err := r.Get(ctx, req.NamespacedName, bd)
```
With this patch, this condition https://github.com/rancher/fleet/blob/41d3f52f137fceef766997fc41956ea343a0e0dd/internal/cmd/agent/deployer/deployer.go#L102 is true. Without this patch, the condition is not true, and we then hit https://github.com/rancher/fleet/blob/41d3f52f137fceef766997fc41956ea343a0e0dd/internal/cmd/agent/deployer/deployer.go#L155, where the Helm deployment occurs.
I don't really know why. Something causes the reconciler to run twice at around the same time. If both runs fetch the BundleDeployment from the cluster at roughly the same moment, then in both cases `bd.Spec.DeploymentID` and `bd.Status.AppliedDeploymentID` will differ, causing the Helm deployment to be called twice.
With this patch, the small jitter ensures that something has already happened to the BundleDeployment before the other request pokes at it.
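To make that race concrete, here is a minimal, self-contained Go sketch (not Fleet's code; the struct and function names are made up) of two reconciles reading the same not-yet-updated state and both deciding to deploy:

```go
package main

import (
	"fmt"
	"sync"
)

// bundleDeployment is a hypothetical stand-in for the relevant BundleDeployment fields.
type bundleDeployment struct {
	deploymentID        string // desired release content (spec)
	appliedDeploymentID string // what was last deployed (status)
}

// shouldDeploy mirrors the spirit of the deployer.go condition: deploy only
// when the desired ID differs from the applied one.
func shouldDeploy(bd bundleDeployment) bool {
	return bd.deploymentID != bd.appliedDeploymentID
}

func main() {
	bd := bundleDeployment{deploymentID: "v2", appliedDeploymentID: "v1"}

	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func(reconcile int) {
			defer wg.Done()
			snapshot := bd // both reconciles read the same, not-yet-updated state
			if shouldDeploy(snapshot) {
				// In Fleet this would be the Helm upgrade; here we only log it.
				fmt.Printf("reconcile %d: running helm upgrade to %s\n", reconcile, snapshot.deploymentID)
			}
		}(i)
	}
	wg.Wait()
	// Both goroutines print, because neither sees the other's status update
	// before making its decision.
}
```

Both "reconciles" see `deploymentID != appliedDeploymentID` and deploy, which matches the duplicated Helm upgrade described above; the jitter just makes it likely that one status update lands before the second read.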
Since I'm not familiar with the Fleet codebase, I may have misunderstood something. @manno @weyfonk Does this make sense to you?
In any case, in my testing so far, the above patch has worked. I have not seen duplicated deployments ever since.
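As a side note on the workaround itself, the same jitter could be written with the standard library's math/rand, which has been seeded automatically since Go 1.20, so the explicit Seed call isn't needed. A minimal standalone sketch (not part of the actual patch):

```go
package main

import (
	"math/rand"
	"time"
)

// sleepWithJitter pauses for 2-6 seconds before proceeding. Since Go 1.20 the
// global math/rand source is seeded automatically, so no rand.Seed call is required.
func sleepWithJitter() {
	time.Sleep(time.Duration(rand.Intn(5)+2) * time.Second)
}

func main() {
	sleepWithJitter()
}
```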
The duplicate event seems to come from DriftDetect, specifically this bit: https://github.com/rancher/fleet/blob/41d3f52f137fceef766997fc41956ea343a0e0dd/internal/cmd/agent/deployer/driftdetect/driftdetect.go#L115, which causes `bd.Status.SyncGeneration` to differ from `bd.Spec.Options.ForceSyncGeneration`. It looks like the DriftDetect part runs continuously, even if drift detection is not enabled (which it is not in my case).
Looking through issues and PRs, I noted that @weyfonk had experimented with the idea in https://github.com/rancher/fleet/pull/2892. Is there anything blocking skipping drift detection when it's not enabled?
It also seems likely that @manno's https://github.com/rancher/fleet/issues/2916 would fix the issue. It would be great if either approach made it into the next release, thanks :)
> Looking through issues and PRs, I noted that @weyfonk had experimented with the idea in https://github.com/rancher/fleet/pull/2892. Is there anything blocking skipping drift detection when it's not enabled?
That PR would break a few things: while drift correction can be disabled, drift detection is necessary for updating the statuses of resources to reflect that drift has happened. That's why drift detection is never disabled, even when drift correction is.
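To illustrate that split, here is a rough, hypothetical Go sketch (not Fleet's actual types or logic) of why detection has to run even when correction is off:

```go
package main

import "fmt"

// driftHandler is a made-up stand-in for illustrating the separation between
// drift detection and drift correction described above.
type driftHandler struct {
	correctDrift bool
}

// onResourceChanged always detects drift so the status can be reported,
// but only re-applies the desired state when correction is enabled.
func (h driftHandler) onResourceChanged(desired, live string) string {
	if desired == live {
		return "Ready"
	}
	status := "Modified" // detection: the status must reflect that drift happened
	if h.correctDrift {
		// correction: re-apply the desired state (elided in this sketch)
		fmt.Println("re-applying desired state")
	}
	return status
}

func main() {
	h := driftHandler{correctDrift: false}
	// Even with correction disabled, the drift still shows up in the status.
	fmt.Println(h.onResourceChanged("replicas: 3", "replicas: 5")) // prints "Modified"
}
```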
We merged a few PRs. Let's verify if this works.
Looking at the logs, it seems to be fixed in 0.11. Drift detection still refreshes a lot, which may be avoided by storing the resources, a hash, or the Helm manifest ID of the last refresh. The log message regarding "updating status" is not conditional and doesn't indicate that the status was actually updated.
```
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z INFO bundledeployment.helm-deployer.install Installing helm release {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "056cc1b6-6f20-4bbd-8a59-f2d198bc853b", "commit": "aa08d490ac38de26a554377d0b07c339b68cf5a8", "dryRun": false}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment.helmSDK API Version list given outside of client only mode, this list will be ignored {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "056cc1b6-6f20-4bbd-8a59-f2d198bc853b"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z INFO bundledeployment.deploy-bundle Deployed bundle {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "056cc1b6-6f20-4bbd-8a59-f2d198bc853b", "deploymentID": "s-e2a58281f510ab11e18ff12280cd49bebf77ba5fa1b9072a350fb5968f7e0:8eb65214000b26d62f838aefe18329128f2fb59e583c51ab43890a74a4f532f3", "appliedDeploymentID": "", "release": "simple/simple-simple-manifest:1", "DeploymentID": "s-e2a58281f510ab11e18ff12280cd49bebf77ba5fa1b9072a350fb5968f7e0:8eb65214000b26d62f838aefe18329128f2fb59e583c51ab43890a74a4f532f3"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment.drift-detect Refreshing drift detection {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "056cc1b6-6f20-4bbd-8a59-f2d198bc853b", "initialResourceVersion": "1005688"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment Reconcile finished, updating the bundledeployment status {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "056cc1b6-6f20-4bbd-8a59-f2d198bc853b"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment.drift-detect Refreshing drift detection {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "49cedb5b-f68e-4a48-bf43-2dd0647b6046", "initialResourceVersion": "1005700"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment Reconcile finished, updating the bundledeployment status {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "49cedb5b-f68e-4a48-bf43-2dd0647b6046"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG helmSDK getting history for release simple-simple-manifest
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment.drift-detect Refreshing drift detection {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "c6d99d1c-5d3a-4a6c-a314-ec63f35a4f1c", "initialResourceVersion": "1005701"}
fleet-agent-0 fleet-agent 2024-10-30T13:52:51Z DEBUG bundledeployment Reconcile finished, updating the bundledeployment status {"controller": "bundledeployment", "controllerGroup": "fleet.cattle.io", "controllerKind": "BundleDeployment", "BundleDeployment": {"name":"simple-simple-manifest","namespace":"cluster-fleet-local-local-1a3d67d0a899"}, "namespace": "cluster-fleet-local-local-1a3d67d0a899", "name": "simple-simple-manifest", "reconcileID": "c6d99d1c-5d3a-4a6c-a314-ec63f35a4f1c"}
```
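A minimal sketch of the "store a hash / manifest ID of the last refresh" idea mentioned above, using simplified, hypothetical types rather than Fleet's real DriftDetect:

```go
package main

import (
	"fmt"
	"sync"
)

// driftDetect caches the manifest ID of the last refresh per BundleDeployment
// key, so the refresh callback can be skipped when nothing has changed.
// This is a simplified stand-in, not Fleet's actual implementation.
type driftDetect struct {
	mu             sync.Mutex
	lastManifestID map[string]string
}

func newDriftDetect() *driftDetect {
	return &driftDetect{lastManifestID: map[string]string{}}
}

// refresh runs doRefresh only when the manifest ID differs from the one seen
// during the last successful refresh for this key.
func (d *driftDetect) refresh(key, manifestID string, doRefresh func() error) error {
	d.mu.Lock()
	unchanged := d.lastManifestID[key] == manifestID
	d.mu.Unlock()
	if unchanged {
		return nil // nothing changed since the last refresh; skip the callback
	}
	if err := doRefresh(); err != nil {
		return err
	}
	d.mu.Lock()
	d.lastManifestID[key] = manifestID
	d.mu.Unlock()
	return nil
}

func main() {
	d := newDriftDetect()
	do := func() error { fmt.Println("refreshing drift detection"); return nil }
	_ = d.refresh("simple-simple-manifest", "s-e2a5...f7e0", do) // refreshes
	_ = d.refresh("simple-simple-manifest", "s-e2a5...f7e0", do) // skipped, same manifest ID
}
```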
A change would deploy twice.
We made sure the status is not overwritten and switched to an event channel. In particular, the new event channel controller (#2942) should help against the re-deploys observed in this issue.
| Rancher Version | Fleet Version |
|---|---|
| v2.10.0-alpha5 | fleet:v0.11.0-beta.3 |
Verified with `correctDrift: true` and debug level 5, checking the `fleet-agent-0` pods. Note: intermediate jobs (created by GitJob pods) are getting removed immediately, as per the latest changes in Fleet.
Is there an existing issue for this?
Current Behavior
Helm upgrade is seemingly called twice on every change or force update. This seems to occur most of the time, but not always.
Logs from a downstream cluster `fleet-agent-0` pod.

Working scenario, Helm deployment is called only once:
Then a bit later, after pushing Force Update through the Rancher UI (the same occurs on a single new commit too):

As can be seen from the logs, the Helm upgrade is called twice. As a result, Fleet thinks that a `Job` object is suddenly missing:

`job.batch mikko-debug/mikko-debug-job-87 missing`

While Fleet did two Helm upgrade operations, it seems to still think that it had done only one, hence looking for an object from the previous Helm release. This leaves the Bundle in a `modified` state.

Expected Behavior
Helm upgrade is called only once per change.
Steps To Reproduce
I believe this issue can be seen with any chart, but it is more apparent if you have any `Job` in the chart. Doesn't seem to matter what options are provided in `fleet.yaml` etc.

Environment
Logs
No response
Anything else?
No response