wave-k8s / wave

Kubernetes configuration tracking controller
Apache License 2.0

Wave causes an extra, short lived replica to be created in every deployment #111

Closed andrew-farries closed 2 years ago

andrew-farries commented 2 years ago

Wave is causing an extra, short-lived replica to be added to every deployment to which the wave.pusher.com/update-on-config-change: "true" annotation is added.


Deployment manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    wave.pusher.com/update-on-config-change: "true"
  labels:
    app: bbox
  name: bbox
spec:
  replicas: 1
  selector:
    matchLabels:
      app: bbox
  template:
    metadata:
      labels:
        app: bbox
    spec:
      containers:
      - command:
        - /bin/sh
        - -c
        - while true; do sleep 3600; done
        image: busybox
        name: busybox
```

Steps to replicate this:

First, install wave:

helm install wave wave-k8s/wave

Output of helm ls

NAME    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART           APP VERSION
wave    wave-test       1               2021-09-16 09:46:55.862965 +0100 BST    deployed        wave-2.0.0      v0.5.0

Apply the yaml file above:

kubectl apply -f deployment.yaml

I now see two pods start up, one of which almost immediately enters a Terminating state:

NAME                        READY   STATUS        RESTARTS   AGE   LABELS
bbox-57689566cf-8tc7p       1/1     Running       0          8s    app=bbox,pod-template-hash=57689566cf
bbox-68d685577f-p8s89       1/1     Terminating   0          8s    app=bbox,pod-template-hash=68d685577f

The pod in the Terminating state does not have the wave annotation; the Running one does.

Wave logs:

I0916 10:00:24.191390       1 handler.go:108] wave "level"=0 "msg"="Updating instance hash"  "hash"="100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d" "name"="bbox" "namespace"="wave-test"

Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0", GitCommit:"e19964183377d0ec2052d1f1fa930c4d7575bd50", GitTreeState:"clean", BuildDate:"2020-08-26T14:30:33Z", GoVersion:"go1.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.16", GitCommit:"7a98bb2b7c9112935387825f2fce1b7d40b76236", GitTreeState:"clean", BuildDate:"2021-02-17T11:52:32Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

If I increase the replica count of the deployment to 2, I get four pods starting, two of which immediately enter a terminating state; as before, the two 'extras' don't have the wave annotation.

Removing the wave.pusher.com/update-on-config-change: "true" annotation from the deployment results in a deployment with no short-lived replicas, as expected.

wonderhoss commented 2 years ago

Could you please share the contents of your deployment.yaml?

The behaviour you observe is expected if the Deployment and any of the ConfigMaps or Secrets it consumes are updated at the same time. Could that be happening here?

andrew-farries commented 2 years ago

The content of deployment.yaml is in the issue (see the Deployment manifest section above).

The deployment is not actually referencing any ConfigMaps or Secrets; merely adding the wave annotation to the deployment is enough to trigger this strange behaviour.

wonderhoss commented 2 years ago

I think I see now. This should only happen the first time the wave.pusher.com/update-on-config-change: "true" annotation is added, and in that case it is expected. Adding the annotation causes a new ReplicaSet to be created (this is the case for any modification made to a Deployment). Now, with the annotation present, wave will take over managing the Deployment and add its own annotation containing a hash of the current configuration. This is used to track any changes to this configuration going forward.
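For illustration, a minimal sketch of the relevant fragment of the Deployment after wave takes over. The annotation key (wave.pusher.com/config-hash) and its placement on the pod template are assumptions, not confirmed by the issue; only the hash value comes from the wave log output above.

```yaml
# Sketch only: annotation key and placement are assumptions; the hash value
# is taken from the wave log line in the issue. A pod-template change like
# this is what produces the second ReplicaSet and the short-lived pod.
spec:
  template:
    metadata:
      annotations:
        wave.pusher.com/config-hash: "100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d"
```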

In your case, the Deployment does not actually reference any config objects, so this is somewhat meaningless. If it did consume a ConfigMap, for example, any change to its contents would then cause wave to update the hash annotation and thus trigger a rollout.
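To make that concrete, a hedged sketch: if the bbox Deployment did reference a ConfigMap (the name bbox-config below is hypothetical, not from the issue), editing that ConfigMap would change wave's hash and trigger exactly this kind of rollout.

```yaml
# Hypothetical example, not part of the original issue: the bbox container
# consuming a ConfigMap via envFrom. With the wave annotation present,
# editing bbox-config would update wave's hash annotation and roll the pods.
spec:
  template:
    spec:
      containers:
      - name: busybox
        image: busybox
        envFrom:
        - configMapRef:
            name: bbox-config
```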

The additional Pods created when wave is enabled on a Deployment that already exists should not cause any problems, and the extra rollout is unavoidable.

andrew-farries commented 2 years ago

Thanks for your quick responses on this.

It's good to know that what I'm seeing is expected, but I'm not sure I agree that this behaviour is harmless, at least for my use case.

I'm working with deployments of 1 replica, where each replica is a stateful database and can be very resource-heavy in terms of memory and CPU requests. Scheduling an extra pod, even for a short time, is probably not acceptable in that case. That's before factoring in the chance that both pods land on the same node and race/overwrite each other on the same PVC at startup, resulting in neither of them being able to start successfully.

If this behaviour is unavoidable, I guess I'll have to look elsewhere for a solution for rolling these pods in response to secret mount changes.

wonderhoss commented 2 years ago

This certainly is one of the challenges of running large, stateful workloads on Kubernetes.

The way wave is designed, the behaviour you have observed is sadly indeed not avoidable.

That said, it won't occur in steady state but only when a Deployment is first created or the wave annotation is first added. After that initial double rollout, you should expect to see exactly one rollout per config change, although I appreciate that may not be good enough for your use case.

andrew-farries commented 2 years ago

Unfortunately, in the system I'm working with the creation/deletion of these deployments is highly dynamic, so this initial double rollout probably isn't going to work for us.

Thanks for your help on this 👍

sherifabdlnaby commented 2 years ago

@gargath What if Wave added its hash annotation only after the first time a secret/configmap is updated?

github-actions[bot] commented 2 years ago

This issue has been inactive for 60 days. If the issue is still relevant please comment to re-activate the issue. If no action is taken within 7 days, the issue will be marked closed.

jabdoa2 commented 1 month ago

For anyone interested: this has now been fixed. With Wave 0.6+ you can enable the webhook, which eliminates the additional short-lived replica in almost every case.