wave-k8s / wave

Kubernetes configuration tracking controller
Apache License 2.0

Mutating webhooks #155

Closed jabdoa2 closed 1 month ago

jabdoa2 commented 1 month ago

Implement (optional) mutating webhooks to reduce the number of restarts of new deployments when ConfigMaps and Secrets already exist.

Additional changes (based on observations during testing):

Everything is tested. Let me know if you want any changes.

On top of #154.

jabdoa2 commented 1 month ago

I tested this on a cluster and it works fine. Fixed two issues which I noticed during testing.

jabdoa2 commented 1 month ago

This should be ready to review now. It's working with and without webhooks, and we have tests verifying both using helm in minikube.

jabdoa2 commented 1 month ago

> Let me get this straight: before this commit, a deployment referencing a non-existent CM would be stuck in ContainerCreating, but wave could start watching for the CM immediately and add the hash once it exists. After this PR, wave holds the pods in Pending until the CM exists, so that they never run without the annotation?

Multiple cases:

  1. Before
  2. Without Webhooks (still an option after the change)
  3. With Webhooks

a) All required CMs/secrets exist before the deployment
b) An optional secret is created after (a)
c) At least one required CM/secret does not exist before the deployment (typical helm install case)

a) All required CMs/secrets exist before the deployment

1 + 2: Before and Without webhooks

  3. With webhooks
    • When deployment is created the webhook adds the hash as annotation
    • Deployment is scheduled with annotation and pods are created
    • Wave reconciles the deployment and starts watching all children
    • No restarts occur
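The patch a mutating webhook would return for this case can be sketched roughly as below. This is a hypothetical illustration, not wave's actual Go implementation: the annotation key, the hash function, and the patch path are all assumptions.

```python
# Hypothetical sketch: the AdmissionReview response a mutating webhook could
# return to add a config-hash annotation to a Deployment's pod template on
# creation. Annotation key and hash scheme are illustrative assumptions.
import base64
import hashlib
import json

ANNOTATION = "wave.pusher.com/config-hash"  # assumed key, for illustration only

def hash_children(children: dict) -> str:
    """Deterministic hash over the data of all referenced ConfigMaps/Secrets."""
    payload = json.dumps(children, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def admission_response(uid: str, children: dict) -> dict:
    """Build an AdmissionReview response carrying a base64-encoded JSONPatch."""
    patch = [{
        "op": "add",
        "path": "/spec/template/metadata/annotations",
        "value": {ANNOTATION: hash_children(children)},
    }]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the hash is present from the moment the Deployment is admitted, the pods are created with the final annotation and no rolling restart is needed later.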

b) An optional secret is created

  1. Before
    • Wave reconciles every deployment every 10 minutes; when that happens, the hash is updated

2 + 3: With or Without webhooks

c) At least one required CM/secret does not exist before the deployment (typical helm install case)

  1. Before

    • Deployment is scheduled without annotation and pods are created
    • Pods are stuck in ContainerCreating (due to missing CM/secret)
    • Wave fails to read all required children and runs into exponential backoff while retrying
    • CM is added
    • Pods start up with the latest CM and are basically up to date
    • Eventually, wave reconciles (can take quite a while due to exponential backoff) and edits the deployment to add the annotation
    • A rolling update is performed by the deployment controller
    • All pods are recreated and restarted
  2. Without webhooks

    • Deployment is scheduled without annotation and pods are created
    • Pods are stuck in ContainerCreating (due to missing CM/secret)
    • Wave starts watching all children (including non-existing CMs/secrets)
    • CM is added
    • Pods start up with the latest CM and are basically up to date
    • Wave instantly starts a reconcile (due to our watch) and edits the deployment to add the annotation
    • A rolling update is performed by the deployment controller
    • All pods are recreated and restarted
  3. With webhooks

    • When the deployment is created, the webhook replaces the kubernetes (pod) scheduler and adds an annotation to store the previous scheduler
    • Pods are created but stay Pending as there is no scheduler that can schedule them
    • Wave reconciles the deployment and starts watching all children (including non-existing CMs/secrets)
    • CM is added
    • Due to the watch, wave starts reconciling the deployment. It notices that all children exist, adds the hash annotation, and restores the previous scheduler
    • A rolling update is performed by the deployment controller
    • Pending pods can be instantly replaced with new pods
    • All pods get created and start up
    • No restarts occur
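The "scheduler trick" in case (c.3) can be sketched as a pair of patches applied to the pod template: park the pods behind a placeholder scheduler on admission, then restore the real one once all children exist. The annotation key and placeholder name below are illustrative guesses, not wave's actual identifiers.

```python
# Hypothetical sketch of the scheduler swap. On admission, the real
# schedulerName is remembered in an annotation (key is an assumption) and
# replaced with a placeholder that no scheduler answers to, so pods stay
# Pending. On reconcile, the original scheduler is restored.
import copy

SAVED_SCHEDULER_ANNOTATION = "wave.pusher.com/saved-scheduler"  # assumed key
PLACEHOLDER_SCHEDULER = "invalid"  # placeholder name discussed in this PR

def park(pod_template: dict) -> dict:
    """Swap in the placeholder scheduler so new pods stay Pending."""
    t = copy.deepcopy(pod_template)
    anns = t.setdefault("metadata", {}).setdefault("annotations", {})
    anns[SAVED_SCHEDULER_ANNOTATION] = t["spec"].get(
        "schedulerName", "default-scheduler"
    )
    t["spec"]["schedulerName"] = PLACEHOLDER_SCHEDULER
    return t

def restore(pod_template: dict) -> dict:
    """Put the original scheduler back once all children exist."""
    t = copy.deepcopy(pod_template)
    anns = t.get("metadata", {}).get("annotations", {})
    t["spec"]["schedulerName"] = anns.pop(
        SAVED_SCHEDULER_ANNOTATION, "default-scheduler"
    )
    return t
```

Because the parked pods never started, replacing them with freshly annotated pods is a creation, not a restart.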

To summarize:

We could decide to make the scheduler trick optional. Webhooks would still have advantages for case (a). However, for us case (c) is far more common: we often install helm charts, so this situation comes up all the time.
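The "can take quite a while" in case (c.1) comes from the controller's per-item exponential backoff. A rough sketch of those delays, assuming client-go's common defaults of a 5ms base and a 1000s cap (the exact values depend on the controller-runtime version in use):

```python
# Sketch of per-item exponential failure backoff as used by client-go's
# default rate limiter. Base and cap are assumed defaults (5ms / 1000s).
def backoff_delay(failures: int, base: float = 0.005, cap: float = 1000.0) -> float:
    """Delay in seconds before the next retry after `failures` prior failures."""
    return min(base * (2 ** failures), cap)

delays = [backoff_delay(n) for n in range(20)]
# The delay doubles on every failed reconcile until it saturates at the cap,
# so after many failures the next retry can be more than 16 minutes away.
```

This is why, without a watch on the missing CM/secret, the annotation can land long after the CM appears.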

toelke commented 1 month ago

Thank you for that detailed answer.

LGTM, but: Is the scheduler name `invalid` somehow reserved? Could I have a stupid cluster with a valid scheduler called `invalid`? If yes, I think I would prefer to call the invalid scheduler `wave.pusher.com/invalid`.

jabdoa2 commented 1 month ago

> Thank you for that detailed answer.
>
> LGTM, but: Is the scheduler name `invalid` somehow reserved? Could I have a stupid cluster with a valid scheduler called `invalid`? If yes, I think I would prefer to call the invalid scheduler `wave.pusher.com/invalid`.

It's not reserved. Let me test whether that would be a valid name.
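One quick way to check the suggested name: many Kubernetes name fields are validated as RFC 1123 subdomains. Whether `schedulerName` uses exactly this validation is an assumption worth verifying against the API server, but if it does, a value containing `/` would be rejected:

```python
# Assumption: schedulerName is validated like a DNS-1123 subdomain (as many
# Kubernetes name fields are). Under that rule, "wave.pusher.com/invalid"
# would fail because "/" is not an allowed character.
import re

DNS1123_SUBDOMAIN = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)

def is_dns1123_subdomain(name: str) -> bool:
    """True if `name` matches the RFC 1123 subdomain shape and length limit."""
    return len(name) <= 253 and bool(DNS1123_SUBDOMAIN.match(name))
```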

toelke commented 1 month ago

Should we do a release today or do you have any more changes in the pipeline?

jabdoa2 commented 1 month ago

> Should we do a release today or do you have any more changes in the pipeline?

I would love a release today. We want this change in our clusters :-).