wave-k8s / wave

Kubernetes configuration tracking controller
Apache License 2.0

Mutating webhooks #155

Closed jabdoa2 closed 1 month ago

jabdoa2 commented 1 month ago

Implement (optional) mutating webhooks to reduce the number of restarts of new deployments when ConfigMaps and Secrets already exist.

Additional changes (based on observations during testing):

Everything is tested. Let me know if you want any changes.

On top of #154.

jabdoa2 commented 1 month ago

I tested this on a cluster and it works fine. Fixed two issues which I noticed during testing.

jabdoa2 commented 1 month ago

This should be ready to review now. It's working with and without webhooks, and we have tests verifying both using helm in minikube.

jabdoa2 commented 1 month ago

> Let me get this straight: before this commit, a deployment referencing a non-existent CM would be stuck in ContainerCreating, but wave could start watching for the CM immediately and add the hash once it exists. After this PR, wave holds the pods in Pending until the CM exists, so that they never run without the annotation?

Multiple cases:

  1. Before
  2. Without Webhooks (still an option after the change)
  3. With Webhooks

a) All required CMs/secrets exist before the deployment
b) An optional secret is created after (a)
c) At least one required CM/secret does not exist before the deployment (typical helm install case)

a) All required CMs/secrets exist before the deployment

1 + 2: Before and Without webhooks

  3. With webhooks
    • When deployment is created the webhook adds the hash as annotation
    • Deployment is scheduled with annotation and pods are created
    • Wave reconciles the deployment and starts watching all children
    • No restarts occur
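The patch a mutating webhook would return for this case can be sketched roughly as below. This is a hypothetical illustration, not wave's actual Go implementation: the annotation key, the hash function, and the patch path are all assumptions.

```python
# Hypothetical sketch: the AdmissionReview response a mutating webhook could
# return to add a config-hash annotation to a Deployment's pod template on
# creation. Annotation key and hash scheme are illustrative assumptions.
import base64
import hashlib
import json

ANNOTATION = "wave.pusher.com/config-hash"  # assumed key, for illustration only

def hash_children(children: dict) -> str:
    """Deterministic hash over the data of all referenced ConfigMaps/Secrets."""
    payload = json.dumps(children, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def admission_response(uid: str, children: dict) -> dict:
    """Build an AdmissionReview response carrying a base64-encoded JSONPatch."""
    patch = [{
        "op": "add",
        "path": "/spec/template/metadata/annotations",
        "value": {ANNOTATION: hash_children(children)},
    }]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": uid,
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }
```

Because the hash is present from the moment the Deployment is admitted, the pods are created with the final annotation and no rolling restart is needed later.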

b) An optional secret is created

  1. Before
    • Wave reconciles every deployment every 10 minutes; when that happens, the hash is updated

2 + 3: With or Without webhooks

c) At least one required CM/secret does not exist before the deployment (typical helm install case)

  1. Before

    • Deployment is scheduled without annotation and pods are created
    • Pods are stuck in ContainerCreating (due to missing CM/secret)
    • Wave fails to read all required children and runs into exponential backoff while retrying
    • CM is added
    • Pods start up with the latest CM and are basically up to date
    • Eventually, wave reconciles (can take quite a while due to exponential backoff) and edits the deployment to add the annotation
    • A rolling update is performed by the deployment controller
    • All pods are recreated and restarted
  2. Without webhooks

    • Deployment is scheduled without annotation and pods are created
    • Pods are stuck in ContainerCreating (due to missing CM/secret)
    • Wave starts watching all children (including non-existing CMs/secrets)
    • CM is added
    • Pods start up with the latest CM and are basically up to date
    • Wave instantly starts a reconcile (due to our watch) and edits the deployment to add the annotation
    • A rolling update is performed by the deployment controller
    • All pods are recreated and restarted
  3. With webhooks

    • When the deployment is created, the webhook replaces the kubernetes (pod) scheduler and adds an annotation to store the previous scheduler
    • Pods are created but stay Pending as there is no scheduler that can schedule them
    • Wave reconciles the deployment and starts watching all children (including non-existing CMs/secrets)
    • CM is added
    • Due to the watch, wave starts reconciling the deployment. It notices that all children exist, adds the hash annotation, and restores the previous scheduler
    • A rolling update is performed by the deployment controller
    • Pending pods can be instantly replaced with new pods
    • All pods get created and start up
    • No restarts occur
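The "scheduler trick" in case (c.3) can be sketched as a pair of patches applied to the pod template: park the pods behind a placeholder scheduler on admission, then restore the real one once all children exist. The annotation key and placeholder name below are illustrative guesses, not wave's actual identifiers.

```python
# Hypothetical sketch of the scheduler swap. On admission, the real
# schedulerName is remembered in an annotation (key is an assumption) and
# replaced with a placeholder that no scheduler answers to, so pods stay
# Pending. On reconcile, the original scheduler is restored.
import copy

SAVED_SCHEDULER_ANNOTATION = "wave.pusher.com/saved-scheduler"  # assumed key
PLACEHOLDER_SCHEDULER = "invalid"  # placeholder name discussed in this PR

def park(pod_template: dict) -> dict:
    """Swap in the placeholder scheduler so new pods stay Pending."""
    t = copy.deepcopy(pod_template)
    anns = t.setdefault("metadata", {}).setdefault("annotations", {})
    anns[SAVED_SCHEDULER_ANNOTATION] = t["spec"].get(
        "schedulerName", "default-scheduler"
    )
    t["spec"]["schedulerName"] = PLACEHOLDER_SCHEDULER
    return t

def restore(pod_template: dict) -> dict:
    """Put the original scheduler back once all children exist."""
    t = copy.deepcopy(pod_template)
    anns = t.get("metadata", {}).get("annotations", {})
    t["spec"]["schedulerName"] = anns.pop(
        SAVED_SCHEDULER_ANNOTATION, "default-scheduler"
    )
    return t
```

Because the parked pods never started, replacing them with freshly annotated pods is a creation, not a restart.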

To summarize:

We could decide to make the scheduler trick optional. Webhooks would still have advantages for case (a). However, for us case (c) is far more common: we often install helm charts, so this situation comes up all the time.
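The "can take quite a while" in case (c.1) comes from the controller's per-item exponential backoff. A rough sketch of those delays, assuming client-go's common defaults of a 5ms base and a 1000s cap (the exact values depend on the controller-runtime version in use):

```python
# Sketch of per-item exponential failure backoff as used by client-go's
# default rate limiter. Base and cap are assumed defaults (5ms / 1000s).
def backoff_delay(failures: int, base: float = 0.005, cap: float = 1000.0) -> float:
    """Delay in seconds before the next retry after `failures` prior failures."""
    return min(base * (2 ** failures), cap)

delays = [backoff_delay(n) for n in range(20)]
# The delay doubles on every failed reconcile until it saturates at the cap,
# so after many failures the next retry can be more than 16 minutes away.
```

This is why, without a watch on the missing CM/secret, the annotation can land long after the CM appears.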

toelke commented 1 month ago

Thank you for that detailed answer.

LGTM, but: Is the scheduler name `invalid` somehow reserved? Could I have a stupid cluster with a valid scheduler called `invalid`? If yes, I think I would prefer to call the invalid scheduler `wave.pusher.com/invalid`.

jabdoa2 commented 1 month ago

> Thank you for that detailed answer.
>
> LGTM, but: Is the scheduler name `invalid` somehow reserved? Could I have a stupid cluster with a valid scheduler called `invalid`? If yes, I think I would prefer to call the invalid scheduler `wave.pusher.com/invalid`.

It's not reserved. Let me test whether that would be a valid name.
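One quick way to check the suggested name: many Kubernetes name fields are validated as RFC 1123 subdomains. Whether `schedulerName` uses exactly this validation is an assumption worth verifying against the API server, but if it does, a value containing `/` would be rejected:

```python
# Assumption: schedulerName is validated like a DNS-1123 subdomain (as many
# Kubernetes name fields are). Under that rule, "wave.pusher.com/invalid"
# would fail because "/" is not an allowed character.
import re

DNS1123_SUBDOMAIN = re.compile(
    r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$"
)

def is_dns1123_subdomain(name: str) -> bool:
    """True if `name` matches the RFC 1123 subdomain shape and length limit."""
    return len(name) <= 253 and bool(DNS1123_SUBDOMAIN.match(name))
```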

toelke commented 1 month ago

Should we do a release today or do you have any more changes in the pipeline?

jabdoa2 commented 1 month ago

> Should we do a release today or do you have any more changes in the pipeline?

I would love a release today. We want this change in our clusters :-).