grafana / rollout-operator

Kubernetes Rollout Operator
Apache License 2.0
140 stars 19 forks source link

Prepare downscale webhook #47

Closed JordanRushing closed 1 year ago

JordanRushing commented 1 year ago

This PR follows https://github.com/grafana/rollout-operator/pull/25 and implements a prep_downscale mutating admission webhook that, after receiving a request that scales down a resource, sends a HTTP POST to pods in a StatefulSet/Deployment/ReplicaSet that are labeled with grafana.com/prep-downscale: true.

The endpoint targets for the requests are crafted by extracting and combining the grafana.com/prep-downscale-http-path and grafana.com/prep-downscale-http-port labels along with the diff in the number of replicas.

An example scenario for when this is useful:

colega commented 1 year ago

👋 Thank you for taking this task.

Did a quick first review, and have some feedback:

CLAassistant commented 1 year ago

CLA assistant check
All committers have signed the CLA.

MichelHollands commented 1 year ago

This comment still applies:

We should add an annotation to the downscaled resource saying when it was downscaled last time, so we can validate that and disallow downscaling other zones when one of the zones was recently downscaled.

We're doing this to make possible the automation of downscaling, and we need to have the minimum safety net here that will prevent zones from being downscaled with a too small interval.

@colega By downscaled resource you mean the stateful set?

colega commented 1 year ago

@MichelHollands

@colega By downscaled resource you mean the stateful set?

Yes, I mean the statefulset. We have 3 zones, and we need to wait a safety period of time between downscaling each one of them (the three statefulsets sharing the zone label).

Maybe this isn't a requirement for Loki and we should implement it optional for Mimir? Anyway, I'd define the waiting period as a label on the statefulset too (grafana.com/min-time-between-zones-downscale: 12h)

(OMG it's been a long day, and I edited your comment instead of replying :facepalm:)

MichelHollands commented 1 year ago

Yes, I mean the statefulset. We have 3 zones, and we need to wait a safety period of time between downscaling each one of them (the three statefulsets sharing the zone label).

Maybe this isn't a requirement for Loki and we should implement it optional for Mimir? Anyway, I'd define the waiting period as a label on the statefulset too (grafana.com/min-time-between-zones-downscale: 12h)

Yeah, in Loki we don't need this time between downscaling. The rollout operator will make sure only 1 stateful set at a time is scaled up or down. The Loki ingesters will have had the prep-downscale endpoint called. So this works for us.

I'll add something similar to https://github.com/grafana/mimir/pull/4718/files to the Loki ingesters. Then we won't need the pod annotations anymore (which wouldn't work reliably anyways as you mentioned).

colega commented 1 year ago

The rollout operator will make sure only 1 stateful set at a time is scaled up or down.

The rollout operator does not participate in the scaling down operation. (That is why we're doing this as a webhook the manifest change rather than coordinating from the rollout-operator conciliation).

MichelHollands commented 1 year ago

@colega I think I've addressed all the review comments. Can you have another look? The CI failure seems unrelated: the pod is pending. Can you rerun the CI? I don't have permission.

MichelHollands commented 1 year ago

Looks like this is getting close to being merged. Remember to add an entry to the CHANGELOG (and sorry for the previous dependency update conflict!)

The CHANGELOG has been updated as well as the README.md. The e2e tests were fixed. However in the CI the last test fails occasionally. I'm not sure how to fix it at the moment.

MichelHollands commented 1 year ago

The TestExpiringCertificate was a bit flaky. After increasing one timeout there it seems a lot more stable. Can someone merge this? I don't have permission. The same might apply for making a new release.

colega commented 1 year ago

Thank you!