jetstack / navigator

Managed Database-as-a-Service (DBaaS) on Kubernetes
Apache License 2.0
271 stars 31 forks source link

Allow pausing clusters #301

Closed kragniz closed 6 years ago

kragniz commented 6 years ago

What this PR does / why we need it: this allows users to 'pause' a cluster, making navigator skip syncing. This allows for manual intervention in case something goes wrong.

Release note:

Allow pausing clusters
munnerz commented 6 years ago

Interestingly, the deployment controller does not pause scale actions when spec.paused is set. It also does not log an event about it, but instead sets a DeploymentProgressing condition (in this function: https://github.com/kubernetes/kubernetes/blob/1102fd0dcbc4a408045e8d1bc42f056909e72322/pkg/controller/deployment/sync.go#L75).

Given the nature of a scale event in an ES or C cluster, I think we should* also pause scale events in the case of our pause feature, but maybe we want to borrow the behaviour around setting a condition/logging an event (i.e. don't log an event, instead set a condition on the resource).

We've not really utilised conditions up until now, but they do provide a nice and clear way to provide insight into the state of a resource (and save hammering the events API with duplicate events). An event probably shouldn't be used to report the state of objects, but instead used to report information on an action (perhaps not necessarily just 'Actions' as in our definition of an action) that has been taken (whether it has succeeded, or conversely if it errored then reporting some information here too). There may be some information that makes sense to duplicate as both a condition and an event, but I'm not sure. I guess it depends on the nature of the action.

wallrj commented 6 years ago

Some interesting thoughts here on PauseCondition: https://github.com/kubernetes/kubernetes/pull/58465#issuecomment-359593540

kragniz commented 6 years ago

/retest

wallrj commented 6 years ago

The E2E test failures all look like unrelated flakes: ElasticSearch pod didn't become ready, Cassandra node didn't become ready....maybe because the test infrastructure was overloaded.

As discussed, it'd be worth adding some E2E tests for cluster pausing though.

munnerz commented 6 years ago

3/3 failed tests, even if they are on a flake, isn't a healthy state to be in. FWIW, our test infrastructure is quite over-provisioned at the moment so I doubt it's down to overload.

Perhaps we've got some other component here that's flakey?

jetstack-bot commented 6 years ago

@kragniz: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
navigator-e2e-v1-10 c5c798e137e90245dfef3ba924d29dffbae814f9 link /test e2e v1.10

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/devel/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
wallrj commented 6 years ago

/lgtm /approve

jetstack-bot commented 6 years ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wallrj

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/jetstack/navigator/blob/master/OWNERS)~~ [wallrj] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment