Closed kragniz closed 6 years ago
Interestingly, the deployment controller does not pause scale actions when spec.paused
is set. It also does not log an event about it, but instead sets a DeploymentProgressing
condition (in this function: https://github.com/kubernetes/kubernetes/blob/1102fd0dcbc4a408045e8d1bc42f056909e72322/pkg/controller/deployment/sync.go#L75).
Given the nature of a scale event in an ES or C cluster, I think we should* also pause scale events in the case of our pause
feature, but maybe we want to borrow the behaviour around setting a condition/logging an event (i.e. don't log an event, instead set a condition on the resource).
We've not really utilised conditions up until now, but they do provide a nice and clear way to provide insight into the state of a resource (and save hammering the events API with duplicate events). An event probably shouldn't be used to report the state of objects, but instead used to report information on an action (perhaps not necessarily just 'Actions' as in our definition of an action) that has been taken (whether it has succeeded, or conversely if it errored then reporting some information here too). There may be some information that makes sense to duplicate as both a condition and an event, but I'm not sure. I guess it depends on the nature of the action.
Some interesting thoughts here on PauseCondition
: https://github.com/kubernetes/kubernetes/pull/58465#issuecomment-359593540
/retest
The E2E test failures all look like unrelated flakes: ElasticSearch pod didn't become ready, Cassandra node didn't become ready....maybe because the test infrastructure was overloaded.
As discussed, it'd be worth adding some E2E tests for cluster pausing though.
3/3 failed tests, even if they are on a flake, isn't a healthy state to be in. FWIW, our test infrastructure is quite over-provisioned at the moment so I doubt it's down to overload.
Perhaps we've got some other component here that's flakey?
@kragniz: The following test failed, say /retest
to rerun them all:
Test name | Commit | Details | Rerun command |
---|---|---|---|
navigator-e2e-v1-10 | c5c798e137e90245dfef3ba924d29dffbae814f9 | link | /test e2e v1.10 |
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.
/lgtm /approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: wallrj
The full list of commands accepted by this bot can be found here.
The pull request process is described here
What this PR does / why we need it: this allows users to 'pause' a cluster, making navigator skip syncing. This allows for manual intervention in case something goes wrong.
Release note: