Scale down

wallrj commented 4 years ago

Part of #35

Scale down by removing the member with the largest ordinal name.
Updated the E2E tests to clean up pods after each test; otherwise they were failing on Circle CI, I think because the VM was running so slowly. The logs of the namespace's pods are grabbed before the pods are deleted, for diagnostics.
Updated the unit / integration tests to log all of the operator's output rather than only errors. This allowed me to see exactly what the operator was doing during the scale-down integration tests.
Updated the fake etcd client to support removal of members during integration tests.
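
For readers following along, the selection rule in the first item can be sketched roughly as below. This is only an illustration, not the code in this PR: the `<cluster>-<ordinal>` naming scheme and the helper names `ordinalOf` and `memberToRemove` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ordinalOf extracts the trailing integer from a name such as "my-cluster-2".
// The "<cluster>-<ordinal>" naming scheme is an assumption for this sketch.
func ordinalOf(name string) (int, error) {
	i := strings.LastIndex(name, "-")
	if i < 0 || i == len(name)-1 {
		return 0, fmt.Errorf("name %q has no ordinal suffix", name)
	}
	return strconv.Atoi(name[i+1:])
}

// memberToRemove returns the member name with the largest ordinal.
// Ordinals are compared as integers, so "c-10" ranks above "c-9".
func memberToRemove(names []string) (string, error) {
	if len(names) == 0 {
		return "", fmt.Errorf("no members to choose from")
	}
	best, bestOrdinal := "", -1
	for _, name := range names {
		ordinal, err := ordinalOf(name)
		if err != nil {
			return "", err
		}
		if ordinal > bestOrdinal {
			best, bestOrdinal = name, ordinal
		}
	}
	return best, nil
}

func main() {
	name, err := memberToRemove([]string{"my-cluster-0", "my-cluster-2", "my-cluster-1"})
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

Comparing the parsed integers rather than the raw strings matters once a cluster has ten or more members, since a plain string sort would place `my-cluster-10` before `my-cluster-9`.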

wallrj commented 4 years ago

The E2E test failure in https://app.circleci.com/jobs/github/improbable-eng/etcd-cluster-operator/492/parallel-runs/0/steps/0-104 demonstrates that the etcd service is sometimes quite badly disrupted by the scale-down operations, I think because of one and sometimes two leader elections.

We could try to select non-leader members for removal, but this would often leave the member names out of sequence. Another good reason not to use ordinal names, perhaps?
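
To make the sequencing concern concrete, here is a rough sketch of what preferring a non-leader would do to the ordinals. It is a hypothetical model only: the `member` type and the `removeNonLeader` and `contiguous` helpers are invented for illustration and do not correspond to the operator's or etcd's API.

```go
package main

import (
	"fmt"
	"sort"
)

// member is a hypothetical, simplified view of an etcd cluster member.
type member struct {
	Ordinal  int
	IsLeader bool
}

// removeNonLeader drops the highest-ordinal member that is not the leader
// and returns the removed ordinal plus the remaining ordinals, sorted.
// ok is false when there is no non-leader member to remove.
func removeNonLeader(members []member) (removed int, remaining []int, ok bool) {
	removed = -1
	for _, m := range members {
		if !m.IsLeader && m.Ordinal > removed {
			removed = m.Ordinal
		}
	}
	if removed < 0 {
		return 0, nil, false
	}
	for _, m := range members {
		if m.Ordinal != removed {
			remaining = append(remaining, m.Ordinal)
		}
	}
	sort.Ints(remaining)
	return removed, remaining, true
}

// contiguous reports whether sorted ordinals form the sequence 0, 1, 2, ...
func contiguous(ordinals []int) bool {
	for i, o := range ordinals {
		if o != i {
			return false
		}
	}
	return true
}

func main() {
	// The leader happens to hold the largest ordinal.
	members := []member{
		{Ordinal: 0},
		{Ordinal: 1},
		{Ordinal: 2, IsLeader: true},
	}
	removed, remaining, _ := removeNonLeader(members)
	fmt.Println(removed, remaining, contiguous(remaining))
}
```

With the leader on ordinal 2, the sketch removes ordinal 1 and leaves members 0 and 2, which is the gap in the naming sequence described above.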

But I also haven't had time to check whether there would be a leader election anyway. Perhaps leader elections happen as a result of any change in cluster membership, in which case I can remove (or adjust) the E2E cluster availability test.

wallrj commented 4 years ago

Follow-up issue for safe deletion of PVCs during scale-down: https://github.com/improbable-eng/etcd-cluster-operator/issues/101

improbable-eng / etcd-cluster-operator

Scale down #93