K8SPS-359: Fix rebooting cluster if it's partially online

percona / percona-server-mysql-operator

Percona Operator for MySQL

Apache License 2.0

130 stars 26 forks source link

CHANGE DESCRIPTION

Problem: If 2 out of 3 MySQL pods are terminated abruptly, operator fails to reboot the cluster (sporadically).

Cause: Even though failed pods go into crash recovery mode once they come up again, MySQL complains about cluster being already ONLINE when operator tries to run dba.rebootClusterFromCompleteOutage().

Solution: If operator can't run dba.rebootClusterFromCompleteOutage() because cluster is already online, delete all MySQL pods to force recovery.

CHECKLIST

Jira

[x] Is the Jira ticket created and referenced properly?
[x] Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
[x] Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

[x] Is an E2E test/test case added for the new feature/change?
[x] Are unit tests added where appropriate?

Config/Logging/Testability

[x] Are all needed new/changed options added to default YAML files?
[x] Did we add proper logging messages for operator actions?
[x] Did we ensure compatibility with the previous version or cluster upgrade process?
[x] Does the change support oldest and newest supported PS version?
[x] Does the change support oldest and newest supported Kubernetes version?

Test name	Status
version-service	passed
async-ignore-annotations	passed
auto-config	passed
config	passed
config-router	passed
demand-backup	passed
gr-demand-backup	passed
gr-demand-backup-haproxy	passed
gr-finalizer	passed
gr-haproxy	passed
gr-ignore-annotations	passed
gr-init-deploy	passed
gr-one-pod	passed
gr-recreate	passed
gr-scaling	passed
gr-scheduled-backup	passed
gr-security-context	passed
gr-self-healing	passed
gr-tls-cert-manager	passed
gr-users	passed
haproxy	passed
init-deploy	passed
limits	passed
monitoring	passed
one-pod	passed
operator-self-healing	passed
recreate	passed
scaling	passed
scheduled-backup	passed
service-per-pod	passed
sidecars	passed
smart-update	passed
tls-cert-manager	passed
users	passed
We run 34 out of 34

Test name

Status

version-service

percona / percona-server-mysql-operator

K8SPS-359: Fix rebooting cluster if it's partially online #725

CHANGE DESCRIPTION

CHECKLIST