pravega / zookeeper-operator

Kubernetes Operator for Zookeeper
Apache License 2.0

Issue 493 Allow operator to recover from FailedUpgrade #495

Closed lunarfs closed 2 years ago

lunarfs commented 2 years ago

If a node takes longer than the timeout to rejoin the cluster during an upgrade, the operator cannot recover, even if the cluster is healthy. This change adds steps to recover once the cluster is completely upgraded (potentially after manual work).

Change log description

Fix operator stuck in FailedUpgrade, even after the cluster is upgraded and healthy.

Purpose of the change

fixes #493

What the code does

Checks whether the StatefulSet is fully upgraded when the operator is in the UpgradeFailed state. If the cluster is fully upgraded and healthy, it removes the failed state and completes the upgrade.
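
A minimal sketch of that check, not the code from this PR. The types `statefulSetView` and `clusterState` and both function names are hypothetical; the fields only mirror values a StatefulSet reports (`spec.replicas`, `status.readyReplicas`, `status.currentRevision`, `status.updateRevision`). It assumes "fully upgraded and healthy" means every desired replica is ready and the current revision matches the update revision.

```go
package main

import "fmt"

// statefulSetView is a hypothetical, trimmed-down view of a StatefulSet.
// It is not the operator's actual type.
type statefulSetView struct {
	DesiredReplicas int32  // mirrors spec.replicas
	ReadyReplicas   int32  // mirrors status.readyReplicas
	CurrentRevision string // mirrors status.currentRevision
	UpdateRevision  string // mirrors status.updateRevision
}

// clusterState is a hypothetical stand-in for the upgrade-related
// part of the cluster's status.
type clusterState struct {
	UpgradeFailed  bool
	CurrentVersion string
	TargetVersion  string
}

// fullyUpgraded reports whether all desired replicas are ready and
// the rollout has converged on a single revision.
func fullyUpgraded(s statefulSetView) bool {
	return s.DesiredReplicas > 0 &&
		s.ReadyReplicas == s.DesiredReplicas &&
		s.CurrentRevision == s.UpdateRevision
}

// recoverFailedUpgrade clears the failed state and completes the upgrade
// only when the cluster is in the failed state and is fully upgraded.
// It returns true if it changed the state.
func recoverFailedUpgrade(c *clusterState, s statefulSetView) bool {
	if !c.UpgradeFailed || !fullyUpgraded(s) {
		return false
	}
	c.UpgradeFailed = false
	c.CurrentVersion = c.TargetVersion
	c.TargetVersion = ""
	return true
}

func main() {
	// Version and revision strings here are made-up example values.
	state := clusterState{UpgradeFailed: true, CurrentVersion: "old", TargetVersion: "new"}
	sts := statefulSetView{DesiredReplicas: 3, ReadyReplicas: 3, CurrentRevision: "rev-2", UpdateRevision: "rev-2"}

	recovered := recoverFailedUpgrade(&state, sts)
	fmt.Println("recovered:", recovered, "still failed:", state.UpgradeFailed)
}
```

With 2 out of 3 replicas ready, or with differing revisions, `recoverFailedUpgrade` leaves the state untouched, so the failed state stays in place until the cluster is actually healthy.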

How to verify it

1. Make sure that a node does not come back online within the 10-minute timeout during an upgrade.
2. Verify that the cluster is stuck in UpgradeFailed with, for example, 2 out of 3 nodes ready.
3. Upgrade the operator to a build containing this fix, and verify that the upgrade completes.

If you are already running this version, a failed upgrade will recover once the cluster is in a good, upgraded state.