Description
Hi, we found that the operator is unable to recover a broken StatefulSet after a misoperation. For example, if we set the image of the zookeeper cluster to a wrong image, the operator propagates it to the StatefulSet, and the rolling update leaves one pod failing repeatedly with an image pull error. We then noticed the mistake and manually rolled back to the correct image. But the pod keeps failing, even though the StatefulSet has been updated.
We think the root cause is that the operator uses OrderedReady as the podManagementPolicy, and there is a known problem in StatefulSet: https://github.com/kubernetes/kubernetes/issues/67250. With this policy, the StatefulSet controller waits for the broken pod to become Ready before it proceeds, so the pod is never replaced even after the template is fixed. zookeeper-operator is affected by this.
The workaround is to manually delete the failing pod so that the StatefulSet controller can proceed. As far as we know, there is a KEP open to fix this issue: https://github.com/kubernetes/enhancements/pull/3562, but it is still at a very early stage. The best thing for the operator to do here is probably to delete the pod when it can recognize that the pod is stuck. If the KEP is actually implemented and merged, this problem will be much easier to deal with.
Importance
must-have
Location
https://github.com/kubernetes/kubernetes/issues/67250
https://github.com/kubernetes/enhancements/pull/3562
Suggestions for an improvement
Force-restart (delete) the pod when the operator can recognize that it is stuck in an unhealthy state, so that the StatefulSet controller can go on to update the pods.
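To make the suggestion more concrete, here is a rough sketch in Go of how the "is this pod stuck?" check might look. It is not the operator's actual code. `PodState` and `ShouldDeleteStuckPod` are hypothetical names made up for illustration, and the struct is a simplified stand-in for the fields a real implementation would read from the Pod object (the `controller-revision-hash` label, the Ready condition, and the container waiting reason). The target revision would come from the StatefulSet's `status.updateRevision`. Which waiting reasons count as "stuck" is an assumption as well.

```go
package main

// PodState is a hypothetical, simplified view of the pod fields the check needs.
// In a real operator these would be read from the Pod object:
//   RevisionHash  <- metadata.labels["controller-revision-hash"]
//   Ready         <- the Ready condition in status.conditions
//   WaitingReason <- status.containerStatuses[].state.waiting.reason
type PodState struct {
	Name          string
	RevisionHash  string
	Ready         bool
	WaitingReason string
}

// stuckReasons lists the container waiting reasons that this sketch treats as
// "will not recover without a new pod template". The selection is an assumption.
var stuckReasons = map[string]bool{
	"ErrImagePull":     true,
	"ImagePullBackOff": true,
	"CrashLoopBackOff": true,
}

// ShouldDeleteStuckPod reports whether the pod is not Ready, still runs an
// outdated revision, and is waiting for one of the reasons in stuckReasons.
// updateRevision is the StatefulSet's status.updateRevision.
func ShouldDeleteStuckPod(pod PodState, updateRevision string) bool {
	if pod.Ready {
		// A healthy pod is left to the normal rolling update.
		return false
	}
	if pod.RevisionHash == updateRevision {
		// The pod already uses the current template, so deleting it
		// would only recreate the same broken pod.
		return false
	}
	return stuckReasons[pod.WaitingReason]
}
```

The operator could run a check like this during reconciliation and delete the pod when it returns true, which is the same thing the manual workaround does. Comparing the revision first keeps the operator from repeatedly deleting a pod whose current template is itself broken.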