pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

tidb-operator unable to recover an unhealthy cluster even with manual revert #4946

Open. hoyhbx opened this issue 1 year ago

hoyhbx commented 1 year ago

Bug Report

What version of Kubernetes are you using?

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.9", GitCommit:"6df4433e288edc9c40c2e344eb336f63fad45cd2", GitTreeState:"clean", BuildDate:"2022-05-19T19:53:08Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

pingcap/tidb-operator:v1.3.2

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

default

What's the status of the TiDB cluster pods?

NAME                                      READY   STATUS    RESTARTS   AGE     IP            NODE           NOMINATED NODE   READINESS GATES
test-cluster-discovery-779bb58fc7-q2kqv   1/1     Running   0          3m45s   10.244.4.11   kind-worker4   <none>           <none>
test-cluster-pd-0                         1/1     Running   0          3m45s   10.244.1.10   kind-worker3   <none>           <none>
test-cluster-pd-1                         1/1     Running   0          3m45s   10.244.4.12   kind-worker4   <none>           <none>
test-cluster-pd-2                         1/1     Running   0          3m45s   10.244.2.8    kind-worker    <none>           <none>
test-cluster-tidb-0                       2/2     Running   0          2m58s   10.244.2.9    kind-worker    <none>           <none>
test-cluster-tidb-1                       2/2     Running   0          2m58s   10.244.1.12   kind-worker3   <none>           <none>
test-cluster-tidb-2                       0/2     Pending   0          53s     <none>        <none>         <none>           <none>
test-cluster-tikv-0                       2/2     Running   0          3m31s   10.244.1.11   kind-worker3   <none>           <none>
test-cluster-tikv-1                       2/2     Running   0          3m31s   10.244.4.13   kind-worker4   <none>           <none>
test-cluster-tikv-2                       2/2     Running   0          3m31s   10.244.3.8    kind-worker2   <none>           <none>

What did you do?

We started with a healthy TiDB cluster. We then changed the affinity rule for scheduling the TiDB pods; the tidb StatefulSet was updated with the new affinity rule and performed a rolling upgrade. At that point we realized the new affinity rule cannot be satisfied by the current topology of the cluster: one replica can never be scheduled. We then reverted the CR to remove the new affinity rule, but the operator fails to update the tidb StatefulSet, leaving the cluster in an unhealthy state.

What did you expect to see? tidb-operator should be able to recover the cluster after the manual revert.

What did you see instead? tidb-operator refuses to update the StatefulSet until all replicas in the cluster become ready. The log line indicating this behavior is: tidb_cluster_controller.go:124] TidbCluster: default/test-cluster, still need sync: tidbcluster: [default/test-cluster]'s upgraded tidb pod: [test-cluster-tidb-2] is not ready, requeuing

yiduoyunQ commented 1 year ago

Hi, you can first scale in tidb to replicas: 2 so the cluster becomes healthy again and the operator can proceed with its sync logic. After changing the affinity rules, you can scale back out to restore tidb-2 if you want.

hoyhbx commented 1 year ago

Thanks @yiduoyunQ for the suggestion! Is there a way to fix this fundamentally in the operator? Reverting is probably the most intuitive way to recover the cluster after a misoperation. Put differently, what prevents the operator from recognizing that the StatefulSet is never going to become entirely ready and resuming processing of updates to it?

yiduoyunQ commented 1 year ago

Hi @hoyhbx, this error message is reported from https://github.com/pingcap/tidb-operator/blob/v1.4.4/pkg/manager/member/tidb_upgrader.go#L123-L133; it blocks the operator from continuing with upgradeTiDBPod.

The operator's current sync logic processes each component type one by one, in a fixed order, so we need a deterministic condition to decide whether it can continue syncing the next component. The component pod's status is what the upgrade sync logic uses for that check.

For example, if we moved the pod status check to after upgradeTiDBPod, the operator could keep retrying the TiDB pod upgrade indefinitely whenever one or more TiDB pods can never become ready.
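
To make the gating concrete, here is a highly simplified, hypothetical sketch of the kind of per-pod check being described (it is not the actual tidb-operator source; names like upgradeComponentPods are made up for illustration):

package member

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// upgradeComponentPods illustrates the ordered rolling-upgrade gate: pods are
// walked from the highest ordinal down, and the sync stops and requeues as
// soon as an already-upgraded pod is not Ready, which is exactly what blocks
// the reverted spec in this issue.
func upgradeComponentPods(pods []*corev1.Pod, updatedRevision string) error {
	for i := len(pods) - 1; i >= 0; i-- {
		pod := pods[i]
		if pod.Labels["controller-revision-hash"] == updatedRevision {
			// Already on the new revision: it must become Ready before the
			// next pod is touched.
			if !podIsReady(pod) {
				return fmt.Errorf("upgraded pod %s is not ready, requeuing", pod.Name)
			}
			continue
		}
		// ... upgrade pod i (e.g. delete it so the StatefulSet recreates it
		// with the new template) and return to wait for the next sync ...
		return nil
	}
	return nil
}

// podIsReady reports whether the pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}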

What do you think about this?

yiduoyunQ commented 1 year ago

If we are already inside the explicit upgrade logic at https://github.com/pingcap/tidb-operator/blob/v1.4.4/pkg/manager/member/tidb_member_manager.go#L305, do we still need to check the pod's status, or can we just let the upgrade start? @hanlins

hoyhbx commented 1 year ago

Yeah, I think it is good practice to wait for the component to become ready before applying the next upgrade; the operator is doing the right thing there. But this conservative wait is something of a double-edged sword and is what causes this issue. I think the fundamental challenge in fixing it is telling whether:

  1. the Pods are not ready because the upgrade is in progress, and they will eventually become ready, or
  2. the Pods are not ready because of some unsatisfiable condition, e.g. they are not schedulable.

I think the Pod's status may contain enough information to distinguish the two cases. For example, if the Pod has been in the Pending state for over 5 minutes, the operator could conclude that the upgrade has failed. A hardcoded timeout may be flaky, though, because in large clusters it can take a Pod more than 5 minutes to get scheduled.
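
For illustration, a minimal sketch of such a check, assuming only the core Pod API (the function name, the package placement, and any default timeout value are assumptions, not existing operator code):

package member

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// isStuckUnschedulable reports whether the pod has been unschedulable for
// longer than the given timeout, i.e. case 2 above, where the upgrade is
// unlikely to make progress on its own.
func isStuckUnschedulable(pod *corev1.Pod, timeout time.Duration) bool {
	if pod.Status.Phase != corev1.PodPending {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodScheduled &&
			cond.Status == corev1.ConditionFalse &&
			cond.Reason == corev1.PodReasonUnschedulable &&
			time.Since(cond.LastTransitionTime.Time) > timeout {
			return true
		}
	}
	return false
}

A fixed 5-minute value could be passed as the timeout here, with the caveat about large clusters mentioned above.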

csuzhangxc commented 1 year ago

Yes. If we want to apply the next upgrade, we have to decide, for every unhealthy case, whether it's safe to proceed.

hoyhbx commented 1 year ago

@csuzhangxc Do you think this would be a feature request for the StatefulSet controller? Having a field in the StatefulSet status would greatly help operators recognize a failed upgrade.

csuzhangxc commented 1 year ago

@csuzhangxc Do you think this would be a feature request for the StatefulSet controller? Having a field in the StatefulSet status would greatly help operators recognize a failed upgrade.

Yes, this may be helpful.

hoyhbx commented 1 year ago

@yiduoyunQ @csuzhangxc Can we have an intermediate fix for this bug before upstream Kubernetes reacts? We are drafting an issue to send to Kubernetes in parallel.

A potential fix is to recognize the error condition the cluster is in, and proceed to upgrade the StatefulSet once the Pods have been in that error condition for longer than a timeout.

The symptom of this bug is that some Pods are stuck in the "Pending" state with error conditions, e.g.:

{
      "last_probe_time": null,
      "last_transition_time": "2023-04-02T12:10:37+00:00",
      "message": "0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) didn't match Pod's node affinity/selector.",
      "reason": "Unschedulable",
      "status": "False",
      "type": "PodScheduled"
}

Note that the condition carries a last_transition_time field, which can be used to determine how long the Pod has been in this error state. The Pod can also be in other error states that manifest in other conditions, e.g. ContainersReady, Initialized, Ready; all of them carry last_transition_time, which we can use to decide whether the next upgrade should proceed.

We propose adding a condition branch at https://github.com/pingcap/tidb-operator/blob/4b1c55c9aabf0410e276efcb460e37488f203b66/pkg/manager/member/tidb_upgrader.go#L123-L133 to check whether the previous upgrade has failed (by comparing the last_transition_time field in the conditions against a timeout). If the previous upgrade has failed, the operator can proceed to the next upgrade. The timeout could be a conservatively large default, or a field in the CRD so users can configure it. A hedged sketch of what this branch might look like is below.
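
As a sketch only, reusing the podIsReady and isStuckUnschedulable helpers from the earlier sketches (none of these names exist in tidb-operator today, and the real branch would live inside the upgrader logic linked above):

// shouldBlockUpgrade decides whether the upgrader should keep requeuing.
// Instead of unconditionally blocking while an upgraded pod is not Ready,
// a pod that has been stuck past the timeout is treated as a failed
// upgrade, so the sync can continue and a reverted spec can be rolled out.
func shouldBlockUpgrade(pod *corev1.Pod, timeout time.Duration) bool {
	if podIsReady(pod) {
		return false // pod came up normally; safe to move on
	}
	if isStuckUnschedulable(pod, timeout) {
		// The previous upgrade is considered failed (Unschedulable for
		// longer than the timeout); allow the new or reverted template to
		// be applied anyway.
		return false
	}
	// The pod may still be coming up; keep the existing conservative
	// behavior and requeue.
	return true
}

The timeout passed in here is where a conservatively large default, or a value read from a new CRD field, would plug in.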

csuzhangxc commented 1 year ago

@hoyhbx this is something like a force upgrade, right? I think adding a new field would be better and wouldn't break the existing behavior. We would also need to specify under which conditions this behavior takes effect.

hoyhbx commented 1 year ago

@csuzhangxc, yes, a feature like force upgrade would fix this: a boolean force property for each StatefulSet, and if force is set to true, the wait conditions are skipped and the upgrade proceeds directly. A rough sketch of such a field is below.
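
Purely for illustration, such a knob on the TiDB component spec might look like the following; the field name ForceUpgrade and its placement are assumptions, not part of the existing TidbCluster CRD:

// ForceUpgrade is a hypothetical opt-in flag. When true, the upgrader would
// skip the readiness wait discussed above and apply the new StatefulSet
// template directly.
type TiDBSpec struct {
	// ... existing fields ...

	// ForceUpgrade, if true, lets the rolling upgrade proceed even when
	// previously upgraded pods never become Ready.
	ForceUpgrade bool `json:"forceUpgrade,omitempty"`
}

Keeping it as an explicit opt-in leaves the default behavior unchanged, which matches the concern above about not breaking existing behavior.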