pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Setting unsatisfiable affinity requirements leads to broken TiDB cluster #4960


hoyhbx commented 1 year ago

Hey TiDB developers,

We found that if we set spec.tidb.affinity to an affinity requirement that cannot be satisfied by the current cluster, at least one TiDB pod will fail: the StatefulSet controller restarts the pod with the updated affinity requirement, and Kubernetes then has no way to schedule it. More importantly, we found that the TiDB operator cannot recover the cluster from this failure, because it always waits for all pods to become ready before applying the next update (as already reported here: https://github.com/pingcap/tidb-operator/issues/4946).

A concrete example is to run TiDB in a cluster with only 3 nodes and set:

spec:
  tidb:
    replicas: 5
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - test-cluster
          topologyKey: kubernetes.io/hostname

The TiDB pods keep failing to be scheduled because no node in the cluster can satisfy the anti-affinity rule.
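For reference, the stuck pods show up as Pending with FailedScheduling events; the namespace, label selector, and pod name below are placeholders that depend on the actual TidbCluster name:

kubectl -n <namespace> get pods -l app.kubernetes.io/component=tidb
kubectl -n <namespace> describe pod <cluster-name>-tidb-2   # check the Events section for FailedScheduling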

We are not sure whether this should be counted as a bug, or how to fix it, because it is difficult for the operator to tell whether an affinity requirement is unsatisfiable before updating the StatefulSet; the scheduling logic is implemented in the Kubernetes scheduler, not in the operator. However, the consequence of the issue is severe: TiDB pods cannot start, and we found no way to recover from the failure (neither resetting the affinity nor restarting the operator works). Affinity is also not the only property that can break the TiDB StatefulSet; many other properties, such as priorityClassName, can harm the reliability of the cluster when set incorrectly.
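For example (a minimal sketch; the class name below is made up), a priorityClassName referencing a PriorityClass that does not exist in the cluster prevents the replacement pods from being created at admission time, wedging the rolling update in the same way:

spec:
  tidb:
    # no PriorityClass named "nonexistent-priority" exists in the cluster,
    # so the StatefulSet controller cannot create the replacement pods
    priorityClassName: nonexistent-priority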

We want to open this issue separately to discuss the best practice for handling this class of failure, and what functionality Kubernetes could provide to make such validation easier. Is there a way to prevent the bad operation from happening in the first place, or a way for tidb-operator to automatically recognize that the StatefulSet is stuck and perform an automatic recovery? If you know of any practical code fixes for this issue, we are also happy to send a PR.
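As one possible direction (a rough sketch only, not existing tidb-operator code; the function names are ours), the special case of required pod anti-affinity on kubernetes.io/hostname could be caught by a cheap pre-flight check that compares the requested replica count against the number of schedulable nodes before the new StatefulSet spec is pushed, assuming a client-go clientset is available:

package preflight

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// hostnameAntiAffinityFeasible is a hypothetical pre-flight check for one common
// special case: required pod anti-affinity on the kubernetes.io/hostname topology
// key forces every replica onto a distinct node, so the requested replica count
// must not exceed the number of schedulable, Ready nodes. It cannot detect
// arbitrary unsatisfiable affinity rules; only the scheduler can do that in general.
func hostnameAntiAffinityFeasible(ctx context.Context, cs kubernetes.Interface, replicas int32) (bool, error) {
    nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if err != nil {
        return false, err
    }
    schedulable := 0
    for i := range nodes.Items {
        n := &nodes.Items[i]
        if !n.Spec.Unschedulable && nodeIsReady(n) {
            schedulable++
        }
    }
    return int(replicas) <= schedulable, nil
}

// nodeIsReady reports whether the node's Ready condition is True.
func nodeIsReady(n *corev1.Node) bool {
    for _, c := range n.Status.Conditions {
        if c.Type == corev1.NodeReady {
            return c.Status == corev1.ConditionTrue
        }
    }
    return false
}

When such a check fails, the operator could refuse to roll the StatefulSet and surface an event or status condition, rather than letting the update wedge the cluster.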

csuzhangxc commented 1 year ago

As you have said, "it is difficult for the operator to tell whether the affinity requirement is unsatisfiable before updating the statefulset."

In the blocking cases above, can you try deleting the old StatefulSet (but without deleting the pods, letting them become orphans) and let TiDB Operator recreate a new StatefulSet with the correct spec?
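For instance (the namespace and StatefulSet name below are placeholders; the TiDB StatefulSet is normally named <cluster-name>-tidb), the orphan delete can be done with kubectl:

# delete only the StatefulSet object; --cascade=orphan keeps the pods running
kubectl -n <namespace> delete statefulset <cluster-name>-tidb --cascade=orphan

# on kubectl older than v1.20, the equivalent flag is --cascade=false

After the TidbCluster spec is corrected, the operator should recreate the StatefulSet and adopt the orphaned pods.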