pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Setting unsatisfiable affinity requirements leads to broken TiDB cluster #4960


hoyhbx commented 1 year ago

Hey TiDB developers,

We found that if we set spec.tidb.affinity to an affinity requirement that cannot be satisfied by the current cluster, at least one TiDB pod will fail: the StatefulSet controller restarts the pod with the updated affinity requirement, and Kubernetes then has no way to schedule it. More importantly, we found that the TiDB operator cannot recover the cluster from this failure, because it always waits for all pods to become ready before applying the next update (as already reported here: https://github.com/pingcap/tidb-operator/issues/4946).

A concrete example is to run TiDB in a cluster with only 3 nodes and set:

spec:
  tidb:
    replicas: 5
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - test-cluster
          topologyKey: kubernetes.io/hostname

The TiDB pods keep failing to be scheduled because no node in the cluster can satisfy the anti-affinity rule.
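For reference, the stuck pods show up as Pending with FailedScheduling events; the namespace, label selector, and pod name below are placeholders that depend on the actual TidbCluster name:

kubectl -n <namespace> get pods -l app.kubernetes.io/component=tidb
kubectl -n <namespace> describe pod <cluster-name>-tidb-2   # check the Events section for FailedScheduling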

We are not sure whether this should be counted as a bug, or how to fix it, because it is difficult for the operator to tell whether an affinity requirement is unsatisfiable before updating the StatefulSet; the scheduling logic is implemented in the Kubernetes scheduler, not in the operator. However, the consequence of the issue is severe: TiDB pods cannot start, and we found no way to recover from the failure (neither resetting the affinity nor restarting the operator works). Affinity is also not the only property that can break the TiDB StatefulSet; many other properties, such as priorityClassName, can harm the reliability of the cluster when set incorrectly.
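For example (a minimal sketch; the class name below is made up), a priorityClassName referencing a PriorityClass that does not exist in the cluster prevents the replacement pods from being created at admission time, wedging the rolling update in the same way:

spec:
  tidb:
    # no PriorityClass named "nonexistent-priority" exists in the cluster,
    # so the StatefulSet controller cannot create the replacement pods
    priorityClassName: nonexistent-priority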

We want to open this issue separately to discuss the best practice for handling this class of failure, and what functionality Kubernetes could provide to make such validation easier. Is there a way to prevent the bad operation from happening in the first place, or a way for tidb-operator to automatically recognize that the StatefulSet is stuck and perform an automatic recovery? If you know of any practical code fixes for this issue, we are also happy to send a PR.
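As one possible direction (a rough sketch only, not existing tidb-operator code; the function names are ours), the special case of required pod anti-affinity on kubernetes.io/hostname could be caught by a cheap pre-flight check that compares the requested replica count against the number of schedulable nodes before the new StatefulSet spec is pushed, assuming a client-go clientset is available:

package preflight

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// hostnameAntiAffinityFeasible is a hypothetical pre-flight check for one common
// special case: required pod anti-affinity on the kubernetes.io/hostname topology
// key forces every replica onto a distinct node, so the requested replica count
// must not exceed the number of schedulable, Ready nodes. It cannot detect
// arbitrary unsatisfiable affinity rules; only the scheduler can do that in general.
func hostnameAntiAffinityFeasible(ctx context.Context, cs kubernetes.Interface, replicas int32) (bool, error) {
    nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
    if err != nil {
        return false, err
    }
    schedulable := 0
    for i := range nodes.Items {
        n := &nodes.Items[i]
        if !n.Spec.Unschedulable && nodeIsReady(n) {
            schedulable++
        }
    }
    return int(replicas) <= schedulable, nil
}

// nodeIsReady reports whether the node's Ready condition is True.
func nodeIsReady(n *corev1.Node) bool {
    for _, c := range n.Status.Conditions {
        if c.Type == corev1.NodeReady {
            return c.Status == corev1.ConditionTrue
        }
    }
    return false
}

When such a check fails, the operator could refuse to roll the StatefulSet and surface an event or status condition, rather than letting the update wedge the cluster.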

csuzhangxc commented 1 year ago

As you have said, "it is difficult for the operator to tell whether the affinity requirement is unsatisfiable before updating the statefulset."

In the blocking cases above, can you try deleting the old StatefulSet (but without deleting the pods, letting them become orphans) and let TiDB Operator recreate a new StatefulSet with the correct spec?
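For instance (the namespace and StatefulSet name below are placeholders; the TiDB StatefulSet is normally named <cluster-name>-tidb), the orphan delete can be done with kubectl:

# delete only the StatefulSet object; --cascade=orphan keeps the pods running
kubectl -n <namespace> delete statefulset <cluster-name>-tidb --cascade=orphan

# on kubectl older than v1.20, the equivalent flag is --cascade=false

After the TidbCluster spec is corrected, the operator should recreate the StatefulSet and adopt the orphaned pods.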