failover and scaling are blocked if one Pod failed during rolling update

DanielZhangQD commented 4 years ago

Feature Request

Is your feature request related to a problem? Please describe:

Take TiKV as an example. During a TiKV rolling update, if one TiKV Pod fails (e.g. something is wrong with the node it was running on and the Pod cannot be scheduled to any other node), the upgrade gets stuck waiting for this Pod to become ready and its store to be UP. In this case, failover cannot occur because it is blocked by the logic at https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/tikv_upgrader.go#L110-L115. If users want to scale out a new TiKV to increase the replicas, that is also impossible for the same reason.

Describe the feature you'd like:
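
A minimal sketch of the blocking described above may help make it concrete. It is not the operator's actual code: the `Component` type, its fields, and `syncActions` are hypothetical names, and the sketch only assumes that an in-progress upgrade is handled before failover and scaling in each reconcile pass.

```go
package main

import "fmt"

// Component is a hypothetical, simplified model of one component (e.g. TiKV).
type Component struct {
	Upgrading       bool // a rolling update is in progress
	Replicas        int  // current number of Pods
	DesiredReplicas int  // number of Pods the user asked for
	FailedPods      int  // Pods that are not ready and cannot be rescheduled
}

// syncActions returns the actions one reconcile pass would take.
// Assumption: while an upgrade is in progress, the pass returns early,
// so failover and scaling are never reached.
func syncActions(c Component) []string {
	if c.Upgrading {
		if c.FailedPods > 0 {
			// The upgrade waits for the failed Pod, and nothing else runs.
			return []string{"wait for failed Pod to become ready"}
		}
		return []string{"upgrade next Pod"}
	}

	var actions []string
	if c.FailedPods > 0 {
		actions = append(actions, "failover")
	}
	if c.DesiredReplicas > c.Replicas {
		actions = append(actions, "scale out")
	}
	return actions
}

func main() {
	// One Pod failed mid-upgrade and the user also asked for a fourth replica.
	stuck := Component{Upgrading: true, Replicas: 3, DesiredReplicas: 4, FailedPods: 1}
	fmt.Println(syncActions(stuck))

	// The same state outside an upgrade: failover and scale-out both proceed.
	stuck.Upgrading = false
	fmt.Println(syncActions(stuck))
}
```

In this model the failed Pod never becomes ready, so the first case repeats on every pass and neither failover nor scale-out is ever attempted.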

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

DanielZhangQD commented 4 years ago

@weekface @cofyc @Yisaer WDYT about this issue?

DanielZhangQD commented 4 years ago

PD and TiDB should have a similar fix.

pingcap / tidb-operator

failover and scaling are blocked if one Pod failed during rolling update #2739
