kos-team opened this issue 3 weeks ago
Can you show the log of TiFlash? And what is the TidbCluster CR .status after it was scaled in to 0?
Here is the TiFlash log; the key error message is ["failed to start node: StoreTombstone(\"store is tombstone\")"]:
Here is the dump of the TidbCluster CR .status:
It seems we need to use a new PV (delete the PVC/PV after scaling in to 0) so that it does not contain the data of the previous store.
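If it helps, here is a minimal sketch of that workaround using client-go. It assumes the TiFlash PVCs carry the usual tidb-operator labels (app.kubernetes.io/component=tiflash, app.kubernetes.io/instance=<cluster name>); the namespace and cluster name below are placeholders, and running kubectl delete pvc with the same label selector achieves the same thing.

```go
// Workaround sketch: delete the TiFlash PVCs after scaling in to 0 so the next
// scale-out starts from empty volumes instead of reusing tombstoned store data.
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Placeholders: adjust to your namespace and TidbCluster name.
	ns, cluster := "tidb-cluster", "basic"
	selector := fmt.Sprintf("app.kubernetes.io/component=tiflash,app.kubernetes.io/instance=%s", cluster)

	ctx := context.TODO()
	pvcs, err := client.CoreV1().PersistentVolumeClaims(ns).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		log.Fatal(err)
	}
	for _, pvc := range pvcs.Items {
		// Only do this while spec.tiflash.replicas is 0 and the old stores are Tombstone.
		if err := client.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, pvc.Name, metav1.DeleteOptions{}); err != nil {
			log.Fatal(err)
		}
		log.Printf("deleted PVC %s", pvc.Name)
	}
}
```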
Bug Report
What version of Kubernetes are you using? Client Version: v1.31.1, Kustomize Version: v5.4.2
What version of TiDB Operator are you using? v1.6.0
What's the status of the TiDB cluster pods? The TiFlash pods are in the CrashLoopBackOff state.
What did you do? We scaled in TiFlash from 3 to 0 and then scaled it out from 0 to 3.
How to reproduce (see the sketch after the steps):
1. Change spec.tiflash.replicas from 3 to 0.
2. Change spec.tiflash.replicas back to 3.
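Both steps are just edits to the TidbCluster CR, normally done with kubectl edit or by re-applying the manifest. For completeness, here is a sketch of driving the same change programmatically with the Kubernetes dynamic client; the group/version/resource is the TidbCluster CRD's, while the namespace ("tidb-cluster") and cluster name ("basic") are placeholders.

```go
// Reproduction sketch: toggle spec.tiflash.replicas on the TidbCluster CR via
// the Kubernetes dynamic client (equivalent to editing the CR with kubectl).
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var tidbClusterGVR = schema.GroupVersionResource{
	Group:    "pingcap.com",
	Version:  "v1alpha1",
	Resource: "tidbclusters",
}

func setTiFlashReplicas(ctx context.Context, dc dynamic.Interface, ns, name string, replicas int32) error {
	patch := []byte(fmt.Sprintf(`{"spec":{"tiflash":{"replicas":%d}}}`, replicas))
	_, err := dc.Resource(tidbClusterGVR).Namespace(ns).Patch(ctx, name, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dc, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.TODO()
	ns, name := "tidb-cluster", "basic" // placeholders

	// Step 1: scale TiFlash in to 0 and wait until the scale-in completes
	// (the old stores end up in the Tombstone state).
	if err := setTiFlashReplicas(ctx, dc, ns, name, 0); err != nil {
		log.Fatal(err)
	}
	// Step 2: scale back out to 3; the pods then crash-loop as described above.
	if err := setTiFlashReplicas(ctx, dc, ns, name, 3); err != nil {
		log.Fatal(err)
	}
}
```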
What did you expect to see? We expected the TiFlash pods to be running and in the Healthy state.
What did you see instead? The TiFlash pods kept crashing and stayed in the CrashLoopBackOff state.
Root Cause
We think the root cause of this problem is that when TiFlash is scaled in, its stores go into the Tombstone state. After we change spec.tiflash.replicas from 0 to 3, the operator deletes the original StatefulSet and creates a new one with replicas set to 3 instead of scaling the original StatefulSet back up. This behaviour bypasses the ScaleOut function at https://github.com/pingcap/tidb-operator/blob/master/pkg/manager/member/tiflash_scaler.go#L52. After encountering this issue, the user cannot simply delete the CR and apply it again to make TiFlash run correctly, because the operator does not delete the PVCs when the user deletes the CR, so the new cluster reuses the stores that are in the Tombstone state.
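To make that concrete, below is a hypothetical, self-contained sketch (the type and function names are ours, not tidb-operator's) of the kind of tombstone-aware guard a scale-out path can apply, and which is lost when the StatefulSet is recreated from scratch: before a TiFlash ordinal is brought back, the stale PVC holding the old store's data is removed so the pod registers a fresh store instead of failing with "store is tombstone".

```go
// Hypothetical sketch, not tidb-operator code: a tombstone-aware scale-out guard.
package main

import "fmt"

// StoreInfo and PVCDeleter are illustrative stand-ins, not operator APIs.
type StoreInfo struct {
	ID        uint64
	PodName   string
	Tombstone bool
}

type PVCDeleter interface {
	DeletePVCForPod(podName string) error
}

// guardScaleOut deletes stale PVCs for pods whose previous store is tombstoned,
// so the recreated pod starts with an empty volume and a fresh store ID.
func guardScaleOut(stores []StoreInfo, podsToCreate []string, del PVCDeleter) error {
	byPod := make(map[string]StoreInfo, len(stores))
	for _, s := range stores {
		byPod[s.PodName] = s
	}
	for _, pod := range podsToCreate {
		if s, ok := byPod[pod]; ok && s.Tombstone {
			// Recreating the StatefulSet bypasses this step, so the new pod
			// reuses the old volume and crash-loops on startup.
			if err := del.DeletePVCForPod(pod); err != nil {
				return fmt.Errorf("delete PVC for %s: %w", pod, err)
			}
		}
	}
	return nil
}

// logDeleter only prints what would be deleted, for demonstration.
type logDeleter struct{}

func (logDeleter) DeletePVCForPod(pod string) error {
	fmt.Println("would delete PVC for", pod)
	return nil
}

func main() {
	stores := []StoreInfo{
		{ID: 4, PodName: "basic-tiflash-0", Tombstone: true},
		{ID: 5, PodName: "basic-tiflash-1", Tombstone: true},
	}
	// Scaling back out to these pods without the guard reuses the old volumes.
	_ = guardScaleOut(stores, []string{"basic-tiflash-0", "basic-tiflash-1", "basic-tiflash-2"}, logDeleter{})
}
```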