pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

TiKV cluster not respecting CRD replica configuration #963

Closed: mirandaconrado closed this issue 4 years ago

mirandaconrado commented 4 years ago

Bug Report

What version of Kubernetes are you using? 1.11.10 on EKS

What version of TiDB Operator are you using? 1.0.1

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods? gp2

What's the status of the TiDB cluster pods? Running, although the single PD pod did get into a crash loop due to OOM, and we have since increased its memory.

What did you do? We had a setup with 1 PD, 3 TiKV, and 1 TiDB pod, which was working fine. The PD pod was killed for some reason and got into a crash loop backoff due to OOM. Then the number of TiKV pods was increased drastically to 45 over the course of 8 hours.

What did you expect to see? I expected the controller to 1) not increase the number of TiKV pods beyond the 3 specified in the CRD while PD was down, and 2) restore the cluster to 3 replicas once the PD pod was Running again.

What did you see instead? The cluster remained at 45 replicas, and I believe the controller might have some corrupt internal state. When using tkctl, I get:

$ ~/Downloads/tkctl info -t data-monitoring--tidb-cluster
Name:               data-monitoring--tidb-cluster
Namespace:          production
CreationTimestamp:  2019-09-25 10:08:59 -0700 PDT
Overview:
         Phase   Ready  Desired  CPU   Memory  Storage  Version
         -----   -----  -------  ---   ------  -------  -------
  PD:    Normal  1      1        100m  256Mi   1Gi      pingcap/pd:v3.0.1    
  TiKV:  Normal  45     45       250m  2Gi     20Gi     pingcap/tikv:v3.0.1  
  TiDB   Normal  1      1        250m  1Gi              pingcap/tidb:v3.0.1  
Endpoints(ClusterIP):
  Cluster IP:  172.20.124.214

but the CRD correctly shows that the desired count is 3:

$ k get tidbcluster
NAME                            PD                  STORAGE   READY   DESIRE   TIKV                  STORAGE   READY   DESIRE   TIDB                  READY   DESIRE
data-monitoring--tidb-cluster   pingcap/pd:v3.0.1   1Gi       1       1        pingcap/tikv:v3.0.1   20Gi      45      3        pingcap/tidb:v3.0.1   1       1

Restarting the controller pod did not help. Setting the StatefulSet size to 3 directly made the controller bring it back to 8. Then setting it to 3 again made it stay at 3. However, the controller logs contain entries like

E0927 15:14:00.760483       1 tidb_cluster_controller.go:245] TidbCluster: production/data-monitoring--tidb-cluster, sync failed pod "data-monitoring--tidb-cluster-tikv-29" not found, requeuing

for each pod that is missing, and those pods still appear as part of the CRD status.

Supplementary material: CRD status: https://pastebin.com/haQwqa6Q. Pod list: https://pastebin.com/1nGR98Lv

tennix commented 4 years ago

@mirandaconrado The increased pod count is caused by auto failover.

Setting replicas to 3 in the CRD means the user wants 3 healthy TiKV stores. When a TiKV store has been failing for some time (35 minutes by default), tidb-operator will try to add more TiKV pods to recover the cluster to 3 healthy stores, just as the Deployment controller adds more pods when a pod of a Deployment fails, so that the number of healthy pods matches the Deployment's spec replicas.

But failing over from 3 to 45 is unreasonable; tidb-operator needs to limit the failover count. In a case like this, no matter how many pods are added, the cluster will never return to normal, and users must recover it manually.

weekface commented 4 years ago

@mirandaconrado Thank you very much for your feedback.

For now, you can remove the failover Pods by manually removing the status.tikv.failureStores field of the TidbCluster object: https://pingcap.com/docs/v3.0/tidb-in-kubernetes/maintain/auto-failover/#failover-with-tikv
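
A minimal sketch of that manual edit, using the namespace and cluster name from this issue (the linked doc above is the authoritative procedure):

$ kubectl -n production edit tidbcluster data-monitoring--tidb-cluster
# in the editor, delete the entries under status.tikv.failureStores and save;
# the operator will then stop treating those stores as failed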

What do the logs of the failed TiKV pods show? The failover may have been caused by too-small resources; these are the resource recommendations for development and test environments: https://pingcap.com/docs/v3.0/how-to/deploy/hardware-recommendations/#development-and-test-environments

And we will add a maxFailoverCount limit for TiKV: https://github.com/pingcap/tidb-operator/issues/508
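
Once that lands, the limit would presumably sit next to the existing replica count in the TiKV section of the TidbCluster spec; this is a rough sketch only, with the exact field name and placement up to #508:

spec:
  tikv:
    replicas: 3
    # upper bound on extra pods that auto failover may create (sketch, pending #508)
    maxFailoverCount: 3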

mirandaconrado commented 4 years ago

@tennix thanks for the explanation! It makes sense for the controller to create different pods to bring things back to normal. The part that surprised me was that it didn't kill the failed pod and bring up a new one in its place, but the documentation clearly explains that StatefulSets don't allow that.

Thanks for the pointer, @weekface! I'll definitely read more of the maintenance documentation. And thanks for considering the failover limit. I checked the logs and didn't see anything funky. I'm going to separate our TiDB setup from the other explorations done by other folks, to get better and more self-contained logs. My best understanding is that PD was overwhelmed due to the very low amount of resources and crashed OOM in a loop, so it was most likely a mistake on my end.

For my understanding, is there a reason why the operator doesn't automatically detect that the failure is gone? Also, if I delete the store from the status failures, will the controller automatically scale the StatefulSet back down? I want to make sure I'm not messing anything up when I do that manually.

tennix commented 4 years ago

> For my understanding, is there a reason why the operator doesn't automatically detect that the failure is gone? Also, if I delete the store from the status failures, will the controller automatically scale the StatefulSet back down? I want to make sure I'm not messing anything up when I do that manually.

Recovering to the original state requires deleting the pod and PVC, which both has a performance cost and is dangerous. We'll implement this feature when we have more confidence about the actual state of the environment and the cluster.

Once the cluster has really returned to normal, deleting the store from the status failures will trigger a safe scale-in: the TiKV store is taken offline first, and then the pod is deleted.
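
A quick sanity check before editing anything is to compare the store states the operator records in the TidbCluster status (the same fields as in the CRD status pastebin above):

$ kubectl -n production get tidbcluster data-monitoring--tidb-cluster -o yaml
# every store listed under status.tikv.stores should report state: Up
# before the matching entries are removed from status.tikv.failureStores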

BTW, for a production setup, it's strongly suggested to deploy 3 PD pods.

mirandaconrado commented 4 years ago

Yep, I totally agree that's pretty dangerous. I just wanted to confirm this is a safety measure and that I'm not missing anything. Good to know the controller will handle things cleanly after I give it the green light.

And yeah, we'll most definitely run HA in production. Right now I'm mainly experimenting with TiDB for some use cases we have.

I'm inclined to close this ticket, as #508 is a more appropriate description of the improvement that would address what originated it. But I'm not sure which process your team uses: close this one and use the other, or keep it open for reference. I'm fine with either (or maybe that won't be necessary since there's already a PR!).

tennix commented 4 years ago

We prefer to close this issue and use #508 to track the problem.