pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

Deployment to AWS fails due to timeout when scheduled backup is enabled #633

Closed · sokada1221 closed this issue 5 years ago

sokada1221 commented 5 years ago

Bug Report

What version of Kubernetes are you using?

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.1", GitCommit:"b7394102d6ef778017f2ca4046abbaa23b88c290", GitTreeState:"clean", BuildDate:"2019-04-19T22:12:47Z", GoVersion:"go1.12.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.6-eks-d69f1b", GitCommit:"d69f1bf3669bf00b7f4a758e978e0e7a1e3a68f7", GitTreeState:"clean", BuildDate:"2019-02-28T20:26:10Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using? latest master

$ kubectl exec -n tidb-admin tidb-controller-manager-545d6c854d-xhrzx -- tidb-controller-manager -V
TiDB Operator Version: version.Info{TiDBVersion:"2.1.0", GitVersion:"v1.0.0-beta.3", GitCommit:"6257dfaad68f55f745f20f6f5d19b10bea2b0bea", GitTreeState:"clean", BuildDate:"2019-06-06T09:51:04Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

$ kubectl get sc
NAME            PROVISIONER                    AGE
ebs-gp2         kubernetes.io/aws-ebs          17h
gp2 (default)   kubernetes.io/aws-ebs          17h
local-storage   kubernetes.io/no-provisioner   17h
$ kubectl get pvc -n shinno-cluster
NAME                              STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-shinno-cluster-pd-0            Bound     pvc-f3bd7c65-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         18h
pd-shinno-cluster-pd-1            Bound     pvc-f3bf7154-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         18h
pd-shinno-cluster-pd-2            Bound     pvc-f3c186e2-9dd5-11e9-a2dd-0a8e4e0a1ba8   1Gi        RWO            ebs-gp2         18h
shinno-cluster-monitor            Bound     pvc-f2c948ca-9dd5-11e9-81c4-026779400d00   100Gi      RWO            ebs-gp2         18h
shinno-cluster-scheduled-backup   Pending                                                                        ebs-gp2         7m5s
tikv-shinno-cluster-tikv-0        Bound     local-pv-2ced23ed                          366Gi      RWO            local-storage   18h
tikv-shinno-cluster-tikv-1        Bound     local-pv-6935efbf                          366Gi      RWO            local-storage   18h
tikv-shinno-cluster-tikv-2        Bound     local-pv-facd00f4                          366Gi      RWO            local-storage   18h

What's the status of the TiDB cluster pods?

$ kubectl get po -n shinno-cluster -o wide
NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE
shinno-cluster-discovery-d6c4df7f-m5ht6   1/1     Running   0          18h   10.0.54.124   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-monitor-55f87b9755-djq2h   2/2     Running   0          18h   10.0.58.189   ip-10-0-62-124.us-east-2.compute.internal   <none>
shinno-cluster-pd-0                       1/1     Running   0          18h   10.0.52.76    ip-10-0-52-61.us-east-2.compute.internal    <none>
shinno-cluster-pd-1                       1/1     Running   1          18h   10.0.30.162   ip-10-0-27-179.us-east-2.compute.internal   <none>
shinno-cluster-pd-2                       1/1     Running   0          18h   10.0.46.121   ip-10-0-45-73.us-east-2.compute.internal    <none>
shinno-cluster-tidb-0                     1/1     Running   0          18h   10.0.61.93    ip-10-0-53-222.us-east-2.compute.internal   <none>
shinno-cluster-tidb-1                     1/1     Running   0          18h   10.0.43.87    ip-10-0-40-191.us-east-2.compute.internal   <none>
shinno-cluster-tikv-0                     1/1     Running   1          18h   10.0.24.56    ip-10-0-19-9.us-east-2.compute.internal     <none>
shinno-cluster-tikv-1                     1/1     Running   1          18h   10.0.55.241   ip-10-0-48-170.us-east-2.compute.internal   <none>
shinno-cluster-tikv-2                     1/1     Running   0          18h   10.0.46.237   ip-10-0-34-248.us-east-2.compute.internal   <none>

What did you do?

  1. Add the following to deploy/aws/default-cluster.yaml:
    scheduledBackup:
      create: true
      # https://github.com/pingcap/tidb-cloud-backup
      mydumperImage: pingcap/tidb-cloud-backup:20190610
      mydumperImagePullPolicy: IfNotPresent
      # storageClassName: a StorageClass provides a way for administrators to describe
      # the "classes" of storage they offer. Different classes might map to
      # quality-of-service levels, to backup policies, or to arbitrary policies
      # determined by the cluster administrators.
      # Refer to https://kubernetes.io/docs/concepts/storage/storage-classes
      storageClassName: ebs-gp2
      storage: 100Gi
      # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#schedule
      schedule: "0 0 */1 * *"
      # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#suspend
      suspend: false
      # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#jobs-history-limits
      successfulJobsHistoryLimit: 3
      failedJobsHistoryLimit: 1
      # https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#starting-deadline
      startingDeadlineSeconds: 3600
      # https://github.com/maxbube/mydumper/blob/master/docs/mydumper_usage.rst#options
      options: "--verbose=3"
      # secretName is the name of the secret which stores the user and password used for the backup.
      # Note: you must give the user enough privileges to perform the backup.
      # You can create the secret with:
      # kubectl create secret generic backup-secret --from-literal=user=root --from-literal=password=<password>
      secretName: backup-secret
      # backup to S3
      s3:
        region: "us-east-2"
        bucket: "tidb-test"
        # secretName is the name of the secret which stores the S3 access key and secret key.
        # You can create the secret with:
        # kubectl create secret generic s3-backup-secret --from-literal=access_key=<access-key> --from-literal=secret_key=<secret-key>
        secretName: s3-backup-secret
  2. terraform apply from deploy/aws
  3. helm release fails

What did you expect to see? The Helm deployment completes within the default timeout (5 minutes?) so that terraform apply completes successfully.

What did you see instead? terraform apply fails due to a timeout during the Helm release.

$ terraform apply
...
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 4m0s elapsed]
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 4m10s elapsed]
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 4m20s elapsed]
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 4m30s elapsed]
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 4m40s elapsed]
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 4m50s elapsed]
module.default-cluster.helm_release.tidb-cluster: Still modifying... [id=shinno-cluster, 5m0s elapsed]

Error: rpc error: code = Unknown desc = timed out waiting for the condition

  on tidb-cluster/cluster.tf line 24, in resource "helm_release" "tidb-cluster":
  24: resource "helm_release" "tidb-cluster" {

The root cause is that the PVC is Pending, waiting for the cronjob to be triggered for the first time: as the events below show, the volume won't bind until a first consumer (the backup pod) is created. The Helm release stays pending until the PVC is bound.

$ kubectl describe pvc shinno-cluster-scheduled-backup -n shinno-cluster
Name:          shinno-cluster-scheduled-backup
Namespace:     shinno-cluster
StorageClass:  ebs-gp2
Status:        Pending
Volume:        
Labels:        app.kubernetes.io/component=scheduled-backup
               app.kubernetes.io/instance=shinno-cluster
               app.kubernetes.io/managed-by=tidb-operator
               app.kubernetes.io/name=tidb-cluster
               helm.sh/chart=tidb-cluster-v1.0.0-beta.3
               pingcap.com/backup-cluster-name=shinno-cluster
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
Events:
  Type       Reason                Age                From                         Message
  ----       ------                ----               ----                         -------
  Normal     WaitForFirstConsumer  4s (x43 over 10m)  persistentvolume-controller  waiting for first consumer to be created before binding
Mounted By:  <none>
$ kubectl get cronjobs -n shinno-cluster
NAME                              SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
shinno-cluster-scheduled-backup   0 0 */1 * *   False     0        <none>          10m
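One way to force the PVC to bind without waiting for the schedule would be to trigger the cronjob by hand so the volume gets its first consumer. This is a sketch using the resource names from the output above; the job name `manual-backup-1` is made up:

```shell
# Run the scheduled-backup cronjob once, immediately (job name is arbitrary).
kubectl create job manual-backup-1 \
  --from=cronjob/shinno-cluster-scheduled-backup \
  -n shinno-cluster

# Watch the PVC go from Pending to Bound once the backup pod is scheduled.
kubectl get pvc shinno-cluster-scheduled-backup -n shinno-cluster -w
```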

Workaround

  1. Deploy with a cron expression that fires within 5 minutes -> deployment succeeds.
  2. Change the cron expression to the one you actually want -> deployment also succeeds because the PVC is already bound.

OR

  1. Set the timeout argument on the helm_release resource so that Terraform waits until the configured cronjob fires for the first time.
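The timeout option could be sketched in the cluster module's Terraform like this (the `timeout` attribute is seconds, default 300, per the terraform-provider-helm docs; the 7200 value is an arbitrary example, not a recommendation):

```hcl
resource "helm_release" "tidb-cluster" {
  # ... existing name/chart/values arguments unchanged ...

  # Wait longer than the default 300s so the first scheduled backup run
  # can create the pod that binds the PVC (example: 2 hours).
  timeout = 7200
}
```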
aylei commented 5 years ago

Thanks for your detailed report, Shinno. The problem is that the helm_release resource waits for all resources to be running by default. I think we should disable waiting for the Helm release; then the behavior will be the same as an ordinary helm install.
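Disabling the wait would look roughly like this in the module's helm_release resource (the `wait` attribute exists in terraform-provider-helm and defaults to true; this is a sketch, not the actual patch):

```hcl
resource "helm_release" "tidb-cluster" {
  # ... existing arguments unchanged ...

  # Don't block until every resource is ready; return as soon as the
  # manifests are applied, like a plain `helm install`.
  wait = false
}
```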

sokada1221 commented 5 years ago

Sounds good Aylei! Thanks.