pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

scheduler cannot schedule TiKV #602

Closed gregwebs closed 5 years ago

gregwebs commented 5 years ago

Bug Report

The scheduler continually logs:

E0621 13:50:45.693332       1 mux.go:107] unable to filter nodes: waiting for Pod tidb21/demo-tikv-2 scheduling
I0621 13:50:45.706780       1 scheduler.go:105] scheduling pod: tidb21/demo-tikv-1
I0621 13:50:45.707363       1 scheduler.go:108] entering predicate: HighAvailability, nodes: [gke-alpha-tidb-n1-standard-2-375-63186231-flpw gke-alpha-tidb-n1-standard-2-375-c798b6cf-1zn1 gke-alpha-tidb-n1-standard-2-375-f079b5d0-b9q9]
E0621 13:50:46.093051       1 mux.go:107] unable to filter nodes: waiting for Pod tidb21/demo-tikv-2 scheduling
I0621 13:50:46.104954       1 scheduler.go:105] scheduling pod: tidb21/demo-tikv-0
I0621 13:50:46.104985       1 scheduler.go:108] entering predicate: HighAvailability, nodes: [gke-alpha-tidb-n1-standard-2-375-63186231-flpw gke-alpha-tidb-n1-standard-2-375-c798b6cf-1zn1 gke-alpha-tidb-n1-standard-2-375-f079b5d0-b9q9]

The kube-scheduler log is similar.

There are no events for tikv-2 when it is described.

I got into this state by creating a TiDB cluster, then creating a second TiDB cluster and deleting the first one. I also deleted the PVs left in the Released state.
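A rough sketch of that sequence (the first cluster's release and namespace names here are illustrative, not the exact ones I used; helm 2 syntax):

# create cluster A, then cluster B, then delete cluster A
helm install pingcap/tidb-cluster --name=demo-a --namespace=tidb20
helm install pingcap/tidb-cluster --name=demo --namespace=tidb21
helm delete demo-a --purge
# clean up the PVs left behind in Released state
kubectl get pv | grep Released
kubectl delete pv <pv-name>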

weekface commented 5 years ago
kubectl describe po -n tidb21 demo-tikv-2
kubectl get pvc -n tidb21
kubectl get pv
gregwebs commented 5 years ago

kubectl describe po -n tidb21 demo-tikv-2

Name:               demo-tikv-2
Namespace:          tidb21
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             app.kubernetes.io/component=tikv
                    app.kubernetes.io/instance=tidb21
                    app.kubernetes.io/managed-by=tidb-operator
                    app.kubernetes.io/name=tidb-cluster
                    controller-revision-hash=demo-tikv-874b8bf89
                    statefulset.kubernetes.io/pod-name=demo-tikv-2
Annotations:        pingcap.com/last-applied-configuration:
                      {"volumes":[{"name":"annotations","downwardAPI":{"items":[{"path":"annotations","fieldRef":{"fieldPath":"metadata.annotations"}}]}},{"name...
                    prometheus.io/path: /metrics
                    prometheus.io/port: 20180
                    prometheus.io/scrape: true
Status:             Pending
IP:                 
Controlled By:      StatefulSet/demo-tikv
Init Containers:
  wait-for-pd:
    Image:      gcr.io/pingcap-tidb-alpha/tidb-operator:v1.0.0-beta.3.start-fast-16
    Port:       <none>
    Host Port:  <none>
    Command:
      wait-for-pd
    Environment:
      NAMESPACE:     tidb21 (v1:metadata.namespace)
      CLUSTER_NAME:  demo
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x6265 (ro)
Containers:
  tikv:
    Image:      pingcap/tikv:v3.0.0-rc.1
    Port:       20160/TCP
    Host Port:  0/TCP
    Command:
      /bin/sh
      /usr/local/bin/tikv_start_script.sh
    Requests:
      cpu:     1
      memory:  2Gi
    Environment:
      NAMESPACE:              tidb21 (v1:metadata.namespace)
      CLUSTER_NAME:           demo
      HEADLESS_SERVICE_NAME:  demo-tikv-peer
      CAPACITY:               0
      TZ:                     UTC
    Mounts:
      /etc/podinfo from annotations (ro)
      /etc/tikv from config (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/tikv from tikv (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-x6265 (ro)
Volumes:
  tikv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tikv-demo-tikv-2
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      demo-tikv
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      demo-tikv
    Optional:  false
  default-token-x6265:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-x6265
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 tidb.pingcap.com/tidb-scaler=n1-standard-2-375:NoSchedule
Events:          <none>
gregwebs commented 5 years ago

kubectl get pvc -n tidb21

NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-demo-pd-0       Bound     pvc-21708809-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            pd-ssd-wait     110m
pd-demo-pd-1       Bound     pvc-2175417b-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            pd-ssd-wait     110m
pd-demo-pd-2       Bound     pvc-217981bc-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            pd-ssd-wait     110m
tikv-demo-tikv-0   Pending                                                                        local-storage   110m
tikv-demo-tikv-1   Pending                                                                        local-storage   110m
tikv-demo-tikv-2   Bound     local-pv-3c9d1093                          368Gi      RWO            local-storage   110m
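Since tikv-demo-tikv-0 and tikv-demo-tikv-1 are still Pending, their events may show what they are waiting on. Note that local-storage classes normally use volumeBindingMode: WaitForFirstConsumer, so a Pending PVC is expected until its pod is scheduled:

kubectl describe pvc -n tidb21 tikv-demo-tikv-0
kubectl describe pvc -n tidb21 tikv-demo-tikv-1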
gregwebs commented 5 years ago

kubectl get pv

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                               STORAGECLASS    REASON   AGE
local-pv-1c02244d                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-3c9d1093                          368Gi      RWO            Retain           Bound       tidb21/tikv-demo-tikv-2             local-storage            113m
local-pv-52bb53c                           368Gi      RWO            Delete           Available                                       local-storage            110m
local-pv-5e3c2064                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-69e7f7f9                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-6a9c1bf9                          368Gi      RWO            Delete           Available                                       local-storage            110m
local-pv-82f4cde9                          368Gi      RWO            Delete           Available                                       local-storage            15h
local-pv-8b5c80f4                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-92134e5d                          368Gi      RWO            Delete           Available                                       local-storage            21h
local-pv-92524f84                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-99d360f                           368Gi      RWO            Delete           Available                                       local-storage            20h
local-pv-a2973354                          368Gi      RWO            Delete           Available                                       local-storage            20h
local-pv-b06a079e                          368Gi      RWO            Delete           Available                                       local-storage            21h
local-pv-b1e66ac4                          368Gi      RWO            Delete           Available                                       local-storage            110m
local-pv-ba5e9234                          368Gi      RWO            Delete           Available                                       local-storage            22h
local-pv-bb23005c                          368Gi      RWO            Delete           Available                                       local-storage            22h
local-pv-da125dd4                          368Gi      RWO            Delete           Available                                       local-storage            44h
local-pv-e8210ae5                          368Gi      RWO            Delete           Available                                       local-storage            18h
local-pv-f4f18899                          368Gi      RWO            Delete           Available                                       local-storage            22h
pvc-21708809-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            Retain           Bound       tidb21/pd-demo-pd-0                 pd-ssd-wait              112m
pvc-2175417b-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            Retain           Bound       tidb21/pd-demo-pd-1                 pd-ssd-wait              111m
pvc-217981bc-942a-11e9-aab8-4201ac1f4008   5Gi        RWO            Retain           Bound       tidb21/pd-demo-pd-2                 pd-ssd-wait              112m
pvc-a518b9e1-920e-11e9-afc9-4201ac1f4006   2Gi        RWO            Delete           Bound       operations/tidb-data-mysql-0        standard                 2d18h
pvc-b2d05151-9200-11e9-afc9-4201ac1f4006   2Gi        RWO            Delete           Bound       monitor/database-netdata-master-0   standard                 2d19h
pvc-b2d39b21-9200-11e9-afc9-4201ac1f4006   1Gi        RWO            Delete           Bound       monitor/alarms-netdata-master-0     standard                 2d19h
weekface commented 5 years ago

The tikv-2 PVC is bound, but the pod can't be scheduled, and there are no events. Could this be the kube-scheduler problem we have hit frequently in our k8s environment recently? @cofyc

gregwebs commented 5 years ago

As per #468, this blocks a new cluster from being scheduled.

The tidb-scheduler logs are listed above; the kube-scheduler log shows the same failure:

E0621 16:04:47.293641       1 factory.go:1519] Error scheduling tidb21/demo-tikv-0: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0621 16:04:47.296992       1 scheduler.go:546] error selecting node for pod: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500
E0621 16:04:47.297663       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.297676       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.297912       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.297920       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.298155       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.298163       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
I0621 16:04:47.693084       1 trace.go:76] Trace[1601680201]: "Scheduling tidb21/demo-tikv-1" (started: 2019-06-21 16:04:47.297100315 +0000 UTC m=+161567.461774652) (total time: 395.944579ms):
Trace[1601680201]: [395.944579ms] [395.883901ms] END
E0621 16:04:47.694697       1 factory.go:1519] Error scheduling tidb21/demo-tikv-1: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0621 16:04:47.701308       1 scheduler.go:546] error selecting node for pod: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500
E0621 16:04:47.702889       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.702909       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.702929       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.702950       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.703229       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
E0621 16:04:47.703242       1 predicates.go:1277] Node not found, gke-alpha-tidb-custom-6-11008-0-12ae9ca3-xp99
I0621 16:04:48.092963       1 trace.go:76] Trace[147365297]: "Scheduling tidb21/demo-tikv-0" (started: 2019-06-21 16:04:47.7020882 +0000 UTC m=+161567.866762536) (total time: 390.828437ms):
Trace[147365297]: [390.828437ms] [390.756271ms] END
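For reference, both logs come from the same tidb-scheduler pod, which runs the extender and a stock kube-scheduler side by side. Assuming a default chart install (deployment tidb-scheduler in namespace tidb-admin; names may differ):

kubectl -n tidb-admin logs deploy/tidb-scheduler -c tidb-scheduler
kubectl -n tidb-admin logs deploy/tidb-scheduler -c kube-scheduler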
cofyc commented 5 years ago

Is the unscheduled pod retried by the scheduler repeatedly? If the scheduler keeps retrying the pod but always fails, it is unrelated to the issue we found in our IDC k8s environment; there, the tidb-scheduler didn't try to schedule the new TiKV pods at all.
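Since the failing pods produce no events, one way to check for retries is to follow the scheduler log and see whether new attempts keep appearing (deployment and namespace names assumed from a default install):

kubectl -n tidb-admin logs deploy/tidb-scheduler -c kube-scheduler -f | grep demo-tikv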

gregwebs commented 5 years ago

Yes, it keeps trying to schedule.

weekface commented 5 years ago

@cofyc suggests:

or upgrade to v1.14+

@gregwebs, can you give it a try?
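If the kube-scheduler version is the culprit, bumping the image tag that the tidb-operator chart deploys might look like this (value name assumed from the chart's values.yaml of that era; verify before use):

helm upgrade tidb-operator pingcap/tidb-operator --set scheduler.kubeSchedulerImageTag=v1.14.3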

gregwebs commented 5 years ago

I filled out a form to be an alpha user of 1.14 on GKE; I am still waiting. tidb-operator is using kube-scheduler v1.13.6, which matches the GKE version at the time it was installed. I will update my version of tidb-operator.

gregwebs commented 5 years ago

I cannot reproduce this anymore.