pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0
1.24k stars 499 forks source link

PD not scheduled #682

Closed gregwebs closed 5 years ago

gregwebs commented 5 years ago

Bug Report

This may be the same as #602. pd-0 is Pending and is not seen in the scheduler logs. pd-1 and pd-2 are waiting for pd-0 to be scheduled.

What version of Kubernetes are you using? 1.13.6

What version of TiDB Operator are you using?

After seeing this problem I tried running the latest master and I also tried running rc1. Neither will schedule PD.

Logs

The kube-scheduler log repeats with

I0723 14:52:00.630891       1 trace.go:76] Trace[684514216]: "Scheduling tidb7/demo-pd-1" (started: 2019-07-23 14:52:00.242008775 +0000 UTC m=+114.426597317) (total time: 388.821399ms):
Trace[684514216]: [388.821399ms] [388.803763ms] END
E0723 14:52:00.631042       1 factory.go:1519] Error scheduling tidb7/demo-pd-1: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0723 14:52:00.632989       1 scheduler.go:546] error selecting node for pod: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500
I0723 14:52:01.031318       1 trace.go:76] Trace[2085501207]: "Scheduling tidb7/demo-pd-2" (started: 2019-07-23 14:52:00.643467032 +0000 UTC m=+114.828055557) (total time: 387.809832ms):
Trace[2085501207]: [387.809832ms] [387.79063ms] END
E0723 14:52:01.031646       1 factory.go:1519] Error scheduling tidb7/demo-pd-2: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500; retrying
E0723 14:52:01.034107       1 scheduler.go:546] error selecting node for pod: Failed filter with extender at URL http://127.0.0.1:10262/scheduler/filter, code 500
I0723 14:52:01.432920       1 trace.go:76] Trace[1957636752]: "Scheduling tidb7/demo-pd-1" (started: 2019-07-23 14:52:01.041822114 +0000 UTC m=+115.226410651) (total time: 391.045319ms):
Trace[1957636752]: [391.045319ms] [391.008998ms] END

tidb-scheduler repeats with

entering predicate: HighAvailability, nodes: [gke-operator-default-pool-3e1fd1a0-88wc gke-operator-default-pool-d33c8b05-cwg6 gke-operator-tidb-n1-highcpu-2-0-1ccc9d2b-x743 gke-operator-tidb-n1-highcpu-2-0-690b4b11-zh8b gke-operator-tidb-n1-highcpu-2-0-fb2dbf14-c0ms]
E0723 15:24:42.035867       1 mux.go:107] unable to filter nodes: waiting for Pod tidb7/demo-pd-0 scheduling
I0723 15:24:42.137336       1 scheduler.go:105] scheduling pod: tidb7/demo-pd-1
I0723 15:24:42.138256       1 scheduler.go:108] entering predicate: HighAvailability, nodes: [gke-operator-default-pool-3e1fd1a0-88wc gke-operator-default-pool-d33c8b05-cwg6 gke-operator-tidb-n1-highcpu-2-0-1ccc9d2b-x743 gke-operator-tidb-n1-highcpu-2-0-690b4b11-zh8b gke-operator-tidb-n1-highcpu-2-0-fb2dbf14-c0ms]
E0723 15:24:42.436379       1 mux.go:107] unable to filter nodes: waiting for Pod tidb7/demo-pd-0 scheduling
I0723 15:24:42.443622       1 scheduler.go:105] scheduling pod: tidb7/demo-pd-2
gregwebs commented 5 years ago

pd-0 has a bound cloud disk already, whereas the others do not. It seems as if pd-0 was getting scheduled to a node but then it didn't work out. However, there are no Events for the pd-0 pod (or the pvc or the statefulset). Maybe the pod failed and got re-created?

kubectl get pvc -n tidb7 | grep pd                                                                                                                         
NAME               STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-demo-pd-0       Bound     pvc-dc8019c5-ad4c-11e9-a5c8-4201ac1f4009   5Gi        RWO            pd-ssd-wait     137m
pd-demo-pd-1       Pending                                                                        pd-ssd-wait     137m
pd-demo-pd-2       Pending                                                                        pd-ssd-wait     137m
cofyc commented 5 years ago

the probeTime of the pod is not recorded by the scheduler.

$ kubectl -n tidb7 get pods demo-pd-0  -ojsonpath='{.status}{"\n"}'
map[phase:Pending qosClass:Burstable]

you can follow the PR https://github.com/pingcap/tidb-operator/pull/611/files to fix the RBAC permission of the tidb-scheduler

gregwebs commented 5 years ago

Okay, thanks! Why doesn't an error show up in the scheduler when this happens?

gregwebs commented 5 years ago

We can close this out, but I think it is a bug if these issues don't show up in our logs.

cofyc commented 5 years ago

I've submitted a PR to log an error in kube-scheduler if the API request to update the status of the pod fails.

gregwebs commented 5 years ago

Thank you! One of the worst errors to deal with is silent permission issues. I am surprised that k8s is not already linting for unhandled errors.