pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

E2E Test failure #544

Closed: tkanng closed this issue 5 years ago

tkanng commented 5 years ago

Bug Report

What version of Kubernetes are you using?

root@iZhp37kmiszbkwzt5oh9csZ:~/k# kubectl version
Client Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.0-alpha.1.222+0b8566f3887a2d", GitCommit:"0b8566f3887a2d13baba623d88aeb64f6e637f46", GitTreeState:"clean", BuildDate:"2019-05-04T08:58:31Z", GoVersion:"go1.11.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}

What version of TiDB Operator are you using?

root@iZhp37kmiszbkwzt5oh9csZ:~/k# kubectl  exec tidb-controller-manager-6c4dc9fbcf-fwfs4  -n pingcap  -- tidb-controller-manager -V
TiDB Operator Version: version.Info{TiDBVersion:"2.1.0", GitVersion:"v1.0.0-beta.2.34+37d693366bc562", GitCommit:"37d693366bc5622c5975de08fbb4df6fc78a9000", GitTreeState:"clean", BuildDate:"2019-05-31T05:37:32Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

root@iZhp37kmiszbkwzt5oh9csZ:~/k# kubectl get pvc -n e2e-cluster1
NAME                       STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS    AGE
pd-e2e-cluster1-pd-0       Bound    local-pv-9371219a   39Gi       RWO            local-storage   20m
pd-e2e-cluster1-pd-1       Bound    local-pv-c52702d3   39Gi       RWO            local-storage   20m
pd-e2e-cluster1-pd-2       Bound    local-pv-6635374b   39Gi       RWO            local-storage   20m
tikv-e2e-cluster1-tikv-0   Bound    local-pv-b3d1a2e9   39Gi       RWO            local-storage   19m
tikv-e2e-cluster1-tikv-1   Bound    local-pv-4982aa4a   39Gi       RWO            local-storage   19m
tikv-e2e-cluster1-tikv-2   Bound    local-pv-c4277489   39Gi       RWO            local-storage   19m
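
For completeness, the question above also asks about the storage classes themselves; only the PVC view is captured here. They can be listed with standard kubectl commands, e.g.:

kubectl get storageclass
kubectl get storageclass local-storage -o yaml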

What's the status of the TiDB cluster pods?

root@iZhp37kmiszbkwzt5oh9csZ:~/k# kubectl get pod -n e2e-cluster2
NAME                                           READY   STATUS              RESTARTS   AGE
e2e-cluster2-discovery-5bc5945fb6-47jr4        1/1     Running             0          21m
e2e-cluster2-monitor-7875db8d87-45jvj          2/2     Running             0          21m
e2e-cluster2-pd-0                              1/1     Running             1          21m
e2e-cluster2-pd-1                              1/1     Running             0          21m
e2e-cluster2-pd-2                              1/1     Running             1          21m
e2e-cluster2-tidb-0                            1/1     Running             0          19m
e2e-cluster2-tidb-1                            1/1     Running             0          19m
e2e-cluster2-tidb-initializer-mtgxg            0/1     Completed           4          21m
e2e-cluster2-tikv-0                            1/1     Running             0          20m
e2e-cluster2-tikv-1                            1/1     Running             0          20m
e2e-cluster2-tikv-2                            1/1     Running             0          20m
e2e-pd-replicas-1-discovery-7995fc7fcb-qg7cn   1/1     Running             0          21m
e2e-pd-replicas-1-pd-0                         1/1     Running             0          21m
e2e-pd-replicas-1-tidb-0                       1/1     Running             0          20m
e2e-pd-replicas-1-tidb-1                       1/1     Running             0          20m
e2e-pd-replicas-1-tidb-initializer-4dzlz       0/1     Completed           3          21m
e2e-pd-replicas-1-tikv-0                       0/1     ContainerCreating   0          21m
e2e-pd-replicas-1-tikv-1                       1/1     Running             0          21m
e2e-pd-replicas-1-tikv-2                       1/1     Running             0          21m

What did you do?

Ran the e2e test in a Docker-in-Docker (DinD) Kubernetes cluster.

What did you expect to see?

All clusters work.

What did you see instead?

e2e-pd-replicas-1-tikv-0 failed to start: tidb-scheduler assigned this pod to kube-node-1, but its local PV local-pv-d20c2706 is on kube-node-3. Here is the output of kubectl describe pod e2e-pd-replicas-1-tikv-0 -n e2e-cluster2:

Name:               e2e-pd-replicas-1-tikv-0
Namespace:          e2e-cluster2
Priority:           0
PriorityClassName:  <none>
Node:               kube-node-1/10.192.0.3
Start Time:         Sun, 02 Jun 2019 13:44:15 +0800
Labels:             app.kubernetes.io/component=tikv
                    app.kubernetes.io/instance=e2e-pd-replicas-1
                    app.kubernetes.io/managed-by=tidb-operator
                    app.kubernetes.io/name=tidb-cluster
                    controller-revision-hash=e2e-pd-replicas-1-tikv-6bbccdf9df
                    statefulset.kubernetes.io/pod-name=e2e-pd-replicas-1-tikv-0
                    tidb.pingcap.com/cluster-id=6697805017831139367
Annotations:        pingcap.com/last-applied-configuration:
                      {"volumes":[{"name":"annotations","downwardAPI":{"items":[{"path":"annotations","fieldRef":{"fieldPath":"metadata.annotations"}}]}},{"name...
                    prometheus.io/path: /metrics
                    prometheus.io/port: 20180
                    prometheus.io/scrape: true
Status:             Pending
IP:                 
Controlled By:      StatefulSet/e2e-pd-replicas-1-tikv
Containers:
  tikv:
    Container ID:  
    Image:         pingcap/tikv:v3.0.0-beta.1
    Image ID:      
    Port:          20160/TCP
    Host Port:     0/TCP
    Command:
      /bin/sh
      /usr/local/bin/tikv_start_script.sh
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      NAMESPACE:              e2e-cluster2 (v1:metadata.namespace)
      CLUSTER_NAME:           e2e-pd-replicas-1
      HEADLESS_SERVICE_NAME:  e2e-pd-replicas-1-tikv-peer
      CAPACITY:               0
      TZ:                     UTC
    Mounts:
      /etc/podinfo from annotations (ro)
      /etc/tikv from config (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/tikv from tikv (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l752v (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  tikv:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  tikv-e2e-pd-replicas-1-tikv-0
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      e2e-pd-replicas-1-tikv
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      e2e-pd-replicas-1-tikv
    Optional:  false
  default-token-l752v:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l752v
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                      From                  Message
  ----     ------            ----                     ----                  -------
  Normal   Scheduled         6m59s                    tidb-scheduler        Successfully assigned e2e-cluster2/e2e-pd-replicas-1-tikv-0 to kube-node-1
  Warning  FailedScheduling  6m58s                    tidb-scheduler        failed to get cached bindings for pod "e2e-cluster2/e2e-pd-replicas-1-tikv-0"
  Warning  FailedMount       119s (x2991 over 6m59s)  kubelet, kube-node-1  MountVolume.NodeAffinity check failed for volume "local-pv-d20c2706" : No matching NodeSelectorTerms

Here is the output of kubectl get pv local-pv-d20c2706 -oyaml:

root@iZhp37kmiszbkwzt5oh9csZ:~/k# kubectl get pv local-pv-d20c2706 -oyaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: local-volume-provisioner-kube-node-3-a513d2d4-81cb-11e9-8ace-024217bd333d
    tidb.pingcap.com/pod-name: e2e-pd-replicas-1-tikv-0
  creationTimestamp: "2019-06-02T05:36:48Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    app.kubernetes.io/component: tikv
    app.kubernetes.io/instance: e2e-pd-replicas-1
    app.kubernetes.io/managed-by: tidb-operator
    app.kubernetes.io/name: tidb-cluster
    app.kubernetes.io/namespace: e2e-cluster2
    kubernetes.io/hostname: kube-node-3
    tidb.pingcap.com/cluster-id: "6697805017831139367"
  name: local-pv-d20c2706
  resourceVersion: "843259"
  selfLink: /api/v1/persistentvolumes/local-pv-d20c2706
  uid: 6a4e6fb6-84f8-11e9-8ace-024217bd333d
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 39Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: tikv-e2e-pd-replicas-1-tikv-0
    namespace: e2e-cluster2
    resourceVersion: "842958"
    uid: 74a7cbda-84f9-11e9-8ace-024217bd333d
  local:
    path: /mnt/disks/vol9
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - kube-node-3
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
status:
  phase: Bound
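
A quick way to see the mismatch directly is to compare the node the pod was scheduled to with the node required by the PV's node affinity; the jsonpath expressions below are just one way to do that, based on the describe/yaml output above:

kubectl -n e2e-cluster2 get pod e2e-pd-replicas-1-tikv-0 -o jsonpath='{.spec.nodeName}'
kubectl get pv local-pv-d20c2706 -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}'

The first command returns kube-node-1 and the second kube-node-3, which is why the kubelet's MountVolume.NodeAffinity check fails.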

Here is the output of kubectl get pv before the e2e test:

NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS    REASON   AGE
local-pv-1767facf   39Gi       RWO            Delete           Available           local-storage            7m20s
local-pv-1dbd65bc   39Gi       RWO            Delete           Available           local-storage            6m50s
local-pv-2a83a815   39Gi       RWO            Delete           Available           local-storage            78s
local-pv-37c274de   39Gi       RWO            Delete           Available           local-storage            80s
local-pv-3ece239f   39Gi       RWO            Delete           Available           local-storage            7m28s
local-pv-45324aa3   39Gi       RWO            Delete           Available           local-storage            35m
local-pv-4982aa4a   39Gi       RWO            Delete           Available           local-storage            36m
local-pv-62446ab1   39Gi       RWO            Delete           Available           local-storage            50s
local-pv-6635374b   39Gi       RWO            Delete           Available           local-storage            30s
local-pv-66d8020f   39Gi       RWO            Delete           Available           local-storage            6m48s
local-pv-67e0e52d   39Gi       RWO            Delete           Available           local-storage            6m30s
local-pv-7e1a02ed   39Gi       RWO            Delete           Available           local-storage            6m28s
local-pv-820ea0a0   39Gi       RWO            Delete           Available           local-storage            62m
local-pv-82ff2b19   39Gi       RWO            Delete           Available           local-storage            6m30s
local-pv-8a0a2eb0   39Gi       RWO            Delete           Available           local-storage            28s
local-pv-8ebb22ac   39Gi       RWO            Delete           Available           local-storage            6m48s
local-pv-914e5926   39Gi       RWO            Delete           Available           local-storage            36m
local-pv-9371219a   39Gi       RWO            Delete           Available           local-storage            6m28s
local-pv-a6e4a208   39Gi       RWO            Delete           Available           local-storage            35m
local-pv-b3d1a2e9   39Gi       RWO            Delete           Available           local-storage            4d
local-pv-bf3146fc   39Gi       RWO            Delete           Available           local-storage            18s
local-pv-c4277489   39Gi       RWO            Delete           Available           local-storage            7m28s
local-pv-c52702d3   39Gi       RWO            Delete           Available           local-storage            7m28s
local-pv-cfa833c6   39Gi       RWO            Delete           Available           local-storage            78s
local-pv-d20c2706   39Gi       RWO            Delete           Available           local-storage            6m50s
local-pv-d4b44d8    39Gi       RWO            Delete           Available           local-storage            47m
local-pv-e62c3e22   39Gi       RWO            Delete           Available           local-storage            8s
local-pv-e93f8428   39Gi       RWO            Delete           Available           local-storage            0s
local-pv-f1f39fe7   39Gi       RWO            Delete           Available           local-storage            6m48s
local-pv-f2cc9d77   39Gi       RWO            Delete           Available           local-storage            78s
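
Since the local-volume-provisioner labels each PV with the node it lives on (kubernetes.io/hostname, as seen in the PV yaml above), a per-node view of these PVs can be recovered with a label selector, for example:

kubectl get pv -l kubernetes.io/hostname=kube-node-3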

Is there something wrong with tidb-scheduler?

:)

weekface commented 5 years ago

@cofyc Can you take a look at it?

cofyc commented 5 years ago

In the tidb-scheduler pod, it is the kube-scheduler container that actually schedules the pod. This looks like a race condition in kube-scheduler 1.12.1.
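
To confirm which kube-scheduler version ships in that pod, the container images can be inspected; the pod name below is a placeholder, and this assumes tidb-scheduler runs in the same pingcap namespace as the controller-manager above:

kubectl -n pingcap get pod tidb-scheduler-xxxxxxxxxx-xxxxx \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.image}{"\n"}{end}'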

cofyc commented 5 years ago

It's complex. FYI, I explained the races on the pod binding cache in pre-1.14 Kubernetes here.

tkanng commented 5 years ago

Thank you so much! :+1:

weekface commented 5 years ago

Duplicate of https://github.com/pingcap/tidb-operator/issues/602#issuecomment-511317477, closing this issue.