pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

stability test: pd pod's pvc can't find a pv when failover #506

Closed · weekface closed this issue 5 years ago

weekface commented 5 years ago

Bug Report

What version of Kubernetes are you using?

What version of TiDB Operator are you using?

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

What's the status of the TiDB cluster pods?

What did you do?

The pod stability-cluster2-pd-3 was scheduled, but it is stuck in ContainerCreating:

[root@172.16.4.149 ~]# kubectl describe po -n stability-cluster2 stability-cluster2-pd-3
Name:               stability-cluster2-pd-3
Namespace:          stability-cluster2
Priority:           0
PriorityClassName:  <none>
Node:               172.16.4.153/172.16.4.153
Start Time:         Mon, 20 May 2019 16:17:46 +0800
Labels:             app.kubernetes.io/component=pd
                    app.kubernetes.io/instance=stability-cluster2
                    app.kubernetes.io/managed-by=tidb-operator
                    app.kubernetes.io/name=tidb-cluster
                    controller-revision-hash=stability-cluster2-pd-57d7c46cc6
                    statefulset.kubernetes.io/pod-name=stability-cluster2-pd-3
                    tidb.pingcap.com/cluster-id=6693005034414775405
Annotations:        pingcap.com/last-applied-configuration:
                      {"volumes":[{"name":"annotations","downwardAPI":{"items":[{"path":"annotations","fieldRef":{"fieldPath":"metadata.annotations"}}]}},{"name...
                    prometheus.io/path: /metrics
                    prometheus.io/port: 2379
                    prometheus.io/scrape: true
Status:             Pending
IP:
Controlled By:      StatefulSet/stability-cluster2-pd
Containers:
  pd:
    Container ID:
    Image:         pingcap/pd:v2.1.8
    Image ID:
    Ports:         2380/TCP, 2379/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /bin/sh
      /usr/local/bin/pd_start_script.sh
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     200m
      memory:  1Gi
    Environment:
      NAMESPACE:          stability-cluster2 (v1:metadata.namespace)
      PEER_SERVICE_NAME:  stability-cluster2-pd-peer
      SERVICE_NAME:       stability-cluster2-pd
      SET_NAME:           stability-cluster2-pd
      TZ:                 UTC
    Mounts:
      /etc/pd from config (ro)
      /etc/podinfo from annotations (ro)
      /usr/local/bin from startup-script (ro)
      /var/lib/pd from pd (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rwcnf (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  pd:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pd-stability-cluster2-pd-3
    ReadOnly:   false
  annotations:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.annotations -> annotations
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      stability-cluster2-pd
    Optional:  false
  startup-script:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      stability-cluster2-pd
    Optional:  false
  default-token-rwcnf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rwcnf
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                  From                   Message
  ----     ------       ----                 ----                   -------
  Warning  FailedMount  10s (x489 over 18h)  kubelet, 172.16.4.153  Unable to mount volumes for pod "stability-cluster2-pd-3_stability-cluster2(bf19f179-7ad7-11e9-8177-52540005d356)": timeout expired waiting for volumes to attach or mount for pod "stability-cluster2"/"stability-cluster2-pd-3". list of unmounted volumes=[pd]. list of unattached volumes=[pd annotations config startup-script default-token-rwcnf]
[root@172.16.4.149 ~]# kubectl get po -n stability-cluster2 -owide
NAME                                           READY   STATUS              RESTARTS   AGE   IP               NODE           NOMINATED NODE
stability-cluster2-discovery-6d6489844-cm2gn   1/1     Running             0          18h   10.233.96.20     172.16.4.150   <none>
stability-cluster2-discovery-6d6489844-nwhmx   1/1     Unknown             0          19h   10.233.91.107    172.16.4.154   <none>
stability-cluster2-monitor-6b7cf97f57-dkc8q    2/2     Running             0          19h   10.233.96.44     172.16.4.150   <none>
stability-cluster2-pd-0                        1/1     Unknown             0          19h   10.233.91.114    172.16.4.154   <none>
stability-cluster2-pd-1                        1/1     Running             1          19h   10.233.112.231   172.16.4.149   <none>
stability-cluster2-pd-2                        1/1     Running             0          19h   10.233.96.15     172.16.4.150   <none>
stability-cluster2-pd-3                        0/1     ContainerCreating   0          18h   <none>           172.16.4.153   <none>
stability-cluster2-tidb-0                      1/1     Running             0          19h   10.233.96.40     172.16.4.150   <none>
stability-cluster2-tidb-1                      1/1     Running             0          19h   10.233.66.130    172.16.4.153   <none>
stability-cluster2-tikv-0                      1/1     Running             1          19h   10.233.96.63     172.16.4.150   <none>
stability-cluster2-tikv-1                      1/1     Unknown             0          19h   10.233.91.108    172.16.4.154   <none>
stability-cluster2-tikv-2                      1/1     Running             2          19h   10.233.66.142    172.16.4.153   <none>
stability-cluster2-tikv-3                      1/1     Running             0          18h   10.233.104.56    172.16.4.152   <none>
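
Since the FailedMount event above names only the pd volume as unmounted, a natural next step is to inspect the PersistentVolumeClaim named in the pod spec and whatever PV it should bind to. A minimal diagnostic sketch (the claim name comes from the describe output above; the "local-storage" StorageClass name is an assumption about this cluster):

# Is the claim Bound, or still Pending because no matching PV exists?
kubectl get pvc -n stability-cluster2 pd-stability-cluster2-pd-3

# The claim's events usually say why binding or attaching failed
kubectl describe pvc -n stability-cluster2 pd-stability-cluster2-pd-3

# List PVs of the assumed local StorageClass; a local volume that the
# provisioner failed to rediscover would be missing or Released here
kubectl get pv | grep local-storage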

The local-volume-provisioner logs; the provisioner refuses to create a PV for a discovery path that is not a real mount point, so the PV the pending claim needs is never created:

E0521 02:45:17.339023       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:45:27.340796       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:45:37.343010       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:45:47.344908       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:45:57.346534       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:46:07.348460       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:46:17.350197       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:46:27.353568       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:46:37.357148       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:46:47.360995       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:46:57.363059       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:47:07.365940       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:47:17.367875       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:47:27.369523       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:47:37.371369       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:47:47.373154       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:47:57.376735       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint
E0521 02:48:07.379129       1 discovery.go:246] Path "/mnt/disks/local-pv010" is not an actual mountpoint

@cofyc PTAL

What did you expect to see?

What did you see instead?

weekface commented 5 years ago

This may be a kube-scheduler race condition, but we can't confirm it yet. We have changed the kube-scheduler log level to 4; if it happens again, we can get more detailed log information.
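
For reference, raising the verbosity on a kubeadm-style control plane means adding --v=4 to the scheduler's static pod manifest; the manifest path and pod label below are the kubeadm defaults and are an assumption about this cluster's setup:

# On each control-plane node, add "- --v=4" under the scheduler's
# command: list; the kubelet restarts the static pod automatically
vi /etc/kubernetes/manifests/kube-scheduler.yaml

# Confirm the flag took effect (component label is the kubeadm default)
kubectl -n kube-system get pod -l component=kube-scheduler -o yaml | grep -e '--v=4'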

xiaojingchen commented 5 years ago

I don't think this has anything to do with the scheduler; something is wrong with the mount point. We need to check the mount point info.

shonge commented 5 years ago

Could you check whether the local-pv010 mountpoint exists?

  1. mountpoint /mnt/disks/local-pv010
  2. mount -l | grep "/mnt/disks/local-pv010"

Either command validates whether the path is an actual mountpoint.
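
For reference, this is roughly what the healthy and broken cases look like with util-linux mountpoint (the backing device /dev/sdb1 is hypothetical):

# healthy: the command exits 0 and reports a mountpoint
mountpoint /mnt/disks/local-pv010
/mnt/disks/local-pv010 is a mountpoint

# broken: the command exits non-zero, matching the provisioner error
mountpoint /mnt/disks/local-pv010
/mnt/disks/local-pv010 is not a mountpoint

# if the mount exists, mount -l also shows the backing device
mount -l | grep "/mnt/disks/local-pv010"
/dev/sdb1 on /mnt/disks/local-pv010 type ext4 (rw,relatime)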

weekface commented 5 years ago

@shonge thank you for your help.

weekface commented 5 years ago

/mnt/disks/local-pv010 is not an actual mountpoint
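
Once the path is confirmed not to be mounted, the usual remediation is to restore the mount and make it persistent. A sketch, assuming the directory was originally a bind mount of itself, as in the common local-static-provisioner pattern for carving local PVs out of one filesystem; adjust to the node's actual device layout:

# Re-create the bind mount the provisioner expects (assumed layout)
mount --bind /mnt/disks/local-pv010 /mnt/disks/local-pv010

# Persist it across reboots
echo "/mnt/disks/local-pv010 /mnt/disks/local-pv010 none bind 0 0" >> /etc/fstab

Once the path is a mount point again, the provisioner's discovery loop should recreate the PV, and the pending PVC can bind.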