pingcap / tidb-operator

TiDB operator creates and manages TiDB clusters running in Kubernetes.
https://docs.pingcap.com/tidb-in-kubernetes/
Apache License 2.0

TiKV node loss leaves the TiKV pod in Pending status #402

Closed: gregwebs closed this issue 5 years ago

gregwebs commented 5 years ago

tidb-operator version: v1.0.0-beta.1-p2

I have encountered a TiKV pod whose node is down and which is not getting re-scheduled.

kubectl describe pod -n tidb60001 demo-tikv-2

Events:
  Type     Reason            Age                    From            Message
  ----     ------            ----                   ----            -------
  Warning  FailedScheduling  5m8s (x1856 over 76m)  tidb-scheduler  0/12 nodes are available: 1 node(s) had volume node affinity conflict, 11 Insufficient cpu, 9 Insufficient memory.
  Warning  FailedScheduling  8s (x71421 over 85m)   tidb-scheduler  0/11 nodes are available: 1 node(s) had volume node affinity conflict, 10 Insufficient cpu, 8 Insufficient memory.

kubectl describe pvc -n tidb60001 tikv-demo-tikv-2

Name:          tikv-demo-tikv-2
Namespace:     tidb60001
StorageClass:  local-storage
Status:        Bound
Volume:        local-pv-c042582b
Labels:        app.kubernetes.io/component=tikv
...
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               tidb.pingcap.com/pod-name: demo-tikv-2
               tidb.pingcap.com/pod-scheduling: 2019-04-16T16:59:45Z
Capacity:      368Gi
Access Modes:  RWO
Events:        <none>
Mounted By:    demo-tikv-2

kubectl describe pv local-pv-c042582b

Name:              local-pv-c042582b
Labels:            app.kubernetes.io/component=tikv
                   app.kubernetes.io/instance=tidb60001
                   app.kubernetes.io/managed-by=tidb-operator
                   app.kubernetes.io/name=tidb-cluster
                   app.kubernetes.io/namespace=tidb60001
                   tidb.pingcap.com/cluster-id=6680536698049175312
                   tidb.pingcap.com/store-id=6
Annotations:       pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: local-volume-provisioner-gke-beta-tidb-n1-standard-4-375-33507f39-rmkn-c233be3c-6068-11e9-9941-4201ac1f400a
                   tidb.pingcap.com/pod-name: demo-tikv-2
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      local-storage
Status:            Bound
Claim:             tidb60001/tikv-demo-tikv-2
Reclaim Policy:    Retain
Access Modes:      RWO
Capacity:          368Gi
Node Affinity:
  Required Terms:
    Term 0:        kubernetes.io/hostname in [gke-beta-tidb-n1-standard-4-375-33507f39-rmkn]
Message:
Source:
    Type:  LocalVolume (a persistent volume backed by local storage on a node)
    Path:  /mnt/disks/ssd0
Events:    <none>

The node gke-beta-tidb-n1-standard-4-375-33507f39-5mjk exists and has a local SSD. I can confirm that this node was created more recently than the other two, meaning that the TiKV pod went down and was not re-scheduled because of a node failure.
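
One way to confirm the conflict is to compare the node name pinned in the PV's node affinity against the nodes that actually exist (a minimal sketch; the jsonpath assumes the usual local-PV nodeAffinity layout shown in the describe output above):

# List the nodes currently in the cluster
kubectl get nodes

# Show which node the local PV is pinned to via its node affinity
kubectl get pv local-pv-c042582b \
  -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}'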

aylei commented 5 years ago

IIRC, this is the intended behavior for local PVs. To re-schedule, the controller would have to delete the PVC that is bound to the pod, which causes data loss. So we rely on auto-failover to create new pods and meet the desired number of available replicas in the spec.

@tennix @weekface can explain this better.

gregwebs commented 5 years ago

In the cloud, where failure is normal, the current behavior isn't really usable. I will try setting autoFailover then.
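
For reference, enabling it would look roughly like this, assuming the operator was installed from the pingcap/tidb-operator Helm chart under the release name tidb-operator; the exact value path may differ between chart versions, so check the chart's values.yaml:

# Enable auto-failover in tidb-operator (value path assumed; verify against your chart version)
helm upgrade tidb-operator pingcap/tidb-operator \
  --set controllerManager.autoFailover=true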

gregwebs commented 5 years ago

autoFailover=true does start up a new pod! It didn't delete the existing unscheduled pod. After deleting the PVC for the unschedulable pod as you suggested, the previously unschedulable pod is scheduled but crashes with a "duplicated store address" error. It seems there are no docs on this mode of operation, and I can't figure out how to get rid of the failing pod.
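
In kubectl terms the step above looks roughly like this (a sketch using this thread's names, not an exact transcript):

# Delete the PVC of the stuck pod (this discards the local data on the lost node)
kubectl delete pvc tikv-demo-tikv-2 -n tidb60001

# Once the pod is rescheduled, inspect why it is crashing
kubectl logs demo-tikv-2 -n tidb60001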

weekface commented 5 years ago

What is the content of /etc/fstab on the failed node?

weekface commented 5 years ago

It seems the same as https://github.com/pingcap/tidb-operator/issues/385

aylei commented 5 years ago

It seems the same as #385.

In this case the PVC was deleted manually and the pod was scheduled to another node, binding a new (empty) PV, so the data is lost by nature. It seems we currently cannot re-schedule a pod while its local PV is unavailable (for example, after node loss).

weekface commented 5 years ago

Yes, you can't re-schedule the failed TiKV pod before its store becomes Tombstone in PD.
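
To check whether that has happened, the store states can be listed through the PD HTTP API (a sketch; the PD address demo-pd.tidb60001.svc:2379 is assumed from the release and namespace names in this thread, and pd-ctl works as well):

# List all stores and their states (Up / Offline / Tombstone)
curl http://demo-pd.tidb60001.svc:2379/pd/api/v1/stores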

gregwebs commented 5 years ago

With autoFailover=true set the whole time, I am now seeing the pod getting re-scheduled as it should, without any new pods being created. However, when the pod is re-scheduled onto the new node it fails with the duplicated store address error.

2019/04/17 23:11:08.233 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("duplicated store address: id:64 address:\"demo-tikv-1.demo-tikv-peer.tidb120002.svc:20160\" version:\"2.1.3\" , already registered by id:4 address:\"demo-tikv-1.demo-tikv-peer.tidb120002.svc:20160\" ...

gregwebs commented 5 years ago

To reproduce this, just run gcloud compute instances list and then gcloud compute instances delete <tikv-node>.

tennix commented 5 years ago

OK. I've figured it out.

Suppose the cluster has tikv-0, tikv-1, and tikv-2. When you delete a node, for example the tikv-2 node, a new pod tikv-3 will be created and will run correctly if auto-failover is enabled.

The tikv-2 pod will be in the Unknown state, and Kubernetes keeps the pod metadata in etcd until the kubelet comes back and automatically deletes it, or until you delete it forcibly via kubectl.

However, when you delete it forcibly via kubectl, a new tikv-2 pod will be created automatically by the StatefulSet controller. That pod will still be scheduled to the deleted node, since the PV and PVC still exist and are bound to that node.

After you follow @aylei's advice and delete the PVC via kubectl, the tikv-2 pod will still be unschedulable (Pending), because the PVC will not be recreated automatically: the StatefulSet controller only creates the corresponding PVC before the pod is created, and does not ensure the PVC exists if the pod already exists.

So to get the tikv-2 pod scheduled again, you have to delete the tikv-2 pod once more, so that the StatefulSet controller recreates the PVC and then the tikv-2 pod.
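
Put together, the manual recovery described above looks roughly like this (a sketch using this thread's names; note that deleting the PVC discards whatever data was on the lost node):

# 1. Force-delete the stuck tikv-2 pod that Kubernetes still tracks for the lost node
kubectl delete pod demo-tikv-2 -n tidb60001 --grace-period=0 --force

# 2. Delete the PVC that pins tikv-2 to the lost node's local volume
kubectl delete pvc tikv-demo-tikv-2 -n tidb60001

# 3. Delete the recreated (still Pending) tikv-2 pod so the StatefulSet controller
#    recreates both the PVC and the pod, which can now schedule onto a healthy node
kubectl delete pod demo-tikv-2 -n tidb60001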

I believe the above steps are what you did. The duplicated store address problem occurs because the tikv-2 address is still registered in PD under the failed store id, and PD refuses to re-register that address with another store id. So one more step is needed: mark the failed store as Tombstone (offline the store using pd-ctl or the RESTful API), which makes the tikv-2 address reusable for a new store id.
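
Via the RESTful API that step would look roughly like this (a sketch; the PD address is assumed as before, and the failed store id has to be looked up first, for example from the stores listing or from the tidb.pingcap.com/store-id label on the old PV):

# Ask PD to offline the failed store; PD moves it to Tombstone once its
# regions have been migrated away
curl -X DELETE http://demo-pd.tidb60001.svc:2379/pd/api/v1/store/<failed-store-id>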

We did all of the above steps automatically in a previous version of tidb-operator, but real-world failures are complicated and we cannot be sure what has actually happened to the failed TiKV. Docker, the kubelet, or even the network can fail in ways that leave the pod status not reflecting the real TiKV status, so offlining the store and removing all the underlying data automatically is dangerous. We make the conservative choice for now; we can add more automatic operations in the future once we have confidence in these specific scenarios.

gregwebs commented 5 years ago

Yes, I am now trying to figure out how to make things work with the autoFailover setting on. I guess I don't know what autoFailover is supposed to do. It already re-schedules the pod with new storage. By doing this while retaining the old store id, it seems to be creating a broken situation rather than actually avoiding dangerous ones.

Behavior that would make more sense:

What do you think?

tennix commented 5 years ago

@gregwebs It's not a broken situation. With auto failover, tidb-operator adds a new TiKV pod and the service is not impacted. The only bothersome part is that a Pending pod is left behind, which requires manual intervention; the upside is that the DBA doesn't have to handle it immediately.

The newly created pod already gets a new store address and new storage, namely tikv-3 in the above example. We cannot determine what happened and automatically make tikv-2 magically work again; this part requires manual intervention. There is no safe solution right now.

gregwebs commented 5 years ago

This issue is now confusing because the scenario has changed since I started consistently setting autoFailover=true. I opened a new issue #408 that only describes the issue with autoFailover always on.