Closed by gregwebs 5 years ago
IIRC, this is the intended behavior for local PV. To re-schedule the pod, the controller would have to delete the PVC that is bound to it, which causes data loss. So we rely on auto-failover to create new pods and meet the desired number of replicas.
@tennix @weekface can explain this better.
In the cloud, where failure is normal, the current behavior isn't really usable. I will try setting `autoFailover` then.
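For anyone following along, enabling auto-failover is done on the operator itself. A sketch, assuming the tidb-operator Helm chart exposes the value under `controllerManager.autoFailover` (the exact key may differ by chart version):

```shell
# Hypothetical: enable auto-failover on the operator via Helm.
# Release and chart names here match the common install docs; verify the
# value key against your chart version's values.yaml before running.
helm upgrade tidb-operator pingcap/tidb-operator \
  --set controllerManager.autoFailover=true
```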
`autoFailover=true` does start up a new pod! It didn't delete the existing unscheduled pod. After deleting the PVC for the unschedulable pod as you suggested, the previously unschedulable pod is scheduled, but it crashes with a `duplicated store address` error.
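For reference, deleting the PVC that a stuck pod is bound to looks like this (names are illustrative, taken from the report at the bottom of this thread; substitute your own pod and namespace):

```shell
# StatefulSet PVCs are named <volumeClaimTemplate>-<pod>; list them to find
# the one bound to the stuck pod.
kubectl get pvc -n tidb60001

# Delete the PVC so the pod can be re-scheduled with a new volume.
# WARNING: this discards whatever data was on the old local PV.
kubectl delete pvc tikv-demo-tikv-2 -n tidb60001
```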
It seems there are no docs on this mode of operation, and I can't figure out how to get rid of the failing pod.
What is the content of `/etc/fstab` on the failed node?
It seems the same with: https://github.com/pingcap/tidb-operator/issues/385
In this case the PVC is deleted manually and the pod is scheduled to another node, binding a new (empty) PV, so we lose the data by nature. It seems we currently cannot re-schedule a pod while its local PV is unavailable (for example, after node loss).
Yes, you can't re-schedule the failed TiKV pod before its store becomes Tombstone.
With `autoFailover=true` set the whole time, I am now seeing the pod get re-scheduled as it should, without any new pods being created. However, when the pod is re-scheduled onto the new node, it fails with the `duplicated store address` error.
2019/04/17 23:11:08.233 ERRO util.rs:336: fail to request: Grpc(RpcFailure(RpcStatus { status: Unknown, details: Some("duplicated store address: id:64 address:\"demo-tikv-1.demo-tikv-peer.tidb120002.svc:20160\" version:\"2.1.3\" , already registered by id:4 address:\"demo-tikv-1.demo-tikv-peer.tidb120002.svc:20160\" ...
To reproduce this, just run `gcloud compute instances list` and then `gcloud compute instances delete <tikv-node>`.
OK. I've figured it out.
Suppose the cluster has tikv-0, tikv-1, and tikv-2. When you delete a node, for example the tikv-2 node, a new pod tikv-3 will be created and will run correctly if auto-failover is enabled.
The tikv-2 pod will be in the Unknown state, and k8s keeps the pod's meta info in etcd until the kubelet comes back and automatically deletes it, or you delete it forcibly via kubectl.
However, when you delete it forcibly via kubectl, a new tikv-2 pod will be created automatically by the statefulset controller. That pod will be scheduled to the deleted node, since the PV and PVC still exist and are bound to that node.
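Force-deleting a pod stuck in the Unknown state looks like this (a sketch; pod and namespace names are illustrative):

```shell
# The node is gone, so the kubelet can never confirm termination;
# --force --grace-period=0 removes the pod object from etcd immediately,
# after which the statefulset controller recreates tikv-2.
kubectl delete pod demo-tikv-2 -n tidb60001 --force --grace-period=0
```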
After you followed @aylei's advice to delete the PVC via kubectl, the tikv-2 pod will still be unschedulable (`Pending`), because the PVC will not be recreated automatically: the statefulset controller always creates the corresponding PVC before the pod is created, and does not ensure the PVC exists if the pod already exists.
So to get the tikv-2 pod scheduled again, you have to delete the tikv-2 pod once more, which lets the statefulset controller create the PVC first and then recreate tikv-2.
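The sequence described above, sketched with illustrative pod/namespace names:

```shell
# 1. Delete the PVC bound to the lost local PV (its data is already gone).
kubectl delete pvc tikv-demo-tikv-2 -n tidb60001

# 2. Delete the Pending tikv-2 pod; the statefulset controller then
#    recreates the PVC and the pod together, so the pod can bind a
#    fresh PV on a healthy node.
kubectl delete pod demo-tikv-2 -n tidb60001
```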
I believe the above steps are what you did. The duplicated store address problem occurs because the tikv-2 address is still registered in PD under the failed store id, and PD refuses to re-register the same address with another store id. So there is one more step: mark the old store as tombstone (take the store offline using pd-ctl or the REST API), which makes the tikv-2 address reusable for a new store id.
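Taking the old store offline might look like this (a sketch; the store id 4 comes from the error message earlier in the thread, and the PD address is illustrative):

```shell
# Inspect the stores; the failed one has the address from the error message.
pd-ctl -u http://demo-pd:2379 store

# Ask PD to offline store 4. Once its regions have been migrated away it
# becomes Tombstone, and the address can be reused by a new store id.
pd-ctl -u http://demo-pd:2379 store delete 4

# Equivalent call via PD's REST API:
curl -X DELETE http://demo-pd:2379/pd/api/v1/store/4
```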
We did all of the above steps automatically in a previous version of tidb-operator, but real situations are complicated and we cannot be sure what actually happened to the failed TiKV. Docker, kubelet, or even network failures can make the pod status not reflect the real TiKV status, so automatically offlining the store and removing all of its underlying data is dangerous. For now we make the conservative choice; we can add more automatic operations in the future once we have confidence in these specific scenarios.
Yes, I am now trying to figure out how to make things work with the autoFailover setting on. I guess I don't know what autoFailover is supposed to do. It already re-schedules the pod with new storage. By doing this while retaining the old store id, it seems to create a broken situation rather than actually avoid a dangerous one.
Behavior that would make more sense:
What do you think?
@gregwebs It's not in a broken situation. With auto failover, the tidb-operator adds a new tikv pod and the service is not impacted. The really bothersome part is that a pending pod is left behind which requires manual intervention. It helps that the DBA doesn't have to handle it immediately.
The newly created pod already gets a new store address and new storage, namely tikv-3 in the above example. We cannot determine and choose to automatically make tikv-2 magically work again; that part requires manual intervention. There is no safe solution right now.
This issue is now confusing because the scenario has changed since I started consistently setting `autoFailover=true`. I opened a new issue, #408, that describes only the behavior with autoFailover always on.
tidb-operator version: (v1.0.0-beta.1-p2)
I have encountered a TiKV node that is down and is not getting re-scheduled.
kubectl describe pod -n tidb60001 demo-tikv-2
kubectl describe pvc -n tidb60001 tikv-demo-tikv-2
kubectl describe pv local-pv-c042582b
The node gke-beta-tidb-n1-standard-4-375-33507f39-5mjk exists and has a local SSD. I can confirm that this node was created more recently than the other two, meaning that TiKV went down due to a node failure and was then not re-scheduled.