kubernetes-sigs / sig-storage-local-static-provisioner

Static provisioner of local volumes
Apache License 2.0

Rotating cloud instances with PVCs in a StatefulSet #181

Open joekohlsdorf opened 4 years ago

joekohlsdorf commented 4 years ago

Online you can find a bunch of examples (even in the official docs) which show how to use the local-volume-provisioner in combination with PersistentVolumeClaims in a StatefulSet.

Everything works fine until a node goes away and your cloud provider brings up a new one, whether due to an issue on their side or because you are replacing nodes while upgrading K8s.

What happens in this case is that the PVC stays bound to a PV which no longer exists. This prevents the pod in the StatefulSet from coming up until you manually delete the PVC. This makes sense, because there is no way of knowing whether the node was shut down for maintenance and will come back later or whether it is gone forever.

However, I'd just like the node to be assumed dead, because I'm never going to reboot nodes intentionally; I'll just roll the cluster. If the pod can be scheduled on another node I know 100% that the node was replaced (due to my affinity settings). Is there any official way of dealing with this or any config option I'm overlooking?

I can write a job which takes care of this but surely others must have hit this issue?!

nerddelphi commented 4 years ago

@joekohlsdorf I guess we are facing the same issue: https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/issues/65#issuecomment-611887301

How do you plan to solve that?

cofyc commented 4 years ago

There is no Kubernetes-official way right now because Kubernetes will not unbind or delete PVCs. It's up to the users to recover from this situation. I have a plan to write a cloud controller to handle this automatically.

When a new node with a different name is created to replace the old node (e.g. an auto-scaling group in AWS), the PVs belonging to the old node are invalid. The PVCs must be deleted; then the scheduler can find feasible PVs on other nodes. By the way, if pods have already been recreated on node deletion and are stuck at Pending, they must be deleted and recreated to trigger the StatefulSet to create the PVCs again.
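
For illustration, here is a minimal sketch of those recovery steps with client-go; the namespace, PVC name, and pod name are hypothetical placeholders for one StatefulSet replica, so adapt them to your own objects:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// 1. Delete the PVC that is still bound to the PV of the node that is gone
	//    (it may stay in Terminating until the pod is deleted in the next step).
	if err := clientset.CoreV1().PersistentVolumeClaims("default").
		Delete(ctx, "data-myapp-0", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}

	// 2. Delete the stuck Pending pod so the StatefulSet controller recreates it
	//    together with a fresh PVC, and the scheduler can pick a live node.
	if err := clientset.CoreV1().Pods("default").
		Delete(ctx, "myapp-0", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
}
```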

In GKE, this is a little different because the managed instance group recreates the underlying instance but uses the old node name.

joekohlsdorf commented 4 years ago

What I did was write a janitor that runs every 20 seconds, looks for pending pods whose PVCs are bound to PVs on dead hosts, and removes the PVC if necessary. It then deletes the pending pod so it gets scheduled again.

My nodes for this service are static and have labels; this way I can be sure that the host isn't just rebooting. I know that my service runs on X nodes, so if I see X nodes online and a PV on a node that doesn't exist, I know it's dead and not coming back.
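
Roughly, such a janitor looks like the sketch below (illustrative only, not my actual script): it relies on the kubernetes.io/hostname node affinity that local-volume-provisioner writes on each PV to decide whether the backing node still exists.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	for {
		sweep(context.Background(), cs)
		time.Sleep(20 * time.Second)
	}
}

// sweep finds Pending pods whose PVCs are bound to local PVs on nodes that no
// longer exist, then deletes the PVC and the pod so the StatefulSet recreates both.
func sweep(ctx context.Context, cs kubernetes.Interface) {
	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{FieldSelector: "status.phase=Pending"})
	if err != nil {
		return
	}
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim == nil {
				continue
			}
			pvcName := vol.PersistentVolumeClaim.ClaimName
			pvc, err := cs.CoreV1().PersistentVolumeClaims(pod.Namespace).Get(ctx, pvcName, metav1.GetOptions{})
			if err != nil || pvc.Spec.VolumeName == "" {
				continue // PVC missing or not bound to a PV yet
			}
			pv, err := cs.CoreV1().PersistentVolumes().Get(ctx, pvc.Spec.VolumeName, metav1.GetOptions{})
			if err != nil {
				continue
			}
			node := hostnameOf(pv)
			if node == "" {
				continue // not a local PV with hostname affinity
			}
			if _, err := cs.CoreV1().Nodes().Get(ctx, node, metav1.GetOptions{}); !apierrors.IsNotFound(err) {
				continue // node still exists (or lookup failed), nothing to do
			}
			// Node is gone: remove the stale PVC, then the pod, so the
			// StatefulSet controller recreates both on a live node.
			_ = cs.CoreV1().PersistentVolumeClaims(pod.Namespace).Delete(ctx, pvcName, metav1.DeleteOptions{})
			_ = cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
			fmt.Printf("recycled %s/%s (PV %s on missing node %s)\n", pod.Namespace, pod.Name, pv.Name, node)
			break
		}
	}
}

// hostnameOf returns the kubernetes.io/hostname value from the PV's required node affinity.
func hostnameOf(pv *corev1.PersistentVolume) string {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return ""
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key == "kubernetes.io/hostname" && len(expr.Values) > 0 {
				return expr.Values[0]
			}
		}
	}
	return ""
}
```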

If this doesn't happen on GKE maybe some workaround could be found with custom node tags. You could have an ASG for every node so tags would stay the same even if a node dies.

nerddelphi commented 4 years ago

Thanks for your answers, @cofyc and @joekohlsdorf.

@joekohlsdorf could you share your janitor with us? I'd be glad if you could.

msau42 commented 4 years ago

@NickrenREN may have written a similar controller in the past

joekohlsdorf commented 4 years ago

Well, I certainly would strongly advise against doing what I did, but here is the unedited janitor I hacked up. Please only use it as a reference; I had to get this done in a time crunch. https://gist.github.com/joekohlsdorf/2658f03b1e1b6194ebe6b61bd8770647

nerddelphi commented 4 years ago

Hi, @joekohlsdorf. Thank you for the script.

NickrenREN commented 4 years ago

There is an issue similar to this one. Some folks and I proposed introducing NodeFencing to solve it, because it suits both cloud providers and bare metal and the reaction is relatively simple. But others decided to take the NodeShutdown taint approach; there is an ongoing proposal: https://github.com/kubernetes/enhancements/pull/1116.

Actually, we have implemented a NodeFencing feature (external controller and agent) in our own production environment.

nerddelphi commented 4 years ago

@NickrenREN Are you using that implementation, https://github.com/kvaps/kube-fencing ? If so, what kind of agent do you use to deal with PV/PVC issues? My clusters are on GKE.

Thank you.

NickrenREN commented 4 years ago

@nerddelphi No, we implemented our fencing controller and agent ourselves. The agent is designed to shut down machines forcefully; the control logic, race conditions, and cleanup work are handled by the controller.

NickrenREN commented 4 years ago

The design above is for bare metal; for cloud providers it may be a little different.

rsoika commented 4 years ago

I am sorry to enter this discussion even though I am not a Kubernetes expert like you. But I have been dealing with this problem for some weeks and have also followed this long-running discussion.

I am running a simple Kubernetes cluster with only a few nodes. I guess this is a completely different environment from the ones you discuss here, but let me describe my scenario to give you a different view of the problem:

I understand all your concerns about the data and what can happen to it if a volume is automatically detached. But I - as the administrator of my cluster - trust in my Longhorn or Ceph Cluster. And of course, something can always go wrong, but that's my job to secure my data.

From my point of view, it is not Kubernetes' job to interfere in my data management. PLEASE give us a switch to turn off this behavior so that terminating pods get detached from their volumes.

NickrenREN commented 4 years ago

@rsoika Thanks for your input. IIUIC, your scenario is the case NodeFencing can solve. If the node is dead (or Unknown), it will be forced to shut down and we do not expect it to come back again. As you described, data management isn't Kubernetes' job, so the reaction is easy: go ahead and detach the volume forcefully. And of course, if you want to bring your node back, you need to do the cleanup work first (this is also work for the relevant Kubernetes teams).

rsoika commented 4 years ago

@NickrenREN Thanks for your clarification. So there is no self-healing mechanism in Kubernetes for this scenario?

NickrenREN commented 4 years ago

@rsoika For now, yes

NickrenREN commented 4 years ago

@rsoika Since the progress of the "Node Shutdown Taint" feature is slow, we are considering creating a new proposal and projects to open-source a "NodeFencing" solution. It could be another option.

nerddelphi commented 4 years ago

@joekohlsdorf Hi.

Are your PVs (the ones bound to the deleted PVCs) deleted as well? In my cluster (GKE) they stay in the Released status, even after their PVCs are deleted by the janitor and my StorageClass reclaim policy is Delete.

Are you experiencing that behavior?

I guess I won't be billed for a non-existent local SSD, but I should still find a way to delete these Released PVs as well.

@cofyc @NickrenREN Is that behavior normal/expected? Shouldn't the previous PVs be deleted automatically once their PVCs don't exist anymore?

Thanks.

cofyc commented 4 years ago

If nodes which these PVs belong to do not exist anymore, you need to delete these PVs manually because no local-volume-provisioner can run on these nodes and recycle them.
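
For example, here is a hedged sketch of that manual cleanup with client-go; it assumes the PVs were created by local-volume-provisioner and therefore carry a kubernetes.io/hostname node affinity, and you should verify each candidate before deleting anything in a real cluster:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	pvs, err := cs.CoreV1().PersistentVolumes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pv := range pvs.Items {
		if pv.Status.Phase != corev1.VolumeReleased {
			continue // only consider PVs left behind after their PVC was deleted
		}
		node := nodeOf(pv)
		if node == "" {
			continue // not a local PV with hostname affinity
		}
		if _, err := cs.CoreV1().Nodes().Get(ctx, node, metav1.GetOptions{}); !apierrors.IsNotFound(err) {
			continue // node still exists (or lookup failed); let the provisioner recycle it
		}
		// Node is gone, so no provisioner will ever reclaim this PV: delete it.
		if err := cs.CoreV1().PersistentVolumes().Delete(ctx, pv.Name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
		fmt.Printf("deleted orphaned PV %s (node %s is gone)\n", pv.Name, node)
	}
}

// nodeOf returns the kubernetes.io/hostname value from the PV's required node affinity.
func nodeOf(pv corev1.PersistentVolume) string {
	if pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
		return ""
	}
	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		for _, expr := range term.MatchExpressions {
			if expr.Key == "kubernetes.io/hostname" && len(expr.Values) > 0 {
				return expr.Values[0]
			}
		}
	}
	return ""
}
```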

NickrenREN commented 4 years ago

@nerddelphi For now, the k8s controller will just send delete events (setting the deletion timestamp), and as @cofyc said, the drivers (or kubelet) on the broken node are down too, so they won't do the cleanup work. But with the NodeFencing feature, these PVs could be released automatically (forcefully).

rsoika commented 4 years ago

Is the NodeFencing feature officially planned, or is it still only under discussion? I found these projects that seem to address the problem:

https://github.com/kvaps/kube-fencing https://github.com/rootfs/node-fencing

NickrenREN commented 4 years ago

IIRC, NodeFencing was discussed before but we didn't reach an agreement 😓

nerddelphi commented 4 years ago

Ok, guys. Thank you!

rsoika commented 4 years ago

@NickrenREN Can you share the discussion about the NodeFencing feature? I would like to better understand the background.

NickrenREN commented 4 years ago

It was originally discussed here: https://github.com/kubernetes/kubernetes/issues/65392 We also discussed it several times offline on Slack.

There are also several KEPs, but they didn't get merged: https://github.com/kubernetes/community/pull/2763 https://github.com/kubernetes/community/pull/1416

We didn't reach an agreement, and if needed, I'd like to reopen the discussion.

rsoika commented 4 years ago

I can't believe that this is true. I invested so much time migrating from Docker Swarm to Kubernetes, and now I have learned that Kubernetes is not the self-healing system it is promoted as everywhere. I think I understand the discussion and the concerns about the pros and cons very well, but I am personally not at a level where I can discuss this in the referenced groups.

It is absolutely strange: it makes no sense to set up a Ceph cluster and connect it to my Kubernetes cluster because of this limitation. I am running a small environment with about 100 pods on 5 virtual nodes hosted by my cloud provider (Hetzner). I can be sure that if my cloud provider has a problem in one of its data centers (which are spread across different locations in Germany), my applications running on that node will be stuck in a terminating state. My customers will call me because they can no longer work. I will have to figure out all the affected VolumeAttachments and delete them manually. This is of course no solution; we are a small company with no 24x7 admin team.

My only hope now is that the Longhorn team will solve this issue in their storage solution without help from the Kubernetes framework.

I can't believe that Kubernetes is only focusing on stateless services. I am not only talking about databases like Postgres but also about services like Apache Solr for full-text search indexes or the spaCy project for ML services. All these services ultimately need a data volume. If you see a way to re-energize this discussion, I would like to support you.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

cofyc commented 4 years ago

/remove-lifecycle stale /lifecycle fronzen

cofyc commented 4 years ago

/lifecycle frozen

oomichi commented 3 years ago

/cc @oomichi

eduardobr commented 2 years ago

Does it seem like Azure Kubernetes Service implemented a solution to this in their own Container Storage Interface (CSI) driver? https://azure.microsoft.com/da-dk/updates/public-preview-azure-disk-csi-driver-v2-in-aks/

https://github.com/kubernetes-sigs/azuredisk-csi-driver/tree/main_v2