hetznercloud / csi-driver

Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes
MIT License

volume not being reattached to healthy node when initial node shutdown #720

Open evgenii-avdiukhin opened 2 months ago

evgenii-avdiukhin commented 2 months ago

TL;DR

I configured csi-driver and deployed a Jenkins StatefulSet to test it. The volume was automatically created and attached to worker-1, and the Jenkins pod was scheduled on the same node. I then wanted to test how reattachment works, so I shut down the worker-1 Hetzner VM, but nothing happened: the volume is not being reattached. Since tolerations are configured, the Jenkins pod goes into Terminating and then tries to schedule on the node that has the PVC, but it can't because the PVC is still on the dead node. What am I doing wrong? Or is this behaviour not supported by csi-driver?

Expected behavior

The Hetzner volume is moved to a healthy node and the pod is scheduled successfully.

Observed behavior

volume is not being reattached

Minimal working example

No response

Log output

No response

Additional information

No response

mpepping commented 2 months ago

By design, StatefulSet pods do not get rescheduled to a new node when the original node becomes unavailable. This is because Kubernetes cannot distinguish between a deliberate shutdown and a network partition, so it marks the pods on the down node as Unknown rather than deleting them. That is what you see when you power off or shut down a node. In the case of a StatefulSet, this requires manual rescheduling.
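As a rough illustration of the manual rescheduling mentioned above, the sketch below builds the kubectl command that force-deletes the stuck pod so the StatefulSet controller can recreate it elsewhere. It assumes kubectl is already configured for the cluster; the pod name `jenkins-0` and namespace `jenkins` are taken from the events later in this thread, and the helper names `force_delete_pod_cmd` and `run` are made up for this example. Force deletion removes the pod object without confirmation from the kubelet, so it is only appropriate when the node is known to be powered off, and the volume may still need some time to detach from the old node.

```python
import subprocess


def force_delete_pod_cmd(pod: str, namespace: str) -> list[str]:
    """Build the kubectl command that force-deletes a pod stuck on a dead node."""
    return [
        "kubectl", "delete", "pod", pod,
        "--namespace", namespace,
        "--force", "--grace-period=0",
    ]


def run(cmd: list[str], dry_run: bool = True) -> str:
    """Print the command line; execute it only when dry_run is False."""
    line = " ".join(cmd)
    print(line)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return line


if __name__ == "__main__":
    # Dry run by default: nothing is sent to the cluster.
    run(force_delete_pod_cmd("jenkins-0", "jenkins"))
```

Keeping `dry_run=True` as the default lets you review the exact command before it touches the cluster.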

However, if you drain or delete the node running the Jenkins pod, it all works as you would expect. The behavior is most responsive when draining or deleting nodes. You may see some 'exclusively attached' (Multi-Attach) warning events on the workload, but all in all the PVC re-attaches in a reasonable time:

Normal   Scheduled               22s   default-scheduler        Successfully assigned jenkins/jenkins-0 to dev-pool-small-static-worker2
Warning  FailedAttachVolume      23s   attachdetach-controller  Multi-Attach error for volume "pvc-8b1a23a1-cc85-4b09-9231-2c963885e366" Volume is already exclusively attached to one node and can't be attached to another 
Normal   SuccessfulAttachVolume  0s    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-8b1a23a1-cc85-4b09-9231-2c963885e366"
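The drain-or-delete approach described above could be scripted along these lines. This is a minimal sketch, not the driver's recommended procedure: the node name `worker-1` comes from the original report, the function names are hypothetical, and `--ignore-daemonsets` is assumed to be needed because per-node components such as the CSI node plugin typically run as a DaemonSet. By default the script only prints the commands.

```python
import subprocess


def drain_node_cmd(node: str) -> list[str]:
    """kubectl command that evicts the pods from a node before maintenance."""
    return [
        "kubectl", "drain", node,
        "--ignore-daemonsets",
        "--delete-emptydir-data",
    ]


def delete_node_cmd(node: str) -> list[str]:
    """kubectl command that removes the node object from the cluster."""
    return ["kubectl", "delete", "node", node]


def list_volume_attachments_cmd() -> list[str]:
    """kubectl command that shows which node each volume is attached to."""
    return ["kubectl", "get", "volumeattachments"]


def plan(node: str, delete: bool = False) -> list[list[str]]:
    """Return the commands to run, in order, for moving workloads off a node."""
    steps = [drain_node_cmd(node)]
    if delete:
        steps.append(delete_node_cmd(node))
    steps.append(list_volume_attachments_cmd())
    return steps


if __name__ == "__main__":
    execute = False  # set to True to actually run the commands
    for cmd in plan("worker-1"):
        print(" ".join(cmd))
        if execute:
            subprocess.run(cmd, check=True)
```

Listing the VolumeAttachment objects at the end is one way to confirm that the volume has moved to the node where the pod was rescheduled.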