piraeusdatastore / linstor-csi

CSI plugin for LINSTOR
Apache License 2.0

Auto recovery after server crash. #58

Open azalio opened 4 years ago

azalio commented 4 years ago

Hello! As far as I know, if one of my servers breaks down completely, the linstor plugin won't automatically set up another copy on a different server, and I need to take some manual action to recover. Can you make this happen without manual intervention? Unfortunately, I haven't worked with LINSTOR yet, but I know that if I use Ceph, for example, this is handled without my intervention.

rck commented 4 years ago

I'm not sure this is something that should even be handled at this level. To me the CSI driver is pretty stupid and just reacts to attach/detach and the like. IMO it simply should not try to magically reschedule things in the cluster; it has to be told what to do. Maybe that is something for a k8s operator? @w00jay, any opinion from a higher-level k8s/operator point of view? Is this something the operator could handle (in the very long run), or am I wrong and it should somehow be part of the CSI driver?

w00jay commented 4 years ago

As @rck mentioned, the CSI driver does not, and cannot, create a new volume or replica unless the underlying LINSTOR cluster is already provisioned. The CSI driver does not have any reactive capability.

Even with the current operator, volume placement at creation time is best-effort with no guarantee, as that is the level of service provided by the CSI framework and Kubernetes. We are working toward resolving this in the operator in the long term, but I'm afraid it is not possible at this time.
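For context, redundancy with LINSTOR is normally requested up front at provisioning time rather than re-created reactively by the CSI driver. Below is a minimal sketch of a StorageClass that asks for two replicas, assuming the `autoPlace` and `storagePool` parameters shown in the linstor-csi examples of that era; the storage pool name is a placeholder.

```shell
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: linstor-replicated
provisioner: linstor.csi.linbit.com
parameters:
  # Place two replicas at provisioning time, so the data already
  # exists on a second node if the first one fails.
  autoPlace: "2"
  # Placeholder storage pool name; replace with a pool that exists
  # on your LINSTOR satellites.
  storagePool: "my-pool"
EOF
```

With two replicas, DRBD keeps a copy of the data on a second node, but re-attaching a workload after a node failure still requires the kind of manual steps discussed further down in this thread.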

azalio commented 4 years ago

Thank you for the reply! So as I understand it, if my server dies I will need to take manual action, and the linstor operator can't help me with that?

w00jay commented 4 years ago

If a 'server dies' as in 'a k8s node fails' AND does not come back, the current implementation of k8s and the CSI framework most likely cannot deal with it very well. Most likely the StatefulSet controller cannot be sure whether the node will ever come back, and it will never drop the connection on its own.

Our operator, at this stage, is only concerned with registering new LINSTOR storage nodes and deploying new PVs onto those nodes. Even with the operator, storage attached to a failed node cannot be moved to a new storage node without additional intervention by k8s controller logic that does not exist in the LINSTOR operator at this time.
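For reference, the manual intervention mentioned above usually amounts to telling Kubernetes that the node is really gone so the StatefulSet pod can be rescheduled. This is only a sketch of the general k8s-side steps (the pod and node names are placeholders), not a procedure endorsed by this project, and it is only safe once you are certain the node will not come back.

```shell
# Tell Kubernetes the failed node is gone for good so it stops
# waiting for it to return.
kubectl delete node worker-2

# Force-delete the StatefulSet pod stuck on the dead node so the
# controller can recreate it elsewhere.
kubectl delete pod my-app-0 --grace-period=0 --force

# Check whether the volume is still considered attached to the dead
# node; the corresponding VolumeAttachment may need cleaning up too.
kubectl get volumeattachments
```

After that, the LINSTOR side still decides where the data actually lives, which is why having more than one replica (as in the StorageClass sketch above) matters.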

kvaps commented 4 years ago

I think this issue is really about an auto-recovery feature for the linstor-controller, so it would be better to move it to the proper project:

https://github.com/LINBIT/linstor-server

kvaps commented 4 years ago

I guess this issue was fixed by implementing k8s-await-election for the linstor-controller; see https://github.com/piraeusdatastore/piraeus-operator/issues/56 and https://github.com/piraeusdatastore/piraeus-operator/pull/73.