gluster / anthill

A Kubernetes/OpenShift operator to manage Gluster clusters
http://gluster-anthill.readthedocs.io/
Apache License 2.0

Automatic replacement of failed nodes #20

Open JohnStrunk opened 6 years ago

JohnStrunk commented 6 years ago

Describe the feature you'd like to have. When a gluster pod fails, kube will attempt to restart it; if the failure was a simple crash or other transient problem, a restart (plus automatic self-heal) should be sufficient to repair the system. However, if the node's state becomes corrupt or is lost, it may be necessary to remove the failed node from the cluster and potentially spawn a new one to take its place.
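The distinction above (transient failure vs. lost/corrupt state) could be sketched as a small decision function the operator's reconcile loop might call. This is only an illustrative sketch, not anthill's actual code: the names `Action`, `decide_action`, and `OFFLINE_THRESHOLD_S` are hypothetical, as is the fixed grace period before replacement.

```python
from enum import Enum, auto

class Action(Enum):
    NONE = auto()           # node healthy; nothing to do
    AWAIT_RESTART = auto()  # transient failure; let kube restart the pod and self-heal run
    REPLACE = auto()        # state lost/corrupt or offline too long; remove node, add replacement

# Hypothetical grace period before giving up on a restart (seconds).
OFFLINE_THRESHOLD_S = 600

def decide_action(pod_running: bool, state_intact: bool, offline_seconds: float) -> Action:
    """Classify a gluster node (pod) and pick a repair action."""
    if pod_running and state_intact:
        return Action.NONE
    # If the node's on-disk state is corrupt or lost, a restart cannot help:
    # the node must be removed from the trusted pool and a replacement spawned.
    if not state_intact:
        return Action.REPLACE
    # Otherwise give kube's restart (plus automatic heal) a chance first.
    if offline_seconds < OFFLINE_THRESHOLD_S:
        return Action.AWAIT_RESTART
    return Action.REPLACE
```

A real implementation would derive `state_intact` from GD2/peer status rather than a boolean flag, and the threshold would likely be configurable.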

What is the value to the end user? (why is it a priority?) If a gluster node (pod) remains offline, the bricks it hosts operate with reduced availability and reliability. Being able to automatically repair such failures will increase system availability and protect users' data.

How will we know we have a good solution? (acceptance criteria)

Additional context This relies on the node state machine (#17) and an as-yet-unimplemented GD2 automigration plugin.