gluster / anthill

A Kubernetes/OpenShift operator to manage Gluster clusters
http://gluster-anthill.readthedocs.io/
Apache License 2.0

Pod disruption budget to manage availability #32

Open JohnStrunk opened 6 years ago

JohnStrunk commented 6 years ago

Describe the feature you'd like to have.

The operator should maintain a pod disruption budget (PDB) for the Gluster cluster pods so that voluntary disruptions cannot hurt service availability. After a Gluster pod has been down for any reason, the data hosted on that pod will likely need to be healed before the next outage can be fully tolerated. A disruption budget prevents Kubernetes from voluntarily taking down a pod until the required number of pods are up and healthy.
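As a rough illustration of how a budget gates voluntary evictions, the sketch below mirrors the arithmetic Kubernetes uses for a PDB with `minAvailable`: the number of evictions allowed is the count of healthy (Ready) pods minus `minAvailable`, floored at zero. The function name and the 3-pod cluster are hypothetical and only serve the example.

```python
def disruptions_allowed(current_healthy: int, min_available: int) -> int:
    """Return how many voluntary evictions a PDB would permit right now.

    Mirrors the PDB rule: allowed = healthy pods - minAvailable, never below 0.
    """
    return max(0, current_healthy - min_available)


# Hypothetical 3-pod Gluster cluster with minAvailable = 2.
print(disruptions_allowed(3, 2))  # all pods healthy: 1 eviction allowed
print(disruptions_allowed(2, 2))  # one pod already down: 0 evictions allowed
```

With all three pods healthy, one voluntary eviction is permitted; once a pod is down, further evictions are blocked until it is healthy again.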

What is the value to the end user? (why is it a priority?)

Users expect storage to be continuously available, through both planned and unplanned events. Properly maintained disruption budgets will prevent voluntary events (upgrades, etc.) from causing outages.

How will we know we have a good solution? (acceptance criteria)

Additional context

This item will need some investigation (and may not actually be usable):

JohnStrunk commented 5 years ago

Upon further discussion, the above solution of having the PDB with a minAvailable of N-1 and using the pod readiness probe to indicate pending heals is flawed. By signaling "unready", the pod makes itself ineligible to receive traffic that is routed via a service. Since we intend to use a service as the primary method for contacting the Gluster cluster, this is undesirable.

The suggested approach is to use the PDB's .spec.minAvailable field and have the operator manually toggle it between N (when there are pending heals) and N-1 (when all volumes are fully in-sync).
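A minimal sketch of the toggle described above, assuming the operator already knows the cluster size N and whether any volume has pending heals. The function names, the PDB name, and the label selector are hypothetical; the manifest fields follow the `policy/v1` PodDisruptionBudget schema (older clusters would use `policy/v1beta1`). The sketch only builds the desired object and does not talk to the API server.

```python
def desired_min_available(cluster_size: int, pending_heals: bool) -> int:
    """Pick minAvailable: N while heals are pending, N-1 once fully in sync."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size if pending_heals else cluster_size - 1


def pdb_manifest(name: str, cluster_size: int, pending_heals: bool,
                 match_labels: dict) -> dict:
    """Build the PDB object the operator would apply (names are illustrative)."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": desired_min_available(cluster_size, pending_heals),
            "selector": {"matchLabels": dict(match_labels)},
        },
    }


# Hypothetical 3-pod cluster: block all evictions while a heal is in progress.
healing = pdb_manifest("gluster-pdb", 3, True, {"app": "gluster"})
in_sync = pdb_manifest("gluster-pdb", 3, False, {"app": "gluster"})
print(healing["spec"]["minAvailable"], in_sync["spec"]["minAvailable"])
```

Because the pods stay Ready throughout, they keep receiving traffic through the service; only the budget changes as heal status changes.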