neo4j / helm-charts

Apache License 2.0

Adds support for deploying a StatefulSet with multiple replicas to maintain cluster quorum during cluster updates or during deployments/upgrades of Neo4j clusters. #318

Closed ryanmcafee closed 7 months ago

ryanmcafee commented 7 months ago

Summary:

The current Neo4j Helm chart implementation hardcodes the StatefulSet replica count to 1 (see), limiting the ability to implement zero-downtime high availability with StatefulSets, rolling upgrades, and Pod Disruption Budgets that minimize disruption and avoid loss of Neo4j cluster quorum. Left as-is, this configuration will inevitably lead to cluster outages and quorum violations when multiple Neo4j cluster pods are updated concurrently, and it complicates cluster maintenance. See this example GitHub project & branch, which reproduces the issue with step-by-step guidance. See this example GitHub project & main branch for an example project that deploys a Neo4j cluster on AWS using a modified Neo4j Helm chart, deploying multiple replicas per Helm release to respect Pod Disruption Budgets and maintain cluster quorum.

Pull Request

This PR addresses the issue and is backwards compatible with existing consumers of the Neo4j Helm chart.

#319

Background:

The Neo4j Helm chart uses a StatefulSet with a hardcoded replica count of 1 (see) for deploying Neo4j Enterprise Causal Clusters, as identified in the chart's configuration here. This means that, to configure a Neo4j cluster, the Helm chart must be deployed p times for primary nodes and s times for secondary nodes. This design decision can and will result in quorum violations during cluster updates and/or voluntary disruptions, leading to outages, reliability issues, and customer churn.

A PodDisruptionBudget (PDB) is a Kubernetes construct (stable in policy/v1 as of Kubernetes 1.21) that ensures a minimum number of instances of an application remain running at any given time, rejecting voluntary eviction requests that would violate the budget.

Let’s look at a concrete example:

In a 3-primary-node Neo4j StatefulSet (the desired setup; the Neo4j Helm chart doesn't currently offer this), the PodDisruptionBudget should look something like:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001

where minAvailable=2

Note that setting minAvailable=2 maintains cluster quorum, where quorum is defined as a simple majority.

A simple majority requires availability > 0.5.

In this scenario, Kubernetes will ensure that our Pod Disruption Budget is respected; therefore availability = 2 (total available) / 3 (total desired) ≈ 0.67 > 0.5, so quorum is maintained.

What if the Neo4j Helm chart is instead deployed 3 times, each with 1 instance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-0-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001-0
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-1-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001-1
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-2-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001-2

This approach poses a new problem: these per-instance PDBs will block operations such as Kubernetes node pool upgrades, because with a single replica and minAvailable: 1, evicting the pod always violates the budget. The deployment of PDBs therefore needs to be disabled when StatefulSet replicas = 1; otherwise a Kubernetes cluster operator must delete the PDBs during cluster maintenance.

With PodDisruptionBudgets disabled, there is no construct preventing multiple Neo4j pods from being drained and rescheduled at the same time, which will result in a quorum violation.
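One way to resolve this tension is to render the PDB only when more than one replica is requested, with minAvailable computed as a simple majority. This is a sketch, not the chart's actual template; the `statefulset.replicas` values key and the `neo4j.fullname` helper are hypothetical names:

```yaml
# templates/poddisruptionbudget.yaml (sketch; `statefulset.replicas` and
# the `neo4j.fullname` helper are illustrative names, not the chart's API)
{{- if gt (int .Values.statefulset.replicas) 1 }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "neo4j.fullname" . }}-pdb
  namespace: {{ .Release.Namespace }}
spec:
  # simple majority: floor(replicas / 2) + 1
  minAvailable: {{ add (div (int .Values.statefulset.replicas) 2) 1 }}
  selector:
    matchLabels:
      helm.neo4j.com/instance: {{ .Release.Name }}
{{- end }}
```

With replicas = 1 the PDB is simply not rendered, so node drains proceed; with replicas = 3 it renders minAvailable: 2, matching the quorum example above.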

Helm Chart Example That Allows Multiple Replicas for StatefulSet (Grafana Labs - Mimir Distributed)

https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L868

https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/templates/ingester/ingester-statefulset.yaml#L17
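The pattern in the Mimir chart linked above boils down to exposing the replica count in values.yaml and referencing it from the StatefulSet template. Roughly (simplified excerpt, not a verbatim copy of the Mimir files):

```yaml
# values.yaml (simplified)
ingester:
  replicas: 3
---
# templates/ingester/ingester-statefulset.yaml (simplified)
apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: {{ .Values.ingester.replicas }}
```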

Challenges:

  1. Quorum Violations: The hardcoded replica count prevents the effective use of Pod Disruption Budgets, risking quorum violations during updates.
  2. Limited High Availability: The inability to scale the number of replicas dynamically restricts the cluster's high availability capabilities.
  3. Operational Rigidity: The current configuration demands more cautious planning for updates and maintenance to avoid potential downtime.

Proposed Changes

  1. Modify the Helm chart to support configurable replica counts, removing the hardcoded limit and enabling dynamic scaling.
  2. Default the Neo4j Helm chart's StatefulSet replicas to 1, but allow overrides via values.yaml, preserving backwards compatibility.
  3. Update the service discovery mechanism to handle replicas > 1.
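Concretely, the proposal amounts to something like the following (a sketch under assumed names; `statefulset.replicas` is an illustrative values key, not necessarily the one used in the PR):

```yaml
# values.yaml — new knob, defaulting to 1 for backwards compatibility
statefulset:
  replicas: 1
---
# templates/neo4j-statefulset.yaml — replaces the hardcoded `replicas: 1`
apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: {{ .Values.statefulset.replicas | default 1 }}
```

Existing consumers who never set the new key get exactly the current single-replica behavior, while clusters that opt in can run 3+ replicas under one release and pair them with a single majority-preserving PDB.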

Example - Loss of cluster quorum during updates

https://github.com/JupiterOne/provision-neo4j-cluster-k8s-aws-example/tree/feature/pod-disruption-violation-example-cluster-updates

Example - Proposed Changes - Working As Expected

https://github.com/JupiterOne/provision-neo4j-cluster-k8s-aws-example

harshitsinghvi22 commented 7 months ago

hello @ryanmcafee

Thank you so much for taking the time to code this solution up, explain it, and share it with us. We really appreciate it. However, these are the reasons why we haven't been able to support multiple replicas via a StatefulSet:

The single-StatefulSet-per-Neo4j-instance approach allows you to use offlineMaintenanceMode and gives you more control over each Neo4j instance.

Yes, the installation gets tricky with multiple Helm installs, and the PDB does not hold correctly, but these are the tradeoffs we currently have to make.

We are working on overcoming this and will keep you updated once we have a solution in the future.

Once again, thanks for raising this with us. We really appreciate you taking the time to raise a detailed PR.

Thanks, Harshit, Bledi

ryanmcafee commented 7 months ago

@harshitsinghvi22 thanks for the context that informed the design decision here. Can you add some documentation to the repository covering this design decision and the various trade-offs? Also, given this, what is Neo4j's recommendation for maintaining cluster quorum and Neo4j cluster availability during Neo4j cluster updates/upgrades, Kubernetes cluster maintenance, etc., for Enterprise customers self-managing their Neo4j clusters?