neo4j / helm-charts

Apache License 2.0

Adds support for deploying a StatefulSet with multiple replicas to maintain cluster quorum during cluster updates or during deployments/upgrades of Neo4j clusters. #318

Closed ryanmcafee closed 7 months ago

ryanmcafee commented 7 months ago

Summary:

The current Neo4j Helm chart implementation hardcodes the StatefulSet replica count to 1 (see), limiting the ability to implement zero-downtime high availability with StatefulSets, rolling upgrades, and Pod Disruption Budgets that minimize disruption and avoid loss of Neo4j cluster quorum. Left as-is, this configuration will inevitably lead to cluster outages and quorum violations when multiple Neo4j cluster pods are updated concurrently, and it complicates cluster maintenance. See this example GitHub project & branch, which reproduces the issue with step-by-step guidance. See this example GitHub project & main branch for an example project that deploys a Neo4j cluster on AWS using a modified Neo4j Helm chart, deploying multiple replicas per Helm release to respect Pod Disruption Budgets and maintain cluster quorum.

Pull Request

This PR addresses the issue and is backwards compatible with existing consumers of the Neo4j Helm chart.

#319

Background:

The Neo4j Helm chart uses a StatefulSet with a hardcoded replica count of 1 (see) for deploying Neo4j Enterprise Causal Clusters, as identified in the chart's configuration here. This means that, to configure a Neo4j cluster, the Helm chart must be deployed p times for primary nodes and s times for secondary nodes. This design decision can and will result in quorum violations during cluster updates and/or voluntary disruptions, leading to outages, reliability issues, and customer churn.

A PodDisruptionBudget (PDB) is a Kubernetes construct (stable in policy/v1 as of Kubernetes 1.21) that ensures a minimum number of instances of an application remain running at any given time, rejecting voluntary eviction requests that would violate the budget.

Let’s look at a concrete example:

In a 3-primary-node Neo4j StatefulSet (the desired setup; the Neo4j Helm chart doesn't currently offer this), the PodDisruptionBudget should look something like:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001

where minAvailable=2

Note that setting minAvailable=2 maintains cluster quorum, where quorum is defined as a simple majority.

A simple majority requires availability > 0.5.

In this scenario, Kubernetes will ensure that our Pod Disruption Budget is respected; therefore availability = 2 (total available) / 3 (total desired) ≈ 0.67 > 0.5, so quorum is maintained.

What if the Neo4j Helm chart is instead deployed 3 times, each with 1 instance:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-0-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001-0
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-1-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001-1
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-cluster-001-2-pdb
  namespace: neo4j-cluster-001
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: neo4j-cluster-001
      helm.neo4j.com/instance: neo4j-cluster-001-2

This approach poses a new problem: these per-instance PDBs will block operations such as Kubernetes node pool upgrades, because with a single replica and minAvailable: 1, evicting the pod always violates the budget. The deployment of PDBs therefore needs to be disabled when StatefulSet replicas = 1; otherwise a Kubernetes cluster operator must delete the PDBs during cluster maintenance.

With PodDisruptionBudgets disabled, there is no construct preventing multiple Neo4j pods from being drained and rescheduled at the same time, which will result in a quorum violation.
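One way to resolve this tension is to render the PDB only when more than one replica is requested, with minAvailable computed as a simple majority. This is a sketch, not the chart's actual template; the `statefulset.replicas` values key and the `neo4j.fullname` helper are hypothetical names:

```yaml
# templates/poddisruptionbudget.yaml (sketch; `statefulset.replicas` and
# the `neo4j.fullname` helper are illustrative names, not the chart's API)
{{- if gt (int .Values.statefulset.replicas) 1 }}
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {{ include "neo4j.fullname" . }}-pdb
  namespace: {{ .Release.Namespace }}
spec:
  # simple majority: floor(replicas / 2) + 1
  minAvailable: {{ add (div (int .Values.statefulset.replicas) 2) 1 }}
  selector:
    matchLabels:
      helm.neo4j.com/instance: {{ .Release.Name }}
{{- end }}
```

With replicas = 1 the PDB is simply not rendered, so node drains proceed; with replicas = 3 it renders minAvailable: 2, matching the quorum example above.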

Helm Chart Example That Allows Multiple Replicas for StatefulSet (Grafana Labs - Mimir Distributed)

https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L868

https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/templates/ingester/ingester-statefulset.yaml#L17
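The pattern in the Mimir chart linked above boils down to exposing the replica count in values.yaml and referencing it from the StatefulSet template. Roughly (simplified excerpt, not a verbatim copy of the Mimir files):

```yaml
# values.yaml (simplified)
ingester:
  replicas: 3
---
# templates/ingester/ingester-statefulset.yaml (simplified)
apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: {{ .Values.ingester.replicas }}
```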

Challenges:

  1. Quorum Violations: The hardcoded replica count prevents the effective use of Pod Disruption Budgets, risking quorum violations during updates.
  2. Limited High Availability: The inability to scale the number of replicas dynamically restricts the cluster's high availability capabilities.
  3. Operational Rigidity: The current configuration demands more cautious planning for updates and maintenance to avoid potential downtime.

Proposed Changes

  1. Modify the Helm chart to support configurable replica counts, removing the hardcoded limit and enabling dynamic scaling.
  2. Default the Neo4j Helm chart's StatefulSet replicas to 1, but allow overrides via values.yaml, preserving backwards compatibility.
  3. Update the service discovery mechanism to handle replicas > 1.
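Concretely, the proposal amounts to something like the following (a sketch under assumed names; `statefulset.replicas` is an illustrative values key, not necessarily the one used in the PR):

```yaml
# values.yaml — new knob, defaulting to 1 for backwards compatibility
statefulset:
  replicas: 1
---
# templates/neo4j-statefulset.yaml — replaces the hardcoded `replicas: 1`
apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: {{ .Values.statefulset.replicas | default 1 }}
```

Existing consumers who never set the new key get exactly the current single-replica behavior, while clusters that opt in can run 3+ replicas under one release and pair them with a single majority-preserving PDB.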

Example - Loss of cluster quorum during updates

https://github.com/JupiterOne/provision-neo4j-cluster-k8s-aws-example/tree/feature/pod-disruption-violation-example-cluster-updates

Example - Proposed Changes - Working As Expected

https://github.com/JupiterOne/provision-neo4j-cluster-k8s-aws-example

harshitsinghvi22 commented 7 months ago

hello @ryanmcafee

Thank you so much for taking the time to code this solution up, explain it, and share it with us. We really appreciate it. However, these are the reasons why we haven't been able to support multiple replicas via a StatefulSet:

The single-StatefulSet-per-Neo4j-instance approach allows you to use offlineMaintenanceMode and gives you more control over each Neo4j instance.

Yes, the installation gets tricky with multiple Helm installs, and the PDB does not hold correctly, but these are the tradeoffs we currently have to make.

We are working on overcoming this and will keep you updated once we have a solution in the future.

Once again, thanks for raising this with us. We really appreciate you taking the time to raise a detailed PR.

Thanks, Harshit, Bledi

ryanmcafee commented 7 months ago

@harshitsinghvi22 thanks for the context that informed the design decision here. Can you add some documentation to the repository covering this design decision and the various trade-offs? Also, given this, what is Neo4j's recommendation for maintaining cluster quorum and Neo4j cluster availability during Neo4j cluster updates/upgrades, Kubernetes cluster maintenance, etc., for Enterprise customers self-managing their Neo4j clusters?