Closed ryanmcafee closed 7 months ago
hello @ryanmcafee
Thank you so much for taking the time to code up the solution, explain it, and share it with us. We really appreciate it. However, these are the reasons why we haven't been able to support multiple replicas via a StatefulSet:
- With multiple replicas in a single StatefulSet, we cannot provide support for offlineMaintenanceMode (a Neo4j product feature). You can read more about it here.
- With multiple replicas in a single StatefulSet, you cannot apply specific configuration (if needed) to a single individual instance; the config gets propagated to all the replicas.

The approach of a single StatefulSet per Neo4j instance allows you to use offlineMaintenanceMode and gives you more control over each Neo4j instance.
Yes, the installation gets tricky with multiple Helm installs, and the PDB semantics do not hold correctly, but these are the trade-offs we currently have to make.
We are working on overcoming this and will keep you updated once we have a solution in the future.
Once again, thanks for raising this with us. We really appreciate you taking the time to raise such a detailed PR.
Thanks, Harshit & Bledi
@harshitsinghvi22 thanks for the context that informed the design decision here. Could you add some documentation to the repository covering this design decision and the various trade-offs? Also, given this, what is Neo4j's recommendation for maintaining cluster quorum and Neo4j cluster availability during Neo4j cluster updates/upgrades, Kubernetes cluster maintenance, etc., for Enterprise customers self-managing their Neo4j clusters?
Summary:
The current Neo4j Helm chart implementation hardcodes the StatefulSet replica count to 1 (see), which prevents zero-downtime high availability with StatefulSets, rolling upgrades, and the use of PodDisruptionBudgets to minimize disruption and possible loss of Neo4j cluster quorum. Left unaddressed, this configuration will inevitably lead to cluster outages and quorum violations when multiple Neo4j cluster pods are updated concurrently, and it causes complications during cluster maintenance. See this example GitHub project & branch, which reproduces the issue and provides step-by-step guidance on how to recreate it. See this example GitHub project & main branch for an example project that deploys a Neo4j cluster on AWS using a modified Neo4j Helm chart that deploys multiple replicas per Helm release, respects PodDisruptionBudgets, and maintains cluster quorum.
Pull Request
This PR should address this issue and is backwards compatible with existing consumers of the Neo4j Helm chart.
#319
Background:
The Neo4j Helm chart uses a StatefulSet with a hardcoded replica count of 1 (see) for deploying Neo4j Enterprise Causal Clusters, as identified in the chart's configuration here. As a result, configuring a Neo4j cluster requires deploying the Helm chart p times for primary nodes and s times for secondary nodes. This design decision can, and will, result in quorum violations during cluster updates and/or voluntary disruptions, leading to outages, reliability issues, and customer churn.
A PodDisruptionBudget is a Kubernetes construct (the policy/v1 API reached general availability in Kubernetes 1.21) that ensures a minimum number of instances of an application remain running at any given time; the API rejects voluntary eviction requests that would violate the budget.
Let’s look at a concrete example:
In a 3-primary-node Neo4j StatefulSet (the desired setup; the Neo4j Helm chart doesn't currently offer this), the PodDisruptionBudget should look something like the following, with minAvailable: 2.
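A minimal sketch of such a PodDisruptionBudget; the name and labels below are illustrative, not the chart's actual values:

```yaml
# Hypothetical PDB for a 3-primary Neo4j StatefulSet.
# Name and selector labels are illustrative only.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-primaries-pdb
spec:
  minAvailable: 2          # simple majority of 3 primaries
  selector:
    matchLabels:
      app: neo4j           # must match the StatefulSet's pod labels
```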
It should be noted that setting minAvailable=2 maintains cluster quorum, where quorum is calculated as a simple majority.
A simple majority is where availability is > 0.5.
In this scenario, Kubernetes will ensure that our PodDisruptionBudget is respected; therefore, availability = 2 (total available) / 3 (total desired) ≈ 0.67 > 0.5, so quorum holds.
What if the Neo4j Helm chart is instead deployed 3 times, each release with 1 instance?
This approach poses a new problem: the PDBs will block operations such as Kubernetes node pool upgrades. PDB deployment therefore needs to be disabled when StatefulSet replicas = 1; otherwise, a Kubernetes cluster operator will need to delete the PDBs during cluster maintenance.
With PodDisruptionBudgets disabled, however, there is no construct preventing multiple Neo4j pods from being drained and rescheduled at the same time, which will result in a quorum violation.
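To see why single-replica PDBs block maintenance, consider the budget each one-instance release would need (a sketch; name and labels are illustrative):

```yaml
# With replicas: 1, any minAvailable >= 1 makes every voluntary
# eviction a PDB violation, so `kubectl drain` blocks indefinitely
# during node maintenance.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: neo4j-primary-1-pdb   # one PDB per Helm release (illustrative)
spec:
  minAvailable: 1             # 1 of 1 pods: no eviction is ever allowed
  selector:
    matchLabels:
      app: neo4j-primary-1
```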
Helm Chart Example That Allows Multiple Replicas for StatefulSet (Grafana Labs - Mimir Distributed)
https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L868
https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/templates/ingester/ingester-statefulset.yaml#L17
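Applied to the Neo4j chart, the equivalent change might look like the following template sketch; `.Values.replicas` is a proposed value, not an existing chart key:

```yaml
# templates/neo4j-statefulset.yaml (sketch, not the actual chart file)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: "{{ .Release.Name }}"
spec:
  # Proposed: configurable, defaulting to 1 for backwards compatibility
  # (currently hardcoded to 1 in the chart).
  replicas: {{ .Values.replicas | default 1 }}
  serviceName: "{{ .Release.Name }}"
  selector:
    matchLabels:
      app: "{{ .Release.Name }}"
```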
Challenges:
Proposed Changes
If StatefulSet replicas > 1:
- Deploy a Service per pod to allow Raft discovery and integration with the existing service discovery mechanism, with no code changes required in Neo4j
- Deploy the existing Services (inclusive of all pod endpoints, for compatibility with other existing Neo4j Helm charts)
else:
- Deploy the existing Services
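The per-pod Service could be rendered with a range loop over the replica count. A sketch, assuming a proposed `.Values.replicas` key; names are illustrative, and the ports shown are Neo4j's default discovery (5000) and Raft (7000) ports:

```yaml
# templates/neo4j-per-pod-services.yaml (sketch)
{{- if gt (int .Values.replicas) 1 }}
{{- range $i := until (int .Values.replicas) }}
---
apiVersion: v1
kind: Service
metadata:
  name: "{{ $.Release.Name }}-{{ $i }}"
spec:
  clusterIP: None   # headless: DNS resolves directly to the one pod
  selector:
    # Label that Kubernetes adds automatically to StatefulSet pods,
    # letting each Service target exactly one pod.
    statefulset.kubernetes.io/pod-name: "{{ $.Release.Name }}-{{ $i }}"
  ports:
    - name: discovery
      port: 5000
    - name: raft
      port: 7000
{{- end }}
{{- end }}
```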
Example - Loss of cluster quorum during updates
https://github.com/JupiterOne/provision-neo4j-cluster-k8s-aws-example/tree/feature/pod-disruption-violation-example-cluster-updates
Example - Proposed Changes - Working As Expected
https://github.com/JupiterOne/provision-neo4j-cluster-k8s-aws-example