neo4j / helm-charts

Apache License 2.0

Neo4J Helm Chart - Statefulset Readiness/Start Up Probe Reports Healthy Before All Databases are Fully Started - Support customizing Neo4J readiness/start up probes or update Neo4J readiness/start up probe implementation to address #337

Open ryanmcafee opened 1 month ago

ryanmcafee commented 1 month ago

Is your feature request related to a problem? Please describe.

At JupiterOne, we are an enterprise customer of Neo4j and manage the deployment of Neo4j clusters on Kubernetes using the upstream Neo4j Helm chart. During rolling deployments/updates to our Neo4j clusters (using the ArgoCD App of Apps pattern and ArgoCD sync waves), we hit an issue that can lead to service outages for customers with large databases: the readiness probe for a Neo4j cluster member Kubernetes pod reports healthy before all databases are fully started and available. This premature healthy status causes the ArgoCD sync wave to progress. When multiple Neo4j cluster members are updated in quick succession like this, a database can be left in a "starting" (rather than "online") state across multiple cluster members, losing write quorum and causing a service outage (loss of writes and severely degraded read performance).

Additional Context

We are using the v5.20.0 version of the Neo4J helm chart.

To manage the deployment of our Neo4J clusters, we are using Terraform + ArgoCD to deploy and manage our Neo4J clusters.

During an automated rolling deployment with ArgoCD and ArgoCD sync waves, the readiness probe will report a Neo4J cluster member k8s pod as healthy before all databases are fully started and available.

The readiness probe status is propagated up the parent ArgoCD application, which reports the Neo4J cluster member as healthy, which allows ArgoCD to progress to syncing the next child application.

We have found that for larger databases, which can take 10-15 minutes to start, the readiness probe reports a Neo4j cluster member k8s pod as healthy while those databases are still starting.

In the case of a 3-node cluster, we have seen all 3 nodes reported as healthy by the readiness probe while a (non-neo4j/system) database is still starting on all 3 nodes, which leads to a loss of quorum for that database and a complete service outage for it.

Getting further into supporting details:

We are using the ArgoCD app of apps pattern to deploy a root neo4j-cluster ArgoCD application, which will deploy as children the cluster members for a Neo4j cluster. The child Neo4J cluster members have an ArgoCD sync wave (https://argo-cd.readthedocs.io/en/stable/user-guide/sync-waves/) specified, which allows us to perform rolling deployments in order and wait for a healthy status of a Neo4J cluster member.

Our ArgoCD app of apps pattern is working well for deploying and managing our Neo4J clusters, but we are seeing an issue with the readiness probe for a Neo4j cluster member k8s pod.

Upon investigation, we found that the Neo4j Helm chart today does not support custom health checks/overrides for the readiness, startup, or liveness probes (https://github.com/neo4j/helm-charts/blob/dev/neo4j/templates/neo4j-statefulset.yaml#L158); instead it performs a TCP check on the bolt port of a Neo4j cluster member k8s pod. The bolt TCP health check does not take into consideration whether all non-neo4j/system databases have finished starting.
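For context, the chart's check operates at the TCP level, roughly equivalent to the following probe spec (a simplified sketch; the exact port name, thresholds, and timings in the chart may differ):

```yaml
# Simplified sketch of a TCP-level readiness check on the bolt port.
# A TCP connect succeeds as soon as the process is listening,
# regardless of per-database state, which is the root of the issue.
readinessProbe:
  tcpSocket:
    port: 7687          # bolt
  periodSeconds: 5      # illustrative values, not the chart's defaults
  failureThreshold: 20
```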

We are seeing an issue wherein a Neo4j cluster member k8s pod is considered/reported healthy by the readiness probe before all databases for a Neo4j cluster member are fully started and available.

The bolt port becomes available before all databases are fully started, which is why the readiness probe reports the pod as healthy prematurely.

We consider the current readiness probe dangerous because it quickly leads to pods that the probe considers healthy while their databases are still starting.

Once the readiness probe reports a pod as healthy, the ArgoCD sync wave proceeds to the next pod in the deployment, even though the previous pod's databases may not yet be available.

An example of our ArgoCD app of apps pattern for the management of a Neo4j cluster (neo4j-cluster-001-dev) looks like the following:

- neo4j-cluster-001-dev - neo4j-cluster ArgoCD application (internal) - ArgoCD Neo4j app of apps
  - neo4j-cluster-001-p-001-dev - Neo4j cluster member - using the upstream Helm chart - https://github.com/neo4j/helm-charts/blob/dev/neo4j
  - neo4j-cluster-001-p-002-dev - Neo4j cluster member - using the upstream Helm chart - https://github.com/neo4j/helm-charts/blob/dev/neo4j
  - neo4j-cluster-001-p-003-dev - Neo4j cluster member - using the upstream Helm chart - https://github.com/neo4j/helm-charts/blob/dev/neo4j

For example, let's assume that we have a Neo4j cluster wherein we have updated Neo4j configuration settings:

We will have made an update to the neo4j-cluster argocd root application, which will trigger a rolling deployment of the neo4j-cluster members by the ArgoCD application controller.

Rolling deployment process:

1) ArgoCD syncs the update for neo4j-cluster-001-p-001-dev and waits for the Neo4j container readiness probe to report healthy.

The readiness probe reports the pod as healthy before all databases are fully started and available, as indicated by SHOW DATABASES. The application health status is propagated up to the parent ArgoCD application, which reports the neo4j-cluster-001-p-001-dev member as healthy and allows ArgoCD to progress to syncing the next child application: neo4j-cluster-001-p-002-dev.

SHOW DATABASES output for neo4j-cluster-001-p-001-dev after the readiness probe reports the pod as healthy:

Databases:
  neo4j:               requestedStatus: online  currentStatus: online
  system:              requestedStatus: online  currentStatus: online
  some-large-database: requestedStatus: online  currentStatus: starting

2) ArgoCD syncs the update for neo4j-cluster-001-p-002-dev and waits for the Neo4j container readiness probe to report healthy.

ArgoCD starts the sync/update for neo4j-cluster-001-p-002-dev and waits for the Neo4j container readiness probe to report healthy (meanwhile some-large-database is still starting for cluster member: neo4j-cluster-001-p-001-dev ).

neo4j-cluster-001-p-002-dev k8s pod is restarted.

Kubernetes evaluates the readiness probe for the neo4j-cluster-001-p-002-dev k8s pod and reports the pod as healthy before all databases are fully started and available as indicated by: SHOW DATABASES.

At this point, the neo4j-cluster-001-p-002-dev k8s pod is considered healthy by the readiness probe, but all databases are not fully started and available.

SHOW DATABASES output for neo4j-cluster-001-p-001-dev after the readiness probe reports the pod as healthy:

Databases:
  neo4j:               requestedStatus: online  currentStatus: online
  system:              requestedStatus: online  currentStatus: online
  some-large-database: requestedStatus: online  currentStatus: starting

Reads available: yes (degraded - high latency - instance overloaded)
Writes available: no

The application health status is propagated up the parent ArgoCD application, which reports the neo4j-cluster-001-p-002-dev member also as healthy.

Lost quorum: 2/3 instances (neo4j-cluster-001-p-001-dev and neo4j-cluster-001-p-002-dev) are not available for database: some-large-database
some-large-database cluster members available: 1/3 (neo4j-cluster-001-p-003-dev is still available)

ArgoCD progresses to syncing the next child application, neo4j-cluster-001-p-003-dev, because the readiness probe reports the neo4j-cluster-001-p-002-dev k8s pod as healthy.

3) ArgoCD syncs the update for neo4j-cluster-001-p-003-dev and waits for the Neo4j container readiness probe to report healthy.

ArgoCD starts the sync/update for neo4j-cluster-001-p-003-dev and waits for the Neo4j container readiness probe to report healthy (meanwhile some-large-database is still starting on cluster members neo4j-cluster-001-p-001-dev, neo4j-cluster-001-p-002-dev, and now neo4j-cluster-001-p-003-dev).

The neo4j-cluster-001-p-003-dev k8s pod is restarted to apply the update.

Kubernetes evaluates the readiness probe for the neo4j-cluster-001-p-003-dev k8s pod and reports the pod as healthy before all databases are fully started and available as indicated by: SHOW DATABASES.

At this point, the neo4j-cluster-001-p-003-dev k8s pod is considered healthy by the readiness probe, but all databases are not fully started and available.

Databases:
  neo4j:               requestedStatus: online  currentStatus: online
  system:              requestedStatus: online  currentStatus: online
  some-large-database: requestedStatus: online  currentStatus: starting

The application health status is propagated up the parent ArgoCD application, which reports the neo4j-cluster-001-p-003-dev member also as healthy.

Lost quorum: 3/3 - neo4j-cluster-001-p-001-dev, neo4j-cluster-001-p-002-dev, and neo4j-cluster-001-p-003-dev are not available for database: some-large-database
some-large-database cluster members available: 0/3

Reads available: no
Writes available: no

Implications:

- Complete service outage for the customer with database: some-large-database
- Sev1 is created for the outage
- Customer is unhappy
- Reputational damage
- Increased risk of customer churn

Describe the solution you'd like

Option 1:

Customizable Health Checks (Readiness, Liveness, Startup Probes):

Summary: Provide support for customizing health checks/overrides for the readiness, liveness and startup probes in the Neo4J helm chart.

Details:

Use cases: Wait for all databases to be fully started and available before the readiness/startup probe reports a neo4j cluster member k8s pod as healthy.

Pros:

Cons:

Related PR/Implementation: https://github.com/neo4j/helm-charts/pull/338
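As an illustration only (the key names below are hypothetical, not the chart's actual values schema; PR #338 defines the real interface), an override in values could look roughly like:

```yaml
# Hypothetical values.yaml fragment (illustrative key names only):
# replace the default startup probe with an exec check that passes
# only once SHOW DATABASES reports every database online.
startupProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - /scripts/check-all-databases-online.sh   # user-supplied script
  periodSeconds: 10
  failureThreshold: 120   # allow ~20 minutes for large databases to start
```

The generous failureThreshold matters here: Kubernetes gives the startup probe periodSeconds x failureThreshold to succeed before restarting the container, so large databases need a large budget.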

Option 2:

Update Readiness/Startup Probes Implementation to Wait for All Databases to be Fully Started and Available Before Reporting a Neo4J Cluster Member K8s Pod as Healthy.

Summary: Update the readiness/startup probes default behavior to wait for all databases to be fully started and available before the readiness probe reports a Neo4j cluster member k8s pod as healthy.

Use cases: Wait for all databases to be fully started and available before the readiness/startup probe reports a Neo4j cluster member k8s pod as healthy.

Details:

Pros:

Cons:

Describe alternatives you've considered

Thank you for your attention to this matter. We look forward to your feedback on the reported issue and proposed solutions.

davidlrosenblum commented 1 month ago

@ryanmcafee - I think the question is what constitutes readiness. Ability to query the database(s)? That is probably empirically the best evidence the db is ready. Warmup complete also?

ryanmcafee commented 1 month ago

Hi @davidlrosenblum, the core issue is that the startupProbe and readinessProbe currently use the same check. At the very least, I would like to see the ability to override or customize the startup probe to address the issues I expand on below.

I am advocating for the startup probe to be more flexible in the startup health checks it enforces, and for the Neo4j Helm chart to allow consumers to override the default startup probe.

The startup probe delays the readiness probe from being evaluated, so the startup probe can implement more exhaustive database-started checks that we (JupiterOne, Neo4j, and other customers running Neo4j) generally wouldn't want in the pod readiness probe (since we want pods to continue receiving traffic in cases such as a cluster operator stopping a single multi-tenant Neo4j database).

Running:

SHOW DATABASE foo

will return columns/fields with values along the following lines:

requestedStatus: "offline" | "online"
currentStatus: "online" | "offline" | "starting" | "stopping" | "store copying" | "initial" | "deallocating" | "dirty" | "quarantined" | "unknown"

According to the Neo4j docs (https://neo4j.com/docs/operations-manual/current/clustering/monitoring/show-databases-monitoring/), a currentStatus of "starting" only denotes that recovery is in progress, i.e. that transaction logs are being replayed, etc.

When a Neo4j server/cluster member is restarted, the requestedStatus will be "online" and the currentStatus will be "starting".

Initial state: currentStatus "starting", requestedStatus "online"

Target state: currentStatus "online", requestedStatus "online"

The current startup probe will report healthy (behavior I consider a bug/incorrect) even under the following condition:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name    | type      | aliases | access     | address        | role      | writer | requestedStatus | currentStatus | statusMessage | default | home | constituents |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|"neo4j"  |"standard" |[]       |"read-write"|"localhost:7687"| "primary" | true   | "online"        | "online"      | ""            |true     |true  |[]            |
|"system" |"system"   |[]       |"read-write"|"localhost:7687"| "primary" | true   | "online"        | "online"      | ""            |false    |false |[]            |
|"test"   |"system"   |[]       |"read-write"|"localhost:7687"| "primary" | true   | "online"        | "starting"    | ""            |false    |false |[]            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

While I consider the above "ready" (the bolt endpoint can be communicated with), I certainly do not consider all the databases started and online, which is exactly the distinction we want implemented in the startupProbe.

The custom startup probe I made (outside of this PR/issue) delays reporting a healthy status until the following database conditions are met:

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| name    | type      | aliases | access     | address        | role      | writer | requestedStatus | currentStatus | statusMessage | default | home | constituents |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|"neo4j"  |"standard" |[]       |"read-write"|"localhost:7687"| "primary" | true   | "online"        | "online"      | ""            |true     |true  |[]            |
|"system" |"system"   |[]       |"read-write"|"localhost:7687"| "primary" | true   | "online"        | "online"      | ""            |false    |false |[]            |
|"test"   |"system"   |[]       |"read-write"|"localhost:7687"| "primary" | true   | "online"        | "online"      | ""            |false    |false |[]            |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Meanwhile, even once all the above databases are in an online state, they could still be loading/hydrating the page cache.

There are many different ways to implement custom startupProbes depending on the use case, but I think it's certainly reasonable to allow consumers of the Helm chart to customize and override these probes for their own use cases.

This is a PR I raised to extend the Neo4j Helm chart to support these use cases:

https://github.com/neo4j/helm-charts/pull/338

Ultimately, I will be happy to revise my PR to scope it down to adding support for overriding the startupProbe, if that's the easiest path to getting this merged upstream in the Neo4j Helm chart.

Startup probe:

Are all the databases in a healthy **and started** state? (If databases are in a "starting" state, wait until they are in an "online" state before reporting the startup probe status as healthy.)

Prematurely reporting the probe as healthy results in premature deployment progression when using ArgoCD or other tools to perform a rolling update. If the deployment tool is not configured to wait for the databases to reach an online state, it will progress and begin the upgrade process for additional pods as well.

In the example of a 3-node cluster, this can result in 2 of the 3 nodes being in a starting state while only the 3rd node is still online, which leads to loss of write quorum.
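A minimal sketch of what such a startup check could look like, assuming cypher-shell is available in the container and credentials are in the pod environment (the script, query shape, and parsing are illustrative, not taken from the chart or the PR):

```shell
#!/usr/bin/env bash
# Hypothetical startup-probe helper (a sketch, not part of the chart).
# A real probe would pipe live output into the check, e.g.:
#   cypher-shell -u "$NEO4J_USER" -p "$NEO4J_PASSWORD" --format plain \
#     'SHOW DATABASES YIELD name, currentStatus RETURN name, currentStatus' \
#     | all_databases_online
set -uo pipefail

# Exit 0 only if every data row of the plain-format output has
# currentStatus == "online"; any other status (starting, store copying,
# quarantined, ...) counts as not ready.
all_databases_online() {
  awk -F', ' 'NR > 1 { gsub(/"/, "", $2); if ($2 != "online") bad = 1 }
              END { exit bad }'
}

# Demonstration against canned output (one database still starting):
sample='name, currentStatus
"neo4j", "online"
"system", "online"
"some-large-database", "starting"'

if printf '%s\n' "$sample" | all_databases_online; then
  echo "probe: healthy"
else
  echo "probe: not ready"   # expected here: some-large-database is starting
fi
```

The check mirrors the initial-state/target-state transition described above: the probe stays unhealthy until every database reaches currentStatus "online".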

Readiness probe:

Does the bolt TCP port respond? (Indicates the Neo4j process is running and accepting connections.)
(The current upstream Neo4j readiness probe check is sufficient for this.)

Liveness probe:

Is the Neo4J process running?
If not, restart the pod.

(Current upstream Neo4J liveness probe check sufficient)

Additional reading:

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Yoni-hataim commented 1 week ago

Hi, I know this is not really related to this thread, but I've been having trouble understanding how to deploy the chart using ArgoCD. I wanted to ask whether you created multiple values files, one for each cluster member, or if there is another way. Also, I would love to take a look at your ApplicationSet if possible. Thank you in advance.