neo4j-contrib / neo4j-helm

Helm Charts for running Neo4j on Kubernetes [DEPRECATED]
https://neo4j-contrib.github.io/neo4j-helm/user-guide/USER-GUIDE.html
Apache License 2.0

Readiness probe prevents recovery #220

Open phoerious opened 3 years ago

phoerious commented 3 years ago

Describe the bug The readiness probe prevents recovery of large databases. If I restart a node from a clean data directory (e.g. after a hard drive failure), it takes a long time for the server to come back up. In particular, the step "Started downloading snapshot for database XXX" can take several minutes or longer. Unfortunately, this step never finishes, because the readiness probe always kills the container before the download is done.

For now I have set the timeout to one hour via the Helm chart values YAML. It would be great if there were a more intelligent way to do this, though: a shorter timeout is definitely useful under normal circumstances, and I don't want to reinstall the Helm chart and restart the whole deployment before and after every single-node recovery.
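As a rough illustration of what a one-hour budget looks like at the Kubernetes level, here is a minimal sketch of a container spec fragment. It is not copied from the chart's templates or its values file: the container name, the choice of a TCP check on the Bolt port 7687, and all numbers are assumptions, and whether the Bolt port opens only after the snapshot download finishes is also assumed. The probe field names themselves (`startupProbe`, `livenessProbe`, `readinessProbe`, `periodSeconds`, `failureThreshold`, `tcpSocket`) are standard Kubernetes fields. In Kubernetes, failing startup and liveness probes cause a container restart, while a failing readiness probe only removes the pod from Service endpoints, so the long budget may belong on the startup probe.

```yaml
# Hypothetical fragment of a pod spec; names and values are illustrative assumptions.
containers:
  - name: neo4j                  # assumed container name
    startupProbe:
      tcpSocket:
        port: 7687               # Bolt port, assumed to be the probe target
      periodSeconds: 10
      failureThreshold: 360      # 360 checks * 10 s = 3600 s (one hour) before a restart
    livenessProbe:
      tcpSocket:
        port: 7687
      periodSeconds: 10
      failureThreshold: 3        # short budget; only evaluated once the startup probe has succeeded
    readinessProbe:
      tcpSocket:
        port: 7687
      periodSeconds: 10
      failureThreshold: 3        # failure marks the pod unready, it does not restart the container
```

With a layout like this, the liveness and readiness probes keep their short timeouts for normal operation, because Kubernetes does not run them until the startup probe has passed. Whether the chart exposes these settings through its values file, and under which keys, would need to be checked against the chart itself.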

To Reproduce Steps to reproduce the behavior:

  1. Create a large database
  2. Kill off one of the nodes and delete its hard drive
  3. Try to recover the deleted node

Expected behavior Recovery should succeed.