neo4j-contrib / neo4j-helm

Helm Charts for running Neo4j on Kubernetes [DEPRECATED]
https://neo4j-contrib.github.io/neo4j-helm/user-guide/USER-GUIDE.html
Apache License 2.0

Neo4j pod is not able to deploy on other node after hosting node is down #52

Closed helanyao closed 4 years ago

helanyao commented 4 years ago

Hi Neo4j experts, we are running an HA test against Neo4j installed from the latest Helm chart, https://github.com/neo4j-contrib/neo4j-helm/releases/download/4.1.0-2/neo4j-4.1.0-2.tgz. One test case shuts down the hosting node and checks whether the cluster is able to scale out onto another healthy node.

After the node was shut down, the pod was stuck in "Terminating" and no new pod was created. I tried with other Helm charts and they were able to create new pods, which suggests the Kubernetes cluster itself is working correctly.

k8s-mk-3 was shut down, and you can see the Neo4j pods are stuck:

root@k8s-master-mk:~# kubectl get pods -o wide
NAME                    READY   STATUS        RESTARTS   AGE   IP             NODE       NOMINATED NODE   READINESS GATES
test-neo4j-core-0       1/1     Terminating   0          44m   10.42.0.1      k8s-mk-3
algotest-neo4j-core-1   1/1     Running       0          44m   10.47.128.12   k8s-mk-2
test-neo4j-core-2       1/1     Terminating   0          44m   10.42.0.2      k8s-mk-3
test-neo4j-replica-0    1/1     Running       0          44m   10.47.128.13   k8s-mk-2
test-neo4j-replica-1    1/1     Terminating   0          44m   10.42.0.3      k8s-mk-3

For Prometheus, you can see that even though its pods on k8s-mk-3 were stuck, new pods were started on other nodes:

root@k8s-master-mk:~# kubectl get pods -o wide | grep prometheus
prometheus-alertmanager-6cc564fddc-ghv57         2/2   Terminating   0   48m   10.42.0.5       k8s-mk-3
prometheus-alertmanager-6cc564fddc-tz9kx         2/2   Running       0   20m   10.46.0.15      k8s-mk-1
prometheus-kube-state-metrics-7b4b4b7b7f-zfvfz   1/1   Running       0   48m   10.47.128.14    k8s-mk-2
prometheus-node-exporter-2vwvc                   1/1   Running       0   48m   10.176.20.193   k8s-mk-2
prometheus-node-exporter-8w652                   1/1   Running       0   48m   10.176.21.120   k8s-mk-3
prometheus-node-exporter-gn4fz                   1/1   Running       0   48m   10.176.20.192   k8s-mk-1
prometheus-pushgateway-7b644975cb-n8p5g          1/1   Terminating   0   48m   10.42.0.4       k8s-mk-3
prometheus-pushgateway-7b644975cb-vf4z4          1/1   Running       0   20m   10.46.0.14      k8s-mk-1
prometheus-server-56784fbdcb-48mz8               2/2   Running       0   20m   10.47.128.15    k8s-mk-2
prometheus-server-56784fbdcb-4zwvf               2/2   Terminating   0   48m   10.42.0.6       k8s-mk-3

The same result applies to Grafana:

root@k8s-master-mk:~# kubectl get pods -o wide
NAME                         READY   STATUS        RESTARTS   AGE     IP           NODE
g-grafana-57bbbdbf77-5gn5g   1/1     Terminating   0          10m     10.42.0.14   k8s-mk-3
g-grafana-57bbbdbf77-znkk9   1/1     Running       0          3m42s   10.46.0.3    k8s-mk-1

Could you help to look into this issue?

moxious commented 4 years ago

Need some more information here.

helanyao commented 4 years ago

> Need some more information here.
>
>   • What method (command) was used to shut down the pod that will not stop?
>   • What does the debug.log file on the "stuck" member say? I want to distinguish between Neo4j being unresponsive and something else still running, and find out what that might be.
>   • Can you describe in more detail what you are trying to do here? You say you shut down one node and scale out to another healthy node. Is this a rolling upgrade or, operationally, what are you trying to accomplish?

The node was shut down from the GCP control panel and the pod stuck in "Terminating" was left untouched. Grafana and Prometheus were able to deploy new pods onto other nodes to maintain ATB.

There is no way to attach to the stuck pod and collect logs because its host node is down.

The test case simulates the real-world situation where the VM hosting a k8s node is shut down unexpectedly, e.g. after losing power. We want to ensure that in this situation the Neo4j cluster can flex up on another healthy node to maintain ATB.
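For reference, the node-failure step could be reproduced from the command line instead of the GCP console; a minimal sketch, assuming gcloud access to the project (the instance and zone names below are placeholders, not taken from this cluster):

# Hard-stop the VM that backs the k8s node, simulating an unexpected outage
gcloud compute instances stop k8s-mk-3 --zone=us-central1-a

# Watch the node go NotReady and the Neo4j pods on it get stuck in Terminating
kubectl get nodes
kubectl get pods -o wide -w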

Here is an example. Say we have a cluster of 3 cores and one of them (core-0) is running on node A. Node A then shows as NotReady because its hosting VM lost power.

What we expect is that a new core-0 can be started on another node even though the old core-0 is stuck.

However, what we observe is that no new core-0 is created and the old core-0 stays in "Terminating". Thus the cluster has only 2 running cores.

moxious commented 4 years ago

@helanyao ok. So the first order of business is to figure out a way to collect the logs, since we probably won't get any further with this issue until we can tell what is happening on that node. One thing I would suggest is changing your config to set NEO4J_dbms_directories_logs=/data/logs so that your logs end up on the PVC, or, if you are using some other logging solution, configuring it to stream logs to a separate service so we can see what is happening.
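As a rough sketch of one way to apply that setting without rebuilding the release (the StatefulSet name test-neo4j-core is inferred from the pod names above and may differ in your install; the chart may also offer its own values hook for extra config):

# Point Neo4j's log directory at the /data PVC so debug.log survives the pod
kubectl set env statefulset/test-neo4j-core NEO4J_dbms_directories_logs=/data/logs

# The pods roll; confirm the variable is present afterwards
kubectl exec test-neo4j-core-0 -- env | grep NEO4J_dbms_directories_logs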

Can you use kubectl logs to get the regular (neo4j.log) logs for the container? What do they say?

This is just a guess, but when the DB shuts down it needs to "checkpoint". If your checkpoint takes a very long time because you have a large TX history or a lot of data, this could cause the node to take a long time to terminate. You might look into your checkpointing configuration.
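If checkpointing does turn out to be the problem, the Neo4j 4.x settings involved are dbms.checkpoint.interval.time, dbms.checkpoint.interval.tx and dbms.checkpoint.iops.limit. A sketch of passing them as container environment variables (the values are illustrative, not recommendations, and the StatefulSet name is assumed from the pod names above):

# Checkpoint more frequently so less work is left to flush at shutdown,
# and allow the checkpointer more IO while it runs
kubectl set env statefulset/test-neo4j-core \
  NEO4J_dbms_checkpoint_interval_time=5m \
  NEO4J_dbms_checkpoint_interval_tx=10000 \
  NEO4J_dbms_checkpoint_iops_limit=600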

moxious commented 4 years ago

In addition to container logs, please try kubectl describe pod and report back with the output.
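Concretely, something like the following, using the pod name from the listing above:

# Pod events often show why a pod is stuck Terminating (volume detach, finalizers, unreachable kubelet)
kubectl describe pod test-neo4j-core-0

# neo4j.log from the current container, and from the previous container if it restarted
kubectl logs test-neo4j-core-0
kubectl logs test-neo4j-core-0 --previous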

helanyao commented 4 years ago

Here is what I observed. When I used Helm chart 4.0.4, the replica was a Deployment and, even with the node down, a replica could be created on another node. However, there was a change and the replica is now a StatefulSet, like the cores: https://github.com/neo4j-contrib/neo4j-helm/commit/c444aa6441e873ea00ccccad06f2ec3e20db8955

It seems that a StatefulSet will not run a replacement instance while the old pod still exists, whereas a Deployment can shift the pod to another node. I also checked Grafana and Prometheus and they are all Deployments, which is why they can start new pods on another healthy node when the hosting node goes down.
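That matches how Kubernetes treats the two controllers: a Deployment's ReplicaSet stops counting a pod once its deletion is requested and starts a replacement elsewhere, while a StatefulSet guarantees at most one pod per ordinal and will not recreate core-0 until the old pod object is fully gone. When the node is known to be dead, the usual generic Kubernetes workaround (not something specific to this chart) is to force-delete the stuck pod or remove the dead node:

# Only do this if the node is really gone, otherwise two core-0 processes could run at once
kubectl delete pod test-neo4j-core-0 --grace-period=0 --force

# Alternatively, deleting the dead node object lets the StatefulSet controller reschedule its pods
kubectl delete node k8s-mk-3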

moxious commented 4 years ago

I'm not sure I understand. It is true that a change was made to turn replicas into a StatefulSet. A big reason we did this is to back read replicas with disks: if you had a very large dataset, it would be very expensive to add and scale replicas in a Deployment.

I don't understand the part about multiple instances. In both cases, you should only have one grouping for replicas, either a statefulset or a deployment. You should never have more than one deployment, unless it is part of a completely separate cluster / install.

helanyao commented 4 years ago

Excerpt from debug.log:

2020-07-21 10:07:28.435+0000 WARN [o.n.k.i.JvmChecker] The dbms.memory.heap.max_size setting has not been configured. It is recommended that this setting is always explicitly configured, to ensure the system has a balanced configuration. Until then, a JVM computed heuristic of 15303704576 bytes is used instead. Run neo4j-admin memrec for memory configuration suggestions.
2020-07-21 10:07:28.435+0000 WARN [o.n.k.i.JvmChecker] The dbms.memory.heap.initial_size setting has not been configured. It is recommended that this setting is always explicitly configured, to ensure the system has a balanced configuration. Until then, a JVM computed heuristic of 989855744 bytes is used instead. Run neo4j-admin memrec for memory configuration suggestions.
2020-07-21 10:07:28.487+0000 INFO [o.n.i.d.DiagnosticsManager] ****
2020-07-21 10:07:28.488+0000 INFO [o.n.i.d.DiagnosticsManager] [ System diagnostics ]
2020-07-21 10:07:28.488+0000 INFO [o.n.i.d.DiagnosticsManager] ****
2020-07-21 10:07:28.493+0000 INFO [o.n.i.d.DiagnosticsManager] --------------------------------------------------------------------------------
2020-07-21 10:07:28.493+0000 INFO [o.n.i.d.DiagnosticsManager] [ System memory information ]
2020-07-21 10:07:28.493+0000 INFO [o.n.i.d.DiagnosticsManager] --------------------------------------------------------------------------------
2020-07-21 10:07:28.507+0000 INFO [o.n.i.d.DiagnosticsManager] Total Physical memory: 58.97GiB
2020-07-21 10:07:28.507+0000 INFO [o.n.i.d.DiagnosticsManager] Free Physical memory: 1.156GiB
2020-07-21 10:07:28.508+0000 INFO [o.n.i.d.DiagnosticsManager] Committed virtual memory: 17.06GiB
2020-07-21 10:07:28.512+0000 INFO [o.n.i.d.DiagnosticsManager] Total swap space: 0B
2020-07-21 10:07:28.513+0000 INFO [o.n.i.d.DiagnosticsManager] Free swap space: 0B
2020-07-21 10:07:28.513+0000 INFO [o.n.i.d.DiagnosticsManager]
2020-07-21 10:07:28.513+0000 INFO [o.n.i.d.DiagnosticsManager] --------------------------------------------------------------------------------
2020-07-21 10:07:28.513+0000 INFO [o.n.i.d.DiagnosticsManager] [ JVM memory information ]
2020-07-21 10:07:28.513+0000 INFO [o.n.i.d.DiagnosticsManager] --------------------------------------------------------------------------------
2020-07-21 10:07:28.514+0000 INFO [o.n.i.d.DiagnosticsManager] Free memory: 809.8MiB
2020-07-21 10:07:28.514+0000 INFO [o.n.i.d.DiagnosticsManager] Total memory: 912.6MiB
2020-07-21 10:07:28.515+0000 INFO [o.n.i.d.DiagnosticsManager] Max memory: 14.25GiB
2020-07-21 10:07:28.523+0000 INFO [o.n.i.d.DiagnosticsManager] Garbage Collector: Copy: [Eden Space, Survivor Space]
2020-07-21 10:07:28.524+0000 INFO [o.n.i.d.DiagnosticsManager] Garbage Collector: MarkSweepCompact: [Eden Space, Survivor Space, Tenured Gen]
2020-07-21 10:07:28.525+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: CodeHeap 'non-nmethods' (Non-heap memory): committed=2.438MiB, used=1.117MiB, max=5.559MiB, threshold=0B
2020-07-21 10:07:28.526+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: Metaspace (Non-heap memory): committed=20.50MiB, used=19.95MiB, max=-1B, threshold=0B
2020-07-21 10:07:28.526+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: Tenured Gen (Heap memory): committed=629.4MiB, used=0B, max=9.829GiB, threshold=0B
2020-07-21 10:07:28.527+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: CodeHeap 'profiled nmethods' (Non-heap memory): committed=3.000MiB, used=2.964MiB, max=117.2MiB, threshold=0B
2020-07-21 10:07:28.527+0000 INFO [o.n.i.d.DiagnosticsManager] Memory Pool: Eden Space (Heap memory): committed=251.8MiB, used=102.8MiB, max=3.932GiB, threshold=?
...
MemberId{74925480}=DiscoveryDatabaseState{DatabaseId{8113c36c}, operatorState=STOPPED, result=SUCCESS}
MemberId{ab129dc0}=DiscoveryDatabaseState{DatabaseId{8113c36c}, operatorState=STARTED, result=SUCCESS}
MemberId{a4593401}=DiscoveryDatabaseState{DatabaseId{8113c36c}, operatorState=STARTED, result=SUCCESS}
2020-07-21 10:50:12.076+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [system] Pausing due to shutdown (count = 2)
2020-07-21 10:50:12.086+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] [system] Requirement Database unavailable makes database system unavailable.
2020-07-21 10:50:12.086+0000 INFO [o.n.k.a.DatabaseAvailabilityGuard] [system] Database system is unavailable.
2020-07-21 10:50:12.089+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] [system] Checkpoint triggered by "Database shutdown" @ txId: 63 checkpoint started...
2020-07-21 10:50:12.120+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] [system] Checkpoint triggered by "Database shutdown" @ txId: 63 checkpoint completed in 30ms
2020-07-21 10:50:12.120+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] [system] No log version pruned. The strategy used was '104857600 size'. Last checkpoint was made in log version 0.
2020-07-21 10:50:12.147+0000 INFO [c.n.c.c.c.l.p.PruningScheduler] [system] PruningScheduler stopping
2020-07-21 10:50:12.149+0000 INFO [c.n.c.n.Server] backup-server: stopping and unbinding from: 0.0.0.0:6362
2020-07-21 10:50:12.255+0000 INFO [c.n.c.n.Server] catchup-server: stopping and unbinding from: 0.0.0.0:6000
2020-07-21 10:50:12.358+0000 INFO [c.n.c.n.Server] raft-server: stopping and unbinding from: 0.0.0.0:7000
2020-07-21 10:50:12.543+0000 INFO [c.n.c.d.a.GlobalTopologyState] The core replicated states for database DatabaseId{00000000} changed previous state was: MemberId{74925480}=DiscoveryDatabaseState{DatabaseId{00000000}, operatorState=STARTED, result=SUCCESS} MemberId{ab129dc0}=DiscoveryDatabaseState{DatabaseId{00000000}, operatorState=STARTED, result=SUCCESS} MemberId{a4593401}=DiscoveryDatabaseState{DatabaseId{00000000}, operatorState=STARTED, result=SUCCESS} current state is: MemberId{74925480}=DiscoveryDatabaseState{DatabaseId{00000000}, operatorState=STOPPED, result=SUCCESS} MemberId{ab129dc0}=DiscoveryDatabaseState{DatabaseId{00000000}, operatorState=STARTED, result=SUCCESS} MemberId{a4593401}=DiscoveryDatabaseState{DatabaseId{00000000}, operatorState=STARTED, result=SUCCESS}
2020-07-21 10:50:12.543+0000 INFO [c.n.c.d.a.GlobalTopologyState] Core topology for database DatabaseId{00000000} is now: [MemberId{ab129dc0}] Lost members :[MemberId{74925480}] No new members
2020-07-21 10:50:13.093+0000 WARN [a.e.DummyClassForStringSources] Cluster Node [akka://cc-discovery-actor-system@discovery-log3-neo4j-2.default.svc.cluster.local:5000] - Marking node(s) as UNREACHABLE [Member(address = akka://cc-discovery-actor-system@discovery-log3-neo4j-0.default.svc.cluster.local:5000, status = Up)]. Node roles [dc-default]

helanyao commented 4 years ago

@moxious here is the log of that container on the down node.

moxious commented 4 years ago

@helanyao closing this issue -- not that it's resolved, but more that discussion continues on slack now.