jetstack / navigator

Managed Database-as-a-Service (DBaaS) on Kubernetes
Apache License 2.0

Pilot doesn't notice when Cassandra process dies #217

Closed wallrj closed 6 years ago

wallrj commented 6 years ago

I killed the Cassandra process using kill inside a Cassandra pod in my test cluster.

I expected the Pilot to exit immediately.

richard@richardw-pet1:~/go/src/github.com/jetstack/navigator$ kubectl --namespace test-cassandra-1516791475-12782 exec  cass-cassandra-1516791475-12782-cassandra-ringnodes-0 -it -- ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root      1263  0.0  0.0  34420  2892 ?        Rs+  11:08   0:00 ps faux
root         1  0.6  0.1  32752 21092 ?        Ssl  10:58   0:04 /shared/pilot -
root        15  0.0  0.0  18068  2904 ?        S    10:58   0:00 /bin/bash /run.
root        32  0.0  0.0  46988  2924 ?        T    10:58   0:00  \_ su cassandr
cassand+    33  5.6  0.0      0     0 ?        Zs   10:58   0:35      \_ [java]

What I got instead was a zombie Java process, and the Pilot kept running until Kubernetes killed it after X failed liveness probes.
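For anyone trying to reproduce this: the `Z` in the STAT column above is the zombie state, meaning the process has exited but its parent has not yet reaped it. A rough sketch of how that state could be checked programmatically on Linux follows. It assumes the documented `/proc/<pid>/stat` layout (`pid (comm) state ...`); the `procState` helper and the sample line are hypothetical and are not part of Pilot.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// procState extracts the one-character process state from the contents of a
// Linux /proc/<pid>/stat file. The command name is wrapped in parentheses and
// may itself contain spaces or parentheses, so we search for the LAST ')'
// and read the state field that follows it.
// Hypothetical helper for illustration only.
func procState(stat string) (byte, error) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 || i+2 >= len(stat) {
		return 0, fmt.Errorf("unexpected stat format: %q", stat)
	}
	return stat[i+2], nil
}

// isZombie reports whether the stat contents describe a zombie process.
func isZombie(stat string) (bool, error) {
	s, err := procState(stat)
	if err != nil {
		return false, err
	}
	return s == 'Z', nil
}

func main() {
	// Made-up sample line shaped like /proc/33/stat for the zombie above;
	// in practice the contents would come from os.ReadFile("/proc/<pid>/stat").
	sample := "33 (java) Z 32 15 15 0 -1 4228108 0 0 0 0"
	zombie, err := isZombie(sample)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("zombie:", zombie)
}
```

A check like this only helps with diagnosis; the zombie goes away once its parent calls wait on it or the parent itself exits.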

/kind bug

wallrj commented 6 years ago

Although this doesn't always happen.

munnerz commented 6 years ago

Could this perhaps be down to Cassandra not having exited properly or something? Pilot now calls Wait() on the process and will exit when Wait returns. It'd be great if we can isolate what conditions cause this.

On Wed, 24 Jan 2018 at 11:24, Richard Wall wrote:

Although this doesn't always happen.

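To make the Wait() behaviour concrete, here is a minimal sketch of the start-then-wait pattern using Go's standard `os/exec` package. It is not Pilot's actual code: `runAndWait` is a hypothetical helper, and the `sh -c "exit 0"` command merely stands in for the real Cassandra launch script.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
)

// runAndWait starts a child process, blocks until that child exits, and
// returns its exit code. Hypothetical helper, not taken from Pilot.
func runAndWait(name string, args ...string) (int, error) {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Start(); err != nil {
		return -1, fmt.Errorf("starting %s: %w", name, err)
	}

	// Wait returns only once the DIRECT child has exited and been reaped.
	err := cmd.Wait()
	if err == nil {
		return 0, nil
	}

	// A non-zero exit status is reported as *exec.ExitError.
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	return -1, err
}

func main() {
	// Stand-in command; a real supervisor would launch the database here.
	code, err := runAndWait("sh", "-c", "exit 0")
	if err != nil {
		fmt.Fprintln(os.Stderr, "supervisor error:", err)
		os.Exit(1)
	}
	fmt.Println("child exited with code", code)
	// Exit together with the child so the container runtime sees the failure.
	os.Exit(code)
}
```

Note that Wait only observes the process that was started directly. If that process is a wrapper (a shell script, or `su`) which stays alive after its own child dies, Wait will not return, and the supervisor keeps running.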

wallrj commented 6 years ago

Looking again at the process tree in the comment above, it occurs to me that the problem was probably the parent su command: it wasn't waiting for the Cassandra process, and so it didn't exit when that process died.

That should be fixed now that #222 is resolved.