nats-io / nats-operator

NATS Operator
https://nats-io.github.io/k8s/
Apache License 2.0

NATS Cluster Pods Gone #208

Open Starefossen opened 5 years ago

Starefossen commented 5 years ago

This happened a couple of days ago in one of our environments. After a while, all of the Pods in the NATS cluster controlled by NATS Operator were completely gone, while NATS Operator itself kept running nominally without errors.

Setup

Events

Time Event
06:10:42 nats-cluster-1 and nats-cluster-3 lose connection with nats-cluster-2 (10.44.0.73) with error connect: no route to host
06:10:46 nats-operator realises that nats-cluster-2 is not working correctly: deleting pod "apps-test/nats-cluster-2" in terminal phase "Failed"
Lots and lots of no route to host messages
09:44:17 nats-cluster-1 loses connection with nats-cluster-3 (10.44.5.55) with error connect: no route to host
09:57:38 last log statement from nats-cluster-1: [ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.apps-test.svc on 10.111.0.10:53: no such host
09:57:39 nats-cluster is completely offline since there are no pods

Observations

There are several observations:

ColinSullivan1 commented 5 years ago

Thank you for filing the issue - this is good information. We’ll take a look ASAP. At first glance it looks like the k8s NATS service (network) failed and the failure wasn't handled appropriately.

CC @wallyqs @variadico

wallyqs commented 5 years ago

Thanks for the report, and sorry for the inconvenience. We're looking into this, and also into moving the operator to use StatefulSets internally instead of the current controller logic.

Starefossen commented 5 years ago

First off, we are extremely happy with NATS, so thank you for the hard work and dedication! ❤️ Secondly, we have determined that the NATS Server Pods were shutting down because they were being terminated by Chaoskube, a process that kills pods randomly. In all previous instances (and we have had it running for a month) the NATS Operator has done its job and re-created the missing pods, but not in this instance.
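
To make the expected behaviour concrete, here is a minimal sketch of the check we would expect a reconcile pass to perform. It is not the operator's actual code. The function name `missing_pods`, the cluster size of 3, and the `<cluster>-<n>` naming starting at 1 are assumptions for illustration, inferred only from the pod names in the timeline above.

```python
def missing_pods(cluster_name, desired_size, running):
    """Return the pod names that should exist but are not currently running.

    Hypothetical helper: assumes pods are named <cluster_name>-1 .. <cluster_name>-N.
    """
    expected = [f"{cluster_name}-{i}" for i in range(1, desired_size + 1)]
    return [name for name in expected if name not in running]


# State after 06:10:46 in the timeline: nats-cluster-2 has been deleted.
print(missing_pods("nats-cluster", 3, {"nats-cluster-1", "nats-cluster-3"}))

# State at 09:57:39: no pods are left, so every pod should be re-created.
print(missing_pods("nats-cluster", 3, set()))
```

In the incident described here, the equivalent of this list stayed non-empty for several hours without any pod being re-created.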