Starefossen opened this issue 5 years ago
Thank you for filing the issue; this is good information. We'll take a look ASAP. At first glance it looks like the k8s NATS service (network) failed and the failure wasn't handled appropriately.
CC @wallyqs @variadico
Thanks for the report, and sorry for the inconvenience. We're looking into this, and also into moving the operator to use StatefulSets internally instead of the current controller logic.
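For anyone following along, here is a minimal sketch of what a StatefulSet-based NATS deployment could look like, written with the Kubernetes Go types. This is not the operator's actual code; the names, labels, and image tag are assumptions. The point of the StatefulSet is that pods keep stable names (nats-cluster-0, nats-cluster-1, ...) and stable per-pod DNS records under the headless service, and the StatefulSet controller re-creates deleted pods on its own:

```go
// Minimal sketch of a StatefulSet-based NATS deployment, built with the
// Kubernetes Go types. All names, labels, and the image tag are illustrative.
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// natsStatefulSet describes a 3-replica NATS cluster bound to the headless
// management service ("nats-cluster-mgmt" in this report).
func natsStatefulSet() *appsv1.StatefulSet {
	labels := map[string]string{"app": "nats", "nats_cluster": "nats-cluster"}
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "nats-cluster",
			Namespace: "apps-test",
		},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: "nats-cluster-mgmt", // headless service providing per-pod DNS
			Replicas:    int32Ptr(3),
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "nats",
						Image: "nats:1.4.1", // server version from the setup list below
						Ports: []corev1.ContainerPort{
							{Name: "client", ContainerPort: 4222},
							{Name: "cluster", ContainerPort: 6222},
						},
					}},
				},
			},
		},
	}
}

func main() {
	ss := natsStatefulSet()
	fmt.Printf("StatefulSet %s/%s: %d replicas behind headless service %q\n",
		ss.Namespace, ss.Name, *ss.Spec.Replicas, ss.Spec.ServiceName)
}
```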
First off, we are extremely happy with NATS, so thank you for the hard work and dedication! ❤️ Secondly, we have determined that the NATS Server Pods were shutting down because they were terminated by Chaoskube, a process that kills pods at random. In every previous instance (and we have had it running for a month) the NATS Operator has done its job and re-created the missing pods, but not in this instance.
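To make the expected behaviour concrete, here is a rough sketch, using client-go, of the kind of pod-counting reconcile a pod-based controller performs, i.e. the job that normally re-creates pods killed by Chaoskube. This is not the operator's real implementation; the label selector, desired size, pod naming, and image are assumptions for illustration only:

```go
// Rough sketch of a pod-counting reconcile: list the pods that belong to the
// cluster and create replacements until the desired size is reached again.
package main

import (
	"context"
	"fmt"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const (
	namespace   = "apps-test"    // namespace from this report
	clusterName = "nats-cluster" // NATS cluster name from this report
	desiredSize = 3              // assumed cluster size
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Count the cluster's pods that are still running.
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: "nats_cluster=" + clusterName, // hypothetical selector
	})
	if err != nil {
		log.Fatal(err)
	}
	running := 0
	for _, p := range pods.Items {
		if p.Status.Phase == corev1.PodRunning {
			running++
		}
	}

	// Re-create pods until the cluster is back at its desired size.
	for i := running; i < desiredSize; i++ {
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				GenerateName: clusterName + "-",
				Namespace:    namespace,
				Labels:       map[string]string{"nats_cluster": clusterName},
			},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{
					Name:  "nats",
					Image: "nats:1.4.1",
				}},
			},
		}
		created, err := client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("re-created pod", created.Name)
	}
}
```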
So this happened a couple of days ago in one of our environments. After a while, all of the Pods in the NATS Operator-controlled NATS Cluster were completely gone, while the NATS Operator itself was running nominally without errors.
Setup
- Kubernetes (GKE): v1.12.7-gke.25
- v0.4.4
- v1.4.1
- v0.6.0
Events
- `nats-cluster-1` and `nats-cluster-3` lose connection with `nats-cluster-2` (10.44.0.73) with error `connect: no route to host`
- `nats-operator` realises that `nats-cluster-2` is not working correctly: `deleting pod "apps-test/nats-cluster-2" in terminal phase "Failed"`
- `nats-cluster-1` loses connection with `nats-cluster-3` (10.44.5.55) with error `connect: no route to host`
- The surviving servers can no longer resolve the route to `nats-cluster-2`: `[ERR] Error trying to connect to route: dial tcp: lookup nats-cluster-2.nats-cluster-mgmt.apps-test.svc on 10.111.0.10:53: no such host` (a minimal reproduction of this lookup failure is sketched below)
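That last route error is ultimately a DNS failure: once the `nats-cluster-2` pod is gone, its per-pod record under the `nats-cluster-mgmt` headless service no longer resolves, so the surviving servers cannot re-dial the route. A minimal sketch that reproduces just the lookup, run from inside the cluster with the hostname taken from the log above:

```go
// Reproduce the DNS lookup that the NATS route dial performs. With the pod
// deleted, this fails the same way the route dial does: "lookup ...: no such host".
package main

import (
	"fmt"
	"net"
)

func main() {
	host := "nats-cluster-2.nats-cluster-mgmt.apps-test.svc"
	addrs, err := net.LookupHost(host)
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("resolved to:", addrs)
}
```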
Observations
There are several observations: