rabbitmq / rabbitmq-peer-discovery-k8s

Kubernetes-based peer discovery mechanism for RabbitMQ

In case k8s API Server goes down unexpectedly, the plugin breaks clustering and isn't capable of recovering #20

Closed: drf closed this issue 6 years ago

drf commented 6 years ago

In a K8S cluster where RMQ 3.7.1 was installed with 3 nodes and the k8s peer discovery plugin, there was a temporary failure in the API server/internal networking, which led to the following events:

    2018-01-24 00:18:06.171 [info] <0.375.0> rabbit on node 'rabbit@10.244.0.12' down
    2018-01-24 00:18:06.639 [info] <0.375.0> Node rabbit@10.244.0.12 is down, deleting its listeners
    2018-01-24 00:18:06.641 [info] <0.375.0> node 'rabbit@10.244.0.12' down: connection_closed
    2018-01-24 00:18:28.971 [info] <0.470.0> Failed to get nodes from k8s - {failed_connect,[{to_address,{"kubernetes.default.svc.cluster.local",443}}, {inet,[inet],nxdomain}]}
    2018-01-24 00:18:28.971 [warning] <0.470.0> Peer discovery: removing unknown node rabbit@10.244.0.12 from the cluster
    2018-01-24 00:18:28.971 [info] <0.470.0> Removing node 'rabbit@10.244.0.12' from cluster
    2018-01-24 00:18:30.986 [info] <0.369.0> Node 'rabbit@10.244.0.12' was removed from the cluster, deleting its connection tracking tables...
    2018-01-24 00:18:32.718 [warning] <0.18522.33> closing AMQP connection <0.18522.33> (10.244.0.25:34683 -> 10.244.1.5:5672, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection
    2018-01-24 00:18:54.780 [warning] <0.879.0> closing AMQP connection <0.879.0> (10.244.0.11:40485 -> 10.244.1.5:5672, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection

The nodes were detached from the cluster and, even though the API server failure was temporary and it recovered some minutes later, they never rejoined the cluster, leaving them hanging in the k8s Service. Other clustered services, such as Cassandra, did not experience the same issue and managed to reconnect successfully:

    INFO 00:58:44 InetAddress /10.244.3.2 is now DOWN
    INFO 00:59:01 Redistributing index summaries
    INFO 01:00:59 Handshaking version with /10.244.3.2
    INFO 01:01:01 Handshaking version with /10.244.3.2
    INFO 01:01:01 InetAddress /10.244.3.2 is now UP

The peer discovery plugin should be able to recover from such failures and restore the cluster state.

michaelklishin commented 6 years ago

Inability to contact a dependency (such as the Kubernetes API endpoint) is considered a fatal event by design. There are no plans to revisit this without a substantial volume of user feedback, which takes time.

Use cluster monitoring tools and recreate the nodes that are not in the state you expect. It's a good idea regardless of whether peer discovery is used.
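
As an illustration, a rough sketch of what that can look like on Kubernetes; the StatefulSet and pod names below are placeholders, not taken from any particular deployment:

    # Inspect cluster membership as RabbitMQ itself sees it
    kubectl exec rabbitmq-0 -- rabbitmqctl cluster_status

    # If a node is gone for good and its pod is wedged, recreate it;
    # the StatefulSet controller brings the pod back under the same name
    kubectl delete pod rabbitmq-2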

michaelklishin commented 6 years ago

Peer discovery: removing unknown node rabbit@10.244.0.12 from the cluster

As Node Cleanup explains, this feature:

  1. Is opt-in. The user chooses to enable it.
  2. Has pretty well-defined risks: if a node does not report its health to the backend, it will be removed by its peers. Don't enable this feature if that is not desired.
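
For reference, this is roughly what the relevant rabbitmq.conf keys look like; the values below are illustrative rather than taken from any particular deployment:

    # Peer discovery via the Kubernetes API
    cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s

    # Opt-in cleanup: with only_log_warning = false, unknown nodes are
    # actually removed; with the default of true they are only logged about
    cluster_formation.node_cleanup.interval = 30
    cluster_formation.node_cleanup.only_log_warning = false
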
drf commented 6 years ago

I see. It's worth noting that this wasn't a production cluster (otherwise I would have noticed the failure well before a week had passed); it was just a field test using the example provided in this repository.

In particular, the behavior you describe is indeed enabled in your example https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/blob/master/examples/k8s_statefulsets/rabbitmq_statefulsets.yaml#L40, probably because you assume nodes might reasonably die because of a scale-down of the StatefulSet or pods getting destroyed, and when they come back up they might have a different IP/node name. Correct me if I'm wrong here.

If that is the case (and that option really is needed on Kubernetes), nodes should be able to recognize a cluster failure and behave accordingly, if not in the plugin then at least in the StatefulSet configuration.

So from my POV, either:

  1. I'm totally wrong in my analysis, and the example should be fixed to use true instead of false for cluster_formation.node_cleanup.only_log_warning.
  2. only_log_warning should actually be false on Kubernetes, and something should be done to ensure pods don't get scheduled or destroyed while there is any issue with node communication.

In case 2, I'm willing to come up with a patch

Thanks

michaelklishin commented 6 years ago

As mentioned in the docs, RabbitMQ never removes nodes from a cluster except for the opt-in feature above (or when the operator uses CLI tools to do that).

Nodes that are already cluster members do not perform peer discovery:

If a node previously was a cluster member, it will try to contact its "last seen" peer for a period of time. In this case, no peer discovery will be performed. This is true for all backends.

This suggests that one (original?) node was removed and another (replacement with a blank database?) failed to join because it failed to contact Kubernetes.

It can be argued that things should fail early and often. Since this plugin has a very specific scope (it's not a cluster monitoring solution, for example), that's the intentional behavior of the peer discovery subsystem at the moment.

michaelklishin commented 6 years ago

@Gsantomaggio is there a reason node cleanup is enabled in the examples? It is probably not essential for a demo. Should we disable it to be on the safe side? Scenarios where an example is deployed straight into production are not at all unheard of :(

michaelklishin commented 6 years ago

@drf unless there are specific reasons for that potentially unsafe feature to be used, we should disable it entirely in the demo. We'd appreciate a patch that does it as long as @Gsantomaggio agrees.

As I mentioned above, I am not ruling out the idea of having N retries for the list_nodes/0 operation in the future but that will only help in one specific scenario. A monitoring solution would cover this and many more cases.

drf commented 6 years ago

@michaelklishin indeed you're right. In fact, it is most likely that the failure was in the network layer rather than in the API server, especially given that other clustered services (e.g. Cassandra) experienced similar node failures.

About node behavior, and to give further context: all the nodes stayed up and the cluster was healthy before the event. The other nodes never tried rejoining. What most likely happened, as a timeline:

  1. Healthy cluster.
  2. The network goes down; nodes lose connectivity and are cleaned up from the cluster, as you correctly pointed out.
  3. When the nodes leave the cluster, they try invoking list_nodes() on the peer discovery backend to find out whether there's a cluster to join, but the API call fails because the API server is unreachable.
  4. RMQ then simply assumes there's no cluster to join and never tries to come back up.

It is interesting that the nodes were still alive, and the pods were alive as well. If RMQ had kept trying to contact those nodes, it would eventually have reached them after a short while, which is also why Cassandra recovered. So again this comes back to what you mentioned in your previous post.

I agree with your point, and I think the best solution is that the cluster should be monitored and pods destroyed accordingly (an operator could probably do an interesting job here). As I said, my interest is exactly in avoiding dragging people who live the hardcore life into the void with a potentially dangerous example :/

drf commented 6 years ago

@michaelklishin calling list_nodes/0 repeatedly (assuming RMQ does that) might well have undesirable effects in other areas of RMQ.

A possible solution (just trying out ideas here) might be to do this only when list_nodes/0 returns an error because the discovery backend (k8s, Consul...) is temporarily unreachable, which already surfaces as an {Error, Reason} return from the backend, and to let the user configure a grace period during which a retry is reasonable. In other words, call list_nodes/0 repeatedly only IF the backend reports an actual error, in which case it can be assumed that repeated calls shouldn't be dangerous.
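
To make that concrete, here is a rough Erlang sketch of the idea. It is not the plugin's actual API: the backend call is passed in as a fun and the module name is made up.

    %% list_nodes_retry_sketch.erl: retry listing nodes only while the
    %% backend reports an error, up to a configurable number of attempts.
    -module(list_nodes_retry_sketch).
    -export([with_retries/3]).

    %% ListNodes is any fun returning {ok, Nodes} or {error, Reason}.
    with_retries(_ListNodes, 0, _IntervalMs) ->
        {error, retries_exhausted};
    with_retries(ListNodes, RetriesLeft, IntervalMs) ->
        case ListNodes() of
            {ok, Nodes} ->
                {ok, Nodes};
            {error, _Reason} ->
                %% Backend temporarily unreachable: wait out the grace
                %% period, then try again instead of giving up for good.
                timer:sleep(IntervalMs),
                with_retries(ListNodes, RetriesLeft - 1, IntervalMs)
        end.

The retry count and interval would map naturally to user-facing configuration.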

Again, I'm just throwing out ideas; I still favor the "you should monitor the cluster for failures" approach, which in some cases is simply necessary to make sure things don't go rotten.

Gsantomaggio commented 6 years ago

Hi @michaelklishin and @drf

In this example I just wanted to remove the unused nodes, and I agree with you that it can be dangerous to use that in production.

We can restore the default behavior by setting cluster_formation.node_cleanup.only_log_warning = true.

At this point, I think we should also review the other rabbitmq-peer-discovery-xxx examples.

drf commented 6 years ago

Hi @Gsantomaggio, that's great to hear. Just to make sure we're not diving into a different problem, consider the following case:

I have a K8S cluster with 4 nodes and an RMQ StatefulSet with 3 replicas and no kind of node affinity. For whatever reason, rabbitmq-2 is deleted and recreated on a different node, so I'm 100% sure it will get a new IP and, consequently, a new node name.

Is it possible that having cluster_formation.node_cleanup.only_log_warning = true might have some unintended side effects in such a situation (for example, the previous node name being expected to rejoin the cluster eventually)?

drf commented 6 years ago

Expanding on this: if what I said above is true (and I suspect it is), there is again a need for an external job/operator that cleans up unused RMQ nodes in the cluster (which might arguably be easier to implement as a K8S CronJob), and this could probably be documented somewhere.
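
As a sketch, the core of such a job could be little more than the following; the pod name is a placeholder, and forget_cluster_node should only ever target nodes that are down and not expected back under the same name:

    # List members and their status as seen by a running node
    kubectl exec rabbitmq-0 -- rabbitmqctl cluster_status

    # Remove a member that is down and will not return under the same name
    kubectl exec rabbitmq-0 -- rabbitmqctl forget_cluster_node rabbit@10.244.0.12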

michaelklishin commented 6 years ago

RabbitMQ intentionally does not try to decide when a node should be removed. It's up to the operator to decide. The node cleanup feature lets you trade off some safety for a higher degree of automation.

drf commented 6 years ago

I see. Then it's probably safer to set that option to true in the example and let users handle failures the way they see fit. In the k8s case, I guess it's safe to say this boils down to avoiding scheduling the pod if the API server isn't reachable, keeping that option set to true, and having a cron job that makes sure nodes that aren't truly expected to rejoin the cluster anytime soon are cleaned up.

Thanks for the constructive discussion! :)