Inability to contact a dependency (such as the Kubernetes API endpoint) is considered to be a fatal event by design. There are no plans to revisit this without a substantial volume of user feedback, which takes time.
Use cluster monitoring tools and recreate the nodes that are not in the state you expect. It's a good idea regardless of whether peer discovery is used.
Peer discovery: removing unknown node rabbit@10.244.0.12 from the cluster
As Node Cleanup explains, this feature is opt-in: by default nodes only log a warning about unknown peers instead of removing them from the cluster.
I see. It's worth noting that the cluster wasn't a production cluster (I would have noticed the failure well before a week had passed) and it was just a field test using the example provided in this repository.
In particular, the behavior you describe is indeed enabled in your example https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/blob/master/examples/k8s_statefulsets/rabbitmq_statefulsets.yaml#L40 , probably because you assume nodes may legitimately go away when the StatefulSet is scaled down or pods get destroyed, and that when they come back up they may have a different IP and therefore a different node name. Correct me if I'm wrong here.
If that is the case (and hence the option is needed on Kubernetes), nodes should be able to recognize a cluster failure and behave accordingly, if not in the plugin itself then at least through the StatefulSet configuration.
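For reference, the cluster formation settings in question look roughly like this in rabbitmq.conf (this is an illustration built from the documented keys, not a copy of the linked file, so the exact values in the example may differ):

```ini
# peer discovery via the Kubernetes API
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s

# node cleanup: with only_log_warning = false, nodes no longer reported by the
# backend are forcibly removed from the cluster instead of merely being logged
cluster_formation.node_cleanup.interval = 10
cluster_formation.node_cleanup.only_log_warning = false
```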
So from my POV:
In case 2, I'm willing to come up with a patch
Thanks
As mentioned in the docs, RabbitMQ never removes nodes from a cluster except for the opt-in feature above (or when the operator uses CLI tools to do that).
Nodes that are already cluster members do not perform peer discovery:
If a node previously was a cluster member, it will try to contact its "last seen" peer for a period of time. In this case, no peer discovery will be performed. This is true for all backends.
This suggests that one (original?) node was removed and another (a replacement with a blank database?) failed to join because it could not contact Kubernetes.
It can be argued that things should fail early and often. Since this plugin has a very specific scope (it's not a cluster monitoring solution, for example), that's the intentional behavior of the peer discovery subsystem at the moment.
@Gsantomaggio is there a reason behind enabling node cleanup in the examples? It is probably not essential for a demo. Should we disable it to be on the safe side? Scenarios where an example is deployed straight into production are not at all unheard of :(
@drf unless there are specific reasons for that potentially unsafe feature to be used, we should disable it entirely in the demo. We'd appreciate a patch that does it as long as @Gsantomaggio agrees.
As I mentioned above, I am not ruling out the idea of having N retries for the list_nodes/0 operation in the future, but that will only help in one specific scenario. A monitoring solution would cover this and many more cases.
@michaelklishin indeed you're right - in fact, it is most likely that the failure was in the network layer rather than in the API server, especially given that other clustered services (e.g. Cassandra) experienced similar node failures.
About node behavior, and to give further context: all the nodes stayed up, and the cluster was healthy before the event. Other nodes never tried to rejoin - what most likely happened, as a timeline:
It is interesting that the nodes were still alive, and the pods were alive as well - in fact, if RMQ had kept trying to contact those nodes it would eventually have reached them after a short while, which is also the reason why Cassandra recovered. So again this comes back to what you mentioned - see the previous post.
I agree with your point and I think the best solution is that the cluster should be monitored and pods destroyed accordingly (an operator could probably do an interesting job here) - as I said, my interest is precisely in avoiding dragging people living the hardcore life into the void with a potentially dangerous example :/
@michaelklishin calling list_nodes/0 repeatedly (assuming RMQ were to do that) might well have unwanted effects in other RMQ areas.
One option (just trying out ideas here) might be to do this only when list_nodes/0 returns an error because the discovery backend (k8s, Consul, ...) is temporarily unreachable (the backend already signals this by returning {error, Reason}), and to let the user configure a grace period during which retries are considered reasonable. In other words, list_nodes/0 would be retried only if the backend reports an actual error, in which case repeated calls can be assumed to be safe.
Again, I'm just throwing out ideas; I still favor the "you should monitor the cluster for failures" approach, which in some cases is simply necessary to make sure things don't go rotten.
Hi @michaelklishin and @drf
In this example, I just wanted to remove the unused nodes, and I agree with you that it can be dangerous to use in production.
We can restore the default behavior by setting cluster_formation.node_cleanup.only_log_warning = true.
At this point, I think we should also review the other rabbitmq-peer-discovery-xxx examples.
Hi @Gsantomaggio, that's great to hear. Just to make sure that we're not diving into a different problem - consider the following case:
I have a K8S cluster with 4 nodes, and a RMQ StatefulSet with 3 replicas and no kind of node affinity. For whatever reason, rabbitmq-2 is deleted and recreated on a different node, hence I'm 100% sure it will get a new IP and, consequently, a new node name.
Is it possible that having cluster_formation.node_cleanup.only_log_warning = true might have some unintended side effects in such a situation (such as the previous node name being expected to rejoin the cluster eventually)?
Expanding on this one - if what I previously said is true (I guess it might be), there is again a need for an external job/operator that cleans up unused RMQ nodes in the cluster (which might arguably be easier to implement through a K8S cronjob), and this could probably be documented somewhere.
RabbitMQ intentionally does not try to decide when a node should be removed. It's up to the operator to decide. The node cleanup feature lets you trade off some safety for a higher degree of automation.
I see. Then it's probably safer to set that option to true in the example and let users handle failures in the way they see fit. In the k8s case, I guess it's safe to say this boils down to not scheduling the pod if the API server isn't reachable, keeping that option set to true, and having a cronjob which makes sure that nodes that aren't truly expected to rejoin the cluster anytime soon are cleaned up.
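For illustration, the manual steps such a cronjob would automate could look roughly like this (the pod name rabbitmq-0 is a placeholder, and rabbit@10.244.0.12 is just the node name from the log line above):

```sh
# inspect cluster membership as seen by a healthy node
kubectl exec rabbitmq-0 -- rabbitmqctl cluster_status

# remove a node that is known to be gone for good and will not rejoin
kubectl exec rabbitmq-0 -- rabbitmqctl forget_cluster_node rabbit@10.244.0.12
```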
Thanks for the constructive discussion! :)
In a K8S cluster where RMQ 3.7.1 was installed with 3 nodes and the k8s peer discovery plugin, there was a temporary failure in the API server/internal networking, which led to the following events:
The nodes were detached from the cluster and, despite the fact that the API server failure was temporary and it recovered a few minutes later, they never rejoined the cluster, leaving them hanging in the k8s service. Other clustered services, such as Cassandra, didn't experience the same issue and managed to reconnect successfully:
The peer discovery plugin should be able to recover from such failures and restore the cluster state.