rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License

[enhancement] add cleanup_failures for pruning dead nodes #51

Closed · odysseus654 closed this issue 7 years ago

odysseus654 commented 7 years ago

I have enabled cluster_cleanup/cleanup_interval on my system to remove dead RabbitMQ nodes when they fail health checks in a Consul cluster (backend=consul).

One issue I have experienced is that the cleanup occurs as soon as a check fails. This turns brief outages into a kind of Russian roulette: cleanup_interval doesn't matter if a 5-second outage happens to coincide with the moment a check runs.

I suggest adding cleanup_failures: the minimum number of consecutive failed checks before a node is pruned. While this won't eliminate the possibility of a short outage causing early pruning (two short outages could still land on consecutive checks), it does reduce the likelihood of it happening.

Examples:
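A rough sketch of what this could look like alongside the existing settings, using the plugin's environment variables (CLEANUP_FAILURES is the proposed, hypothetical knob; the others exist today):

```sh
# existing settings
export AUTOCLUSTER_TYPE=consul     # use the Consul backend
export CLUSTER_CLEANUP=true        # enable removal of dead nodes
export CLEANUP_INTERVAL=600        # seconds between cleanup passes

# proposed (does not exist yet): only prune a node after it has
# failed this many consecutive cleanup checks
export CLEANUP_FAILURES=3
```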

FYI: I am not running this in a production environment. I have increased cleanup_interval to 600 to reduce the chance of this occurring (and of my needing to manually recycle nodes). I am aware I have the option to disable experimental features.

michaelklishin commented 7 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:

  1. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc. (as defined by this team)
  2. We have a certain amount of information to work with

We get at least a dozen questions through various venues every single day, often quite light on details. At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to reproduce an issue, draw a conclusion, or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least by sharing as much relevant information as possible on the list.

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence, we'd be happy to file a new issue.

Thank you.

michaelklishin commented 7 years ago

That's what CONSUL_DEREGISTER_AFTER (DeregisterCriticalServiceAfter in Consul) is for. It kicks in after a service has been marked as being in critical condition. Prior to 0.9.0 the key used incorrect case; I'm not sure whether that mattered to Consul, but it has been corrected.

It's Consul, not this plugin, that keeps track of when it is time to unregister a service. This plugin simply removes nodes unknown to the backend.
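For context, the deregistration timeout is configured on the health check itself when a service is registered with the local Consul agent. A minimal sketch against Consul's HTTP API (the service name, port, and durations here are illustrative, not what the plugin sends verbatim):

```sh
# register a service whose check is auto-deregistered by Consul after
# it has stayed critical for 90s (DeregisterCriticalServiceAfter
# requires Consul >= 0.7)
curl -X PUT http://localhost:8500/v1/agent/service/register -d '{
  "Name": "rabbitmq",
  "Port": 5672,
  "Check": {
    "TTL": "30s",
    "DeregisterCriticalServiceAfter": "90s"
  }
}'
```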

odysseus654 commented 7 years ago

Thank you for your response.

FWIW, this issue was unrelated to DeregisterCriticalServiceAfter: the service was flagged as critical/offline but was never deregistered from Consul.
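(For anyone trying to reproduce this, the state is easy to inspect via Consul's standard HTTP API; the service name below is illustrative.)

```sh
# checks registered for a given service, with their current status
curl http://localhost:8500/v1/health/checks/rabbitmq

# or list every check currently in the critical state
curl http://localhost:8500/v1/health/state/critical
```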

I do have log traces for one side of this transaction, but nowhere near the level of information you are asking for, at least for something that is arguably a race condition.

michaelklishin commented 7 years ago

Now that DeregisterCriticalServiceAfter should work correctly, you should be able to use it too. Yes, there is a race condition between a service status change in Consul and the periodic operations executed by this plugin. However, I do not see how the node could be removed earlier, and removing it a few seconds later should be fine.

Automatic node cleanup leaves a fair amount of debug log messages behind, so perhaps enable debug logging if you run into the same issue with 0.9.0.
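Something along these lines, assuming the plugin's log level knob (check the README of your plugin version for the exact variable name):

```sh
# raise the plugin's own log level before restarting the node;
# AUTOCLUSTER_LOG_LEVEL is an assumption, verify against your version
export AUTOCLUSTER_LOG_LEVEL=debug
```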

odysseus654 commented 7 years ago

I'm still not convinced this has anything to do with DeregisterCriticalServiceAfter, which in the patched version appears to be hardcoded at 4m17s and defaults to "infinite".

I did flag this as "enhancement" since it appears to technically be working "as designed" and follows the behavior described in the documentation.

Here is a ~2-minute log fragment covering the most recent incident, in which the server appears to disappear at around 04:04:30 and attempts to reconnect around 04:05:13: https://gist.github.com/odysseus654/c8d2a4dcb225333b4fc0a83962b739fb

Otherwise, I do recognize you have a lot on your plate, and I appreciate the work you do on this project, whether this issue remains closed or not.

odysseus654 commented 7 years ago

I was not running the most recent version of the plugin when this happened; I recognize that is valid cause to close.

michaelklishin commented 7 years ago

@odysseus654 I didn't mean to suggest that there is nothing to improve here; I just think there are sufficient knobs to tune around automatic node removal. The whole feature is a very sharp blade, and we recommend against it more often than not (though with AWS autoscaling groups it is very useful).

I'm not 100% sure what cleanup_failures, a counter, would really improve over a single Consul-driven cleanup interval. I'm happy to reopen or discuss this on rabbitmq-users.

The Consul backend already has quite a few configurable settings, and we try to be careful about adding even more.

Thank you for understanding that OSS maintainers have dozens of things to worry about at any given moment. Few users express empathy when their issues are closed.