Initial pre-shutdown health checks

michaelklishin commented 4 years ago

Problem Definition

During upgrades, it may be useful to know if now is a good time to shut a node down. If it has ongoing eager sync (classic queues) or active log reinstallation ("catching up" for quorum queues), shutdown ideally should be delayed. It'd also be a good idea to check if the removal of any nodes might cause certain queues or other components using Raft to lose quorum.

Approaches Considered and Lessons Learned

Checking every queue can be quite intrusive and time-consuming. I suggest that we sample up to N queues. There's a good chance that many queue replicas on a node would be syncing after restoring connectivity, so a sample is likely to catch an ongoing sync.
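
The sampling idea could be sketched roughly like this. It is only an illustration in Python, not how rabbitmq-cli implements the check: the queue records, the `syncing` field and both function names are hypothetical.

```python
import random

def sample_queues(queues, n, rng=None):
    """Return up to n queues chosen at random, without replacement."""
    rng = rng or random.Random()
    queues = list(queues)
    if len(queues) <= n:
        # Fewer queues than the sample size: just check all of them.
        return queues
    return rng.sample(queues, n)

def any_syncing(queues, n, rng=None):
    """True if any queue in a sample of up to n reports an ongoing sync."""
    return any(q["syncing"] for q in sample_queues(queues, n, rng))
```

A sample can miss a lone syncing queue, so this trades some accuracy for a bounded cost. That trade-off is reasonable if, as noted above, many replicas tend to be syncing at the same time.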

For quorum queues, a sync operation is an inherent part of the protocol and is not "special". Observing it is therefore harder.

Some suggested that we could compare the leader's commit index with the commit indices of its followers and treat any difference as a sign of catching up. This would work reasonably well at low operation rates, but at tens of thousands of operations per second followers will almost always be a little bit behind (a fraction of a percent) in practice, which means this approach would produce false positives and become counterproductive.
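
To make the false positive concrete, here is a small Python sketch of such a strict comparison. The function name, node names and index values are made up for illustration and are not taken from Ra or RabbitMQ.

```python
def lagging_followers(leader_commit_index, follower_commit_indices):
    """Return {follower: lag} for every follower whose commit index trails the leader's."""
    return {
        name: leader_commit_index - idx
        for name, idx in follower_commit_indices.items()
        if idx < leader_commit_index
    }

# Hypothetical snapshot taken on a busy queue.
leader_index = 1_000_000
followers = {"rabbit@node2": 999_990, "rabbit@node3": 999_975}

# Both followers are flagged even though each trails by only a handful of entries.
lagging = lagging_followers(leader_index, followers)
```

Under sustained load a check like this would report "catching up" almost all the time, so it says very little about whether a shutdown is safe.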

Examples

Examples of the checks:

rabbitmq-diagnostics check_if_node_is_mirror_sync_critical

rabbitmq-diagnostics check_if_node_is_quorum_critical
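
An upgrade script might gate shutdown on both checks, along these lines. This is a sketch that assumes each command exits with a non-zero status when the node should not be shut down yet. `safe_to_shut_down` and its `run` parameter are hypothetical names, with `run` injectable so the logic can be exercised without a broker.

```python
import subprocess

CHECKS = (
    "check_if_node_is_mirror_sync_critical",
    "check_if_node_is_quorum_critical",
)

def safe_to_shut_down(run=subprocess.run):
    """Run both checks; return True only if every one exits with status 0."""
    for check in CHECKS:
        result = run(["rabbitmq-diagnostics", check])
        # Assumption: a non-zero exit status means the check did not pass.
        if result.returncode != 0:
            return False
    return True
```

A caller would typically retry with a delay while this returns False and only then stop the node.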

Discussed with @mkuratczyk and his team, @gerhard, @kjnilsson, @dcorbacho, @Vanlightly and others.

lukebakken commented 4 years ago

@michaelklishin should the new commands be documented? I don't see them in these places:

michaelklishin commented 4 years ago

Now that some time has passed and we haven't removed or modified these, I think so.

lukebakken commented 4 years ago

OK I'll get it

rabbitmq / rabbitmq-cli