Problem Definition
During upgrades, it is useful to know whether now is a good time to shut a node down. If the node has an ongoing eager sync (classic queues) or active log reinstallation ("catching up" for quorum queues), shutdown should ideally be delayed. It would also be a good idea to check whether removing any node might cause certain queues, or other components that use Raft, to lose quorum.
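The quorum part of the check above can be sketched roughly as follows. This is a minimal illustration in Python, not RabbitMQ code: `RaftGroup`, its fields, and `quorum_critical` are hypothetical names, and the sketch assumes that the member list and the set of currently online members are already known for each Raft-based component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RaftGroup:
    """Hypothetical view of one Raft-based component (e.g. a quorum queue)."""
    name: str
    members: frozenset  # all configured member nodes
    online: frozenset   # members that are currently online


def quorum_critical(groups, node):
    """Return the names of groups that would lose quorum if `node` shut down."""
    critical = []
    for group in groups:
        if node not in group.online:
            # Shutting down a node that is not an online member changes nothing.
            continue
        majority = len(group.members) // 2 + 1
        remaining = len(group.online) - 1
        if remaining < majority:
            critical.append(group.name)
    return critical


if __name__ == "__main__":
    all_nodes = frozenset({"n1", "n2", "n3"})
    groups = [
        RaftGroup("qq.a", all_nodes, all_nodes),
        RaftGroup("qq.b", all_nodes, frozenset({"n1", "n2"})),
    ]
    # qq.b already has one member down, so losing n1 would leave 1 of 3 online.
    print(quorum_critical(groups, "n1"))
```

A non-empty result would suggest delaying the shutdown until the missing members are back online.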
Approaches Considered and Lessons Learned
Checking every queue can be quite intrusive and time-consuming. I suggest that we sample up to N queues instead. There's a good chance that many queue replicas on a node will be syncing after connectivity is restored, so a sample is likely to catch an ongoing sync.
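The sampling idea could look roughly like the sketch below. It is an illustration only: `syncing_in_sample`, `is_syncing`, and `max_sample` are hypothetical names, and how a queue's sync state is actually obtained is left to the caller.

```python
import random


def syncing_in_sample(queues, is_syncing, max_sample=50, rng=None):
    """Inspect at most `max_sample` queues and return those that are syncing.

    `is_syncing` is a caller-supplied predicate (an assumption of this sketch)
    that reports whether a single queue has an ongoing sync.
    """
    if rng is None:
        rng = random.Random()
    queues = list(queues)
    if len(queues) > max_sample:
        # Random sample without replacement keeps the cost bounded by N.
        queues = rng.sample(queues, max_sample)
    return [q for q in queues if is_syncing(q)]


if __name__ == "__main__":
    syncing = {"q2"}
    found = syncing_in_sample(["q1", "q2", "q3"], lambda q: q in syncing, max_sample=10)
    print(found)
```

If most replicas on the node are syncing, even a small sample is very likely to contain at least one of them. If only a handful are, the sample can miss them, which is the trade-off made for a cheaper check.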
For quorum queues, a sync operation is an inherent part of the protocol and is not "special". Observing it is therefore harder.
Some suggested that we could compare the leader's commit index with the commit indices of its followers and treat any difference as a sign of catching up. This would work reasonably well at low operation rates, but at tens of thousands of operations per second followers will almost always be a little bit behind (a fraction of a percent) in practice, which means this approach would produce false positives and become counterproductive.
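To make the false-positive problem concrete, here is a small sketch of the strict comparison described above. The function name and the numbers are made up for illustration and are not measurements.

```python
def followers_behind(leader_commit_index, follower_commit_indices):
    """Return {node: lag} for every follower whose commit index trails the leader's."""
    return {
        node: leader_commit_index - index
        for node, index in follower_commit_indices.items()
        if index < leader_commit_index
    }


if __name__ == "__main__":
    # Hypothetical snapshot of a busy quorum queue: n2 is healthy but trails
    # the leader by a few in-flight entries at the moment of the check.
    leader_index = 1_000_000
    followers = {"n2": 999_950, "n3": 1_000_000}
    print(followers_behind(leader_index, followers))
```

In this snapshot `n2` is reported as behind even though its lag is a tiny fraction of the log, so a check that treats any non-empty result as "catching up" would block shutdown on a perfectly healthy node.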
Examples
Some examples:
Discussed with @mkuratczyk and his team, @gerhard, @kjnilsson, @dcorbacho, @Vanlightly and others.