rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/
Other
12.2k stars 3.91k forks source link

A more automated way to maintain a sufficient minimum of replicas for quorum queues and streams #7209

Closed michaelklishin closed 10 months ago

michaelklishin commented 1 year ago

This issue is meant to serve as a single "design document"-like description of our current understanding of how QQ and stream replicas could be grown in an automated way, without breaking installations where replica management is performed using CLI tools, or by other system parts (such as peer discovery or the stream coordinator).

Original issues

7106, #7110, #7201, #7150.

Problem definition

Because quorum queues and streams depend on a majority of replicas being online via Raft, many users would like to maintain a certain minimal number of replicas (in most cases, 3 or 5). Going above the limit is considered to be a non-issue, at least in this particular "design document".

Originally QQ replica management is only available via CLI tools because dynamic replica management is a much more complex issue than it seems and not everyone agrees on how it should be done.

However, 3+ years later and with at least four RabbitMQ-as-a-Service offerings on the market, it is time to make this more automatic.

Similar problems already solved in RabbitMQ

There is a number of similar problems that RabbitMQ already has solutions for:

The set of features outlined in this document ideally should not require significant changes to the behavior of the features above.

How to configure the limit: policy? x-args (optional argument)? rabbitmq.conf?

Policies is a way to configure things in RabbitMQ dynamically, for a group of objects such as queues, streams, and so on.

Runtime parameters is a way to configure global things dynamically for the lifetime of the node process.

rabbitmq.conf settings are (almost all) static and global.

Optional queue arguments is a way for clients to configure things statically (for the lifetime of a queue or stream).

While all of these options could work for determining a minimum number of replicas a QQ should have, optional arguments were selected for initial QQ replica count because this setting seems naturally static to the core team members who mostly work on QQs.

However, policies match a group of QQs, as would a rabbitmq.conf default, which is a nice property for RabbitMQ-as-a-Service scenarios.

As a result, both optional arguments and policies can be useful, and like with many other settings, both can be supported, with optional arguments taking precedence.

Continuously maintained minimum replica count (CMMRC)

Quorum queues, streams and superstreams already have the idea of initial replica count. What we'd like to achieve here is a continuously maintained minimum replica count, which in practice will be the same number as the initial count but is not the same thing semantically.

The continuously maintained minimum replica count (CMMRC) is both a number and a set of internal mechanisms that would add replicas should a cluster node fail, and some replicas become unavailable.

Specifically we seek to have a mechanism which would add a new replica after a period of time from detecting node or QQ/stream replica failure, using a function that would suggest most suitable nodes for the placement (similar to how peer discovery/cluster formation use a pluggable function).

Going above the limit

It's fine for QQs and streams to have one or multiple extra replicas. Those can be manually removed, as long as the total count does not dip below the CMMRC.

Triggering events

RabbitMQ nodes observe peer node failures, including connectivity link failures. Automatic replica addition therefore can be implemented in reaction to them, or using periodic checks.

What is important is to have a failure recovery interval. Stream coordinator only kicks in some 30-60 seconds after it detects a replica failure. This interval should be configurable.

Once the interval has passed, a placement node can be selected using a pluggable function, and a new replica can be added there using the existing mechanisms.

Event-driven or periodic async checks

The CMMRC checks can be performed in response to various cluster events or periodically. The former would avoid extra periodic load. The latter would be easier to implement. Both can be combined if necessary.

When a cluster even such as node failure or permanent removal is triggered, this can trigger a mass addition of replicas to all QQs and streams, consuming all CPU, network link and potentially disk I/O resources on several nodes for a period of time.

A "periodic async" check performed by queues individually will naturally spread this load over time. This periodic async option seems more appealing to the core team for that reason.

Scenarios and Examples

One

A quorum queue qq.1 with initial replica count and CMMRC = 3 is declared in a five node cluster with nodes A, B, C, D, and E. Its replicas are placed on nodes A, B, C.

Node C has suffered an unrecoverable failure and was deleted from the cluster (say, by an operator). In 60 seconds, an internal QQ monitoring check decided that a new replica must be added to adhere to the configured CMMRC. Node D was selected as for new replica placement by a function that was aware of the deployment infrastructure AZs.

A new replica was added so qq.1 was now hosted on nodes A, B and D. Future CMMRC checks discovered that no action was necessary.

Two

A quorum queue qq.2 with initial replica count and CMMRC = 3 is declared in a five node cluster with nodes A, B, C, D, and E. Its replicas are placed on nodes A, B, C.

Node C has suffered a permanent hardware failure. In 60 seconds, an internal QQ monitoring check decided that a new replica must be added to adhere to the configured CMMRC. Node D was selected.

However, at the same time, an external monitoring tool also detected the failure and a human operator decided to manually add a new replica.

A new replica was added to qq.2 manually and another one by the automatic CMMRC check to node E. So qq.2 was now hosted on nodes A, B, D, and E. Future CMMRC checks discovered that no action was necessary.

Three

A quorum queue qq.3 with initial replica count and CMMRC = 3 is declared in a five node cluster with nodes A, B, C, D, and E. Its replicas are placed on nodes A, B, C. The CMMRC check interval was configured to 90 seconds.

Node C was rebooted and came back in 60 seconds, in another 10 qq.3's failed replica rejoined its peers and caught up (performed Raft log delta reinstallation). In 90 seconds, an internal QQ monitoring check discovered that no action was necessary.

qq.3 was now hosted on nodes A, B, and C. Future CMMRC checks discovered that no action was necessary.

illotum commented 1 year ago

This is pretty much what we've been driving towards with the recent issues. Glad it gets a holistic design review. Several considerations:

kjnilsson commented 1 year ago

In scenario One: "Node C has suffered a permanent hardware failure." - how do we know this? The only reasonable way we can know that I think is by the remove being removed from the rabbit nodes.

michaelklishin commented 1 year ago

@kjnilsson correct, we can assume that the operator has decided to remove it. But I primarily wanted to explain that in that example, the node is not coming back.

michaelklishin commented 1 year ago

I will update the design doc with some pros and cons of "event-driven" vs "async periodic" initiation of new replica placement that @kjnilsson and I have discussed.

First attempt failed with a unicorn response from GitHub 😅

michaelklishin commented 1 year ago

@illotum @simonunge @kjnilsson as far as I am concerned, this issue currently reflects the latest in our thinking. Are there any questions or concerns from you that it does not answer or at least mention?

SimonUnge commented 1 year ago

@michaelklishin One concern I can find is the line "As a result, both optional arguments and policies can be useful, and like with many other settings, both can be supported, with optional arguments taking precedence." - as it will complicate things when running the rabbit-broker as a service.

I did an implementation attempt of adding policies, and auto growth that checks the need to grow periodically (configurable interval), and I let the policy take precedence currently.

adamncasey commented 1 year ago

I'm glad this is being considered - thank you for the write up (and PR!). I have some thoughts:

  1. Scenarios/Examples:

    • What is the expectation in a scenario that a node is not explicitly removed from the cluster, but is down for longer than the monitoring period? I am assuming the queue gains another replica - is this true?

    • Example 3 works out, but is quite close to being pretty unfortunate - e.g. if the 90 second timer went off during node C startup. Triggering a timer (on a distribution to spread load) when the node disconnects might avoid unnecessarily synchronising to a new node. That said, if the timer period is much shorter than 90 seconds this isn't an issue.

  2. Event vs period checks: Avoiding cluster overload during already stressful events like network disruptions seems sensible. Distributing queue balancing over a time period seems like a nice way to solve this issue. It might make sense to detect issues via the triggering event, but delay the response in order to spread load?

  3. Configuration I don't have any use cases where configuring this as a queue argument would be useful. Misconfiguring this value could leave queues unavailable during some infra operations/maintenance so exposing it could be annoying.

SimonUnge commented 1 year ago

@adamncasey Just answering from how the current PR works:

  1. No, a 'down' node is still part of the QQ member, so nothing will happen unless the node is explicitly removed from the cluster/qq membership.
  2. Currently, the 'spread' is per QQ, so each QQ will periodically check its own status. Not node wide periodic check.
illotum commented 1 year ago

I might be misreading, but it looks like there is one omission:

7106 intends to eliminate a class of user errors by setting a global override on the minimum initial replication factor. It does not want to be continuous, nor granular like policies.

7110 intends to be a granular, per use-case, continuous, target replication factor.

CMMRC merges both ideas, but unfortunately cannot be used to accomplish them at once. If we set a global operator policy with CMMRC to do the former, we won't be able to override it with per queue CMMRC.


Does it make sense to let CMMRC accomplish #7110, but also have a global config value per #7106? Perhaps dumb it down to a boolean "enforce_minimum_quorum_of_three=true"?

kjnilsson commented 1 year ago

So having pondered this a bit I think this functionality should also perform removals of members which, unless I am missing something, isn't explicitly covered. (I haven't reviewed all PRs yet in which this may be included)

Regarding removal of members it safe to trigger mass removal events when forgetting a node as there is no log catchup to perform it is reasonably cheap to do and helps with availability during further modifications.

The removal work done in response to a forget_node command can fail for all or a subset of quorum queues so also needs to be performed in the additional periodic check that is performed. The check should do this first. i.e check if any of the currently configured members reside on a node that is no longer part of the RabbitMq cluster and if so remove them from the membership configuration. If this is done any further growth would have to wait for the next interval.

I don't think we should add members to replace members on nodes that are down but not "forgotten". It feels like there is a bit too much scope for nasty race conditions which is exactly why this feature has not been implemented yet

Race conditions are still possible of course, especially when RabbitMq cluster modifications are done by automated processes.

SimonUnge commented 1 year ago

@kjnilsson Currently the removal work is not part of the PR. Interesting though, so you are saying there can be a situation where a QQ has a node in its membership, but the node is no longer part of the cluster, right?

kjnilsson commented 1 year ago

@kjnilsson Currently the removal work is not part of the PR. Interesting though, so you are saying there can be a situation where a QQ has a node in its membership, but the node is no longer part of the cluster, right?

Yea currently if a node hosting QQ members is removed those members are still configured in the quorum queue. This functionality should probably deal with any removals before any additional members are added.

kjnilsson commented 1 year ago

I mean the functionality is simple enough - effectively it is just a matter of executing a rabbitmq-queues shrink command before rabbitmqctl forget_node but it also needs to be part of the periodic check. If you allow growth whilst dead members are configured you may end up in a bad place pretty quickly.

SimonUnge commented 1 year ago

@kjnilsson Currently the removal work is not part of the PR. Interesting though, so you are saying there can be a situation where a QQ has a node in its membership, but the node is no longer part of the cluster, right?

Yea currently if a node hosting QQ members is removed those members are still configured in the quorum queue. This functionality should probably deal with any removals before any additional members are added.

Got it, ok I'll add that to the PR as soon as possible. I think that rabbitmqctl forget_node does do a shrink as part of its logic? But yes, I think it will be easy to add an additional check, to loop over the members and see if they are part of the same cluster. Looking into it.

I am curious to hear what you think about the current interval logic, which piggybacks on the existing tick interval, but uses timestamps to delay itself a few ticks.

SimonUnge commented 1 year ago

@kjnilsson Added some logic to find member-nodes not part of the cluster.

kjnilsson commented 1 year ago

AFAIK forget_cluster_node does not also shrink quorum queues (or streams) you have to do this command first. It would probably make sense to either forget_cluster_node optionally does the shrink first or has an option to fail if there are any members of any queue type on the node in question.

On Wed, 15 Feb 2023 at 00:43, Simon @.***> wrote:

@kjnilsson https://github.com/kjnilsson Added some logic to find member-nodes not part of the cluster.

— Reply to this email directly, view it on GitHub https://github.com/rabbitmq/rabbitmq-server/issues/7209#issuecomment-1430589982, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJAHFGPT64D6YQEF7PRH2LWXQRCHANCNFSM6AAAAAAUUF4BNU . You are receiving this because you were mentioned.Message ID: @.***>

-- Karl Nilsson

SimonUnge commented 1 year ago

@kjnilsson Right, perhaps I was a bit unclear. The rabbit_mnesia:forget_cluster_node/x function does not also shrink. But, the ctl command forget_cluster_node does start with running rabbit_mnesia:forget_cluster_node followed by rabbit_quorum_queue:shrink_all. But it is not atomic, so I guess there could be situations where the call to rabbit_mnesia:forget_cluster_node succeeds but the subsequent call to rabbit_quorum_queue:shrink_all does not.

    case :rabbit_misc.rpc_call(node_name, :rabbit_mnesia, :forget_cluster_node, args) do
      ...
      :ok ->
        case :rabbit_misc.rpc_call(node_name, :rabbit_quorum_queue, :shrink_all, [atom_name]) do
          {:error, _} ->
            {:error,
             "RabbitMQ failed to shrink some of the quorum queues on node #{node_to_remove}"}

          _ ->
            :ok
        end
michaelklishin commented 1 year ago

Sounds like we need a couple of new scenarios added based on the examples provided by @adamncasey and the behavior of node removal on the QQ member set.

michaelklishin commented 10 months ago

I'll assume that #8218 has addressed this, at least to the extent we have consensus on. There's a separate umbrella issue for Quorum Queues vNext™, we can add something incremental there.

But it will take some real world operational experience to understand what may be missing and in what kind of environments.