redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

redpanda: reassign replicas off of nodes which have been unavailable for a specific amount of time #3237

Open rkruze opened 2 years ago

rkruze commented 2 years ago

Who is this for and what problem do they have today?

Today, when a node goes down in a Redpanda cluster, all of the partition leaderships it held are transferred to other nodes. However, unless the node is decommissioned via an admin API call, the replicas assigned to that node will not be moved to other nodes in the cluster.

This feature request is for a cluster setting such that, after a node has been unavailable for more than X amount of time, its replicas are reassigned to other nodes in the cluster so the affected partitions are brought back up to full replication.

What are the success criteria?

Success would mean that if a node goes down and stays down for longer than X amount of time (specified by a cluster config, with a default of something like 5 minutes), the replicas assigned to that node are reassigned to nodes that are still up in the cluster.

Why is solving this problem impactful?

This feature will help to automate cluster management and enable things like auto-scaling groups.

Additional notes

This feature is very similar to https://www.cockroachlabs.com/docs/v21.2/architecture/replication-layer#membership-changes-rebalance-repair
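
As a rough sketch of the intended behavior (Python is used only for illustration; the config property, the health-report hook, and the reassignment call are all hypothetical names, not existing Redpanda APIs):

```python
# Hypothetical cluster config property (name and default are illustrative).
NODE_UNAVAILABLE_TIMEOUT_SEC = 300  # ~5 minutes, per the proposed default

last_seen: dict[int, float] = {}  # node_id -> timestamp of last successful health check


def on_health_report(node_id: int, is_alive: bool, now: float) -> None:
    """Record the last time each node was observed alive."""
    if is_alive:
        last_seen[node_id] = now


def nodes_to_drain(now: float) -> list[int]:
    """Nodes that have been unavailable longer than the configured timeout."""
    return [
        node_id
        for node_id, seen in last_seen.items()
        if now - seen > NODE_UNAVAILABLE_TIMEOUT_SEC
    ]


def reconcile(now: float) -> None:
    """Reassign replicas off any node that exceeded the timeout (stubbed)."""
    for node_id in nodes_to_drain(now):
        # A real implementation would enqueue partition moves through the
        # controller, much as a manual decommission does today.
        print(f"reassigning replicas away from node {node_id}")
```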

JIRA Link: CORE-800

jcsp commented 2 years ago

Couple thoughts from dealing with this kind of thing in storage systems:

rkruze commented 2 years ago

> The default timeout should be long enough for typical hardware/OS servicing (i.e., long enough to shut down, click around in iDRAC/iLO to install some new firmware, and start back up, which is something more like 10-20 mins than 5 mins on a lot of servers).

On the point of timeouts, what I've seen work well is the ability to adjust the timeout dynamically via a config option for when you know that updates will occur.
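
For example, the workflow around planned maintenance could look roughly like this (the property name here is hypothetical; `rpk cluster config set` is just one way to change a cluster property at runtime):

```python
import subprocess

# Hypothetical property name; the real implementation may call it something else.
TIMEOUT_PROPERTY = "node_unavailable_timeout_sec"


def set_cluster_config(key: str, value: str) -> None:
    # Assumes rpk is installed and pointed at the cluster.
    subprocess.run(["rpk", "cluster", "config", "set", key, value], check=True)


# Before planned firmware maintenance, raise the timeout so replicas are not
# moved while the node is deliberately offline...
set_cluster_config(TIMEOUT_PROPERTY, "3600")

# ...perform the maintenance, then restore the normal value afterwards.
set_cluster_config(TIMEOUT_PROPERTY, "1200")
```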

> Needs a clear way to view progress and cancel, as data movement can be a very long-running operation.

Agreed with this; we should have a metric that shows the queue of pending replica reassignments, perhaps something like "Queue Replication Pending."
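
Purely as an illustration of that metric idea (using the Python prometheus_client library and a made-up metric name; Redpanda's real metrics are exported by the broker itself):

```python
from prometheus_client import Gauge, start_http_server

# Made-up metric name based on the suggestion above.
queue_replication_pending = Gauge(
    "queue_replication_pending",
    "Replicas waiting to be reassigned off unavailable nodes",
)

if __name__ == "__main__":
    start_http_server(9102)            # demo scrape endpoint
    queue_replication_pending.set(42)  # would be driven by the reassignment queue
```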

We should also have a way to configure a dynamic bandwidth throttle for this operation. You don't want a node going offline to cause more issues in the cluster. Ideally, if the cluster had Shadow Indexing enabled, the other replicas would replicate from the object store rather than from the other nodes.

scallister commented 2 years ago

> We should also have a way to configure a dynamic bandwidth throttle for this operation.

What would this look like? I'm imagining it might be something like an upper total bandwidth bound for the node, combining regular traffic bandwidth and replication bandwidth, and backing off on replication if that upper bound is reached.
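
Something along those lines, e.g. this toy sketch (the cap, the floor, and the function name are placeholders; a real throttle would be dynamic and enforced per shard):

```python
# Toy model of the "upper total bound" idea: regular traffic is measured and
# replication only gets whatever headroom is left under the node-wide cap.

NODE_BANDWIDTH_CAP = 1_000_000_000  # placeholder: ~1 GB/s total per node
REPLICATION_FLOOR = 10_000_000      # placeholder floor so recovery never fully stalls


def replication_budget(regular_traffic_bytes_per_sec: float) -> float:
    """Bytes/sec the reassignment traffic may use under the total cap."""
    headroom = NODE_BANDWIDTH_CAP - regular_traffic_bytes_per_sec
    return max(REPLICATION_FLOOR, headroom)


# A busy node leaves little headroom, so replication backs off:
print(replication_budget(950_000_000))  # 50,000,000 bytes/sec left for replication
```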