redpanda-data / redpanda

Redpanda is a streaming data platform for developers. Kafka API compatible. 10x faster. No ZooKeeper. No JVM!
https://redpanda.com

redpanda: reassign replicas off of nodes which have been unavailable for a specific amount of time #3237

Open rkruze opened 2 years ago

rkruze commented 2 years ago

Who is this for and what problem do they have today?

Today, when a node goes down in a Redpanda cluster, all of the partition leaderships it held are transferred to other nodes. However, unless the node is decommissioned via an admin API call, the replicas assigned to that node will not be moved to other nodes in the cluster.

This feature request is for a cluster setting such that, after a node has been unavailable for more than X amount of time, its replicas are reassigned to other nodes in the cluster so the affected partitions are brought back up to full replication.

What are the success criteria?

Success would mean that if a node goes down and stays down for longer than X amount of time (specified by a cluster config, with a default of something like 5 minutes), the replicas assigned to that node are reassigned to nodes that are still up in the cluster.

Why is solving this problem impactful?

This feature will help to automate cluster management and enable things like auto-scaling groups.

Additional notes

This feature is very similar to https://www.cockroachlabs.com/docs/v21.2/architecture/replication-layer#membership-changes-rebalance-repair
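
As a rough sketch of the intended behavior (Python is used only for illustration; the config property, the health-report hook, and the reassignment call are all hypothetical names, not existing Redpanda APIs):

```python
# Hypothetical cluster config property (name and default are illustrative).
NODE_UNAVAILABLE_TIMEOUT_SEC = 300  # ~5 minutes, per the proposed default

last_seen: dict[int, float] = {}  # node_id -> timestamp of last successful health check


def on_health_report(node_id: int, is_alive: bool, now: float) -> None:
    """Record the last time each node was observed alive."""
    if is_alive:
        last_seen[node_id] = now


def nodes_to_drain(now: float) -> list[int]:
    """Nodes that have been unavailable longer than the configured timeout."""
    return [
        node_id
        for node_id, seen in last_seen.items()
        if now - seen > NODE_UNAVAILABLE_TIMEOUT_SEC
    ]


def reconcile(now: float) -> None:
    """Reassign replicas off any node that exceeded the timeout (stubbed)."""
    for node_id in nodes_to_drain(now):
        # A real implementation would enqueue partition moves through the
        # controller, much as a manual decommission does today.
        print(f"reassigning replicas away from node {node_id}")
```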

JIRA Link: CORE-800

jcsp commented 2 years ago

Couple thoughts from dealing with this kind of thing in storage systems:

rkruze commented 2 years ago

> The default timeout should be long enough for typical hardware/OS servicing (i.e., long enough to shut down, click around in iDRAC/iLO to install some new firmware, and start back up, which is something more like 10-20 mins than 5 mins on a lot of servers).

On the point of timeouts, what I've seen work well is the ability to adjust the timeout dynamically via a config option for when you know that updates will occur.
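
For example, the workflow around planned maintenance could look roughly like this (the property name here is hypothetical; `rpk cluster config set` is just one way to change a cluster property at runtime):

```python
import subprocess

# Hypothetical property name; the real implementation may call it something else.
TIMEOUT_PROPERTY = "node_unavailable_timeout_sec"


def set_cluster_config(key: str, value: str) -> None:
    # Assumes rpk is installed and pointed at the cluster.
    subprocess.run(["rpk", "cluster", "config", "set", key, value], check=True)


# Before planned firmware maintenance, raise the timeout so replicas are not
# moved while the node is deliberately offline...
set_cluster_config(TIMEOUT_PROPERTY, "3600")

# ...perform the maintenance, then restore the normal value afterwards.
set_cluster_config(TIMEOUT_PROPERTY, "1200")
```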

> Needs a clear way to view progress and cancel, as data movement can be a very long-running operation.

Agreed with this; we should have a metric that shows the queue of pending replica reassignments, perhaps something like "Queue Replication Pending."
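
Purely as an illustration of that metric idea (using the Python prometheus_client library and a made-up metric name; Redpanda's real metrics are exported by the broker itself):

```python
from prometheus_client import Gauge, start_http_server

# Made-up metric name based on the suggestion above.
queue_replication_pending = Gauge(
    "queue_replication_pending",
    "Replicas waiting to be reassigned off unavailable nodes",
)

if __name__ == "__main__":
    start_http_server(9102)            # demo scrape endpoint
    queue_replication_pending.set(42)  # would be driven by the reassignment queue
```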

We should also have a way to configure a dynamic bandwidth throttle for this operation. You don't want a node going offline to cause more issues in the cluster. Ideally, if the cluster had Shadow Indexing enabled, the other replicas would replicate from the object store rather than from the other nodes.

scallister commented 2 years ago

> We should also have a way to configure a dynamic bandwidth throttle for this operation.

What would this look like? I'm imagining it might be something like an upper total bandwidth bound for the node, combining regular traffic bandwidth and replication bandwidth, and backing off on replication if that upper bound is reached.
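
Something along those lines, e.g. this toy sketch (the cap, the floor, and the function name are placeholders; a real throttle would be dynamic and enforced per shard):

```python
# Toy model of the "upper total bound" idea: regular traffic is measured and
# replication only gets whatever headroom is left under the node-wide cap.

NODE_BANDWIDTH_CAP = 1_000_000_000  # placeholder: ~1 GB/s total per node
REPLICATION_FLOOR = 10_000_000      # placeholder floor so recovery never fully stalls


def replication_budget(regular_traffic_bytes_per_sec: float) -> float:
    """Bytes/sec the reassignment traffic may use under the total cap."""
    headroom = NODE_BANDWIDTH_CAP - regular_traffic_bytes_per_sec
    return max(REPLICATION_FLOOR, headroom)


# A busy node leaves little headroom, so replication backs off:
print(replication_budget(950_000_000))  # 50,000,000 bytes/sec left for replication
```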