tarantool / vshard

The new generation of sharding based on virtual buckets

Allow partial rebalancing #41

Open Gerold103 opened 6 years ago

Gerold103 commented 6 years ago

The first rebalancer version cannot rebalance a cluster if at least one replicaset contains a bucket in a state other than ACTIVE or SENT. The next rebalancer version must be able to partially rebalance a cluster. For example:

replicaset1 buckets: {1, ACTIVE}, {2, ACTIVE}, {3, SENDING}
replicaset2 buckets: {3, RECEIVING}, {4, ACTIVE}, {5, ACTIVE}
replicaset3 buckets: {6, ACTIVE}, {7, ACTIVE}, {8, ACTIVE}, {9, ACTIVE}
replicaset4 buckets: {10, ACTIVE}, {11, ACTIVE}

In such a case replicaset3 and replicaset4 do not participate in the current rebalancing process, but they can participate in the next one: replicaset3 can move one bucket to replicaset4. The first rebalancer version cannot do that. Blocked by #3.

Gerold103 commented 5 years ago

Possible algorithm: download bucket counts from all storages and calculate the etalon balance as usual, but do not send routes to already busy storages, assuming they already have their routes (see the sketch after the list below). There are two "problems" which can actually be ignored, since they do not break the convergence of the algorithm.

  1. A bucket can be accounted twice, if it is downloaded once as a sending one on the source and then from the destination as an active one. It is a rather rare case, especially when buckets are big enough. In such a case the new routes will send fewer buckets than needed, but in the right direction, so the balance improves anyway. The other buckets will be sent on one of the next rounds.
  2. A bucket can go unaccounted, if it is seen as a receiving one on the destination and then as already garbage on the source. In such a case we will see that the bucket count is less than the total, and will skip this round.
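
A minimal sketch of this step, under the assumptions above; all helper names (download_state, calculate_etalon, build_routes, send_routes) are hypothetical placeholders, not the real vshard internals:

```lua
-- All helpers below are placeholders for illustration only.
local function download_state(rs)
    -- The real code would fetch the counts from the storage.
    return {bucket_active_count = rs.active, is_busy = rs.busy}
end
local function calculate_etalon(states) return {} end
local function build_routes(states, etalon) return {} end
local function send_routes(id, route) print('routes ->', id) end

local function rebalancer_step(replicasets, expected_total)
    local states = {}
    local total = 0
    for id, rs in pairs(replicasets) do
        states[id] = download_state(rs)
        total = total + states[id].bucket_active_count
    end
    -- Caveat 2 above: a missed bucket makes the sum fall below the
    -- expected total, in which case the whole round is skipped.
    if total < expected_total then
        return
    end
    -- The etalon balance is still calculated over all replicasets.
    local routes = build_routes(states, calculate_etalon(states))
    for id, route in pairs(routes) do
        -- The key idea: a busy storage is skipped and keeps the
        -- routes it already has.
        if not states[id].is_busy then
            send_routes(id, route)
        end
    end
end
```

Since the etalon is still computed over the whole cluster, skipping busy storages only delays the remaining transfers; it never sends buckets in the wrong direction.
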
VifleY commented 4 years ago

Hello @Gerold103 . I would like to work on this.

Currently, I believe there are 2 use cases for this feature, assuming your proposed algorithm is implemented:

1) We add a new replicaset while the rebalancer is running. That should work just fine, because we will ignore busy nodes and transfer at least some buckets from non-busy nodes to the new replicaset, speeding up the whole process.

2) We change the weight of an existing replicaset, which creates two more possibilities:

2.1) We change the weight of a non-busy node. That should still work fine: we recalculate etalon_bucket_count for each node and start transitioning at least some of the workload (I believe this should work whether we are increasing or decreasing the replicaset weight).

2.2) We change the weight of a busy replicaset, e.g. if we want to drain a node which currently has some buckets in the RECEIVING state. As I know, there are ways to do it manually (via bucket_send), but according to the README that is not the preferred way. So in that case we will have to wait until all buckets are active to start rebalancing that node again (and that could take some time with the default REBALANCER_IDLE_INTERVAL). I believe that issue can be addressed in the next iteration (by adjusting route building).

As I see it now, to implement it I shall return bucket_transferring_count from rebalancer_download_states to calculate etalon_bucket_count, and then filter out replicasets with transferring buckets before route building (to avoid any changes to them); see the sketch below. I would appreciate it if you could tell me whether I am missing anything.
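
For illustration, a tiny sketch of the data shape this implies; the field names are hypothetical, and the real rebalancer_download_states may report things differently:

```lua
-- Hypothetical extended state: besides the ACTIVE bucket count, the
-- storage also reports how many buckets are in SENDING/RECEIVING.
local states = {
    rs1 = {bucket_active_count = 1500, bucket_transferring_count = 20},
    rs2 = {bucket_active_count = 1480, bucket_transferring_count = 0},
}

-- Keep only the replicasets with no transferring buckets before route
-- building, so that no new routes are assigned to busy ones.
local function filter_free(all)
    local free = {}
    for id, s in pairs(all) do
        if s.bucket_transferring_count == 0 then
            free[id] = s
        end
    end
    return free
end

for id in pairs(filter_free(states)) do print(id) end -- prints: rs2
```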

I have some issues testing this locally, so it may take a while for me to make things work properly.

Gerold103 commented 4 years ago

Hi! The most important use case is that some nodes may face errors while applying routes. These may include network errors, or conflicts in space data (duplicates, for instance). This happens relatively often, in my experience in customer support. Your use cases are rather rare.

Talking of the algorithm which I proposed - this is just one possible solution. You are free and welcome to create your own. My version is unlikely to be the best.

Talking of the testing problems - vshard uses the table.deepcopy function, which is currently broken in Tarantool core. I can't run some tests either. See https://github.com/tarantool/tarantool/issues/4770.

Talking of your examples:

  1. Adding a new replicaset while the others work. Unfortunately, my algorithm won't work just fine here, because the other replicasets should start sending buckets to the new replicaset. But they won't, because they are busy. We probably need to be able to cancel rebalancing for that case, and restart it with new routes. The ability to cancel rebalancing is my old dream, which would make a cluster more flexible for topology changes.
  2.1. Changing the weight of a non-busy node. My algorithm won't work fine for the same reason as in (1). If you change the weight to a bigger value, the other replicasets should start sending buckets to the changed one. But they won't.
  2.2. Basically the same.

On the whole, I would try to add a cancellation/restart ability to the algorithm. That would allow stopping a rebalancing which is not going to lead the cluster towards a more balanced state anyway, applying the cancellation every time we see that the topology or weights have changed, and then restarting the rebalancing with new routes. The issue here is that the rebalancer is stateless. It means you can't detect that weights or topology have changed, because the rebalancer does not remember any state and therefore can't compare two states of the cluster. You can try to make it stateful. That would help to react more precisely to changes in the cluster.

I will think about this more in the background. Currently I have an idea: we make the rebalancer calculate routes even for busy nodes, and send them just as usual. A busy node, having received routes, checks whether the received routes contradict the routes currently being applied. If they do, the node restarts rebalancing with the new routes. We can do that because the route applier is not stateless: it remembers which routes are being applied now. This way we basically bring the cancellation ability to the nodes instead of the rebalancer. I am still not sure, though, how the rebalancer should account for SENDING and RECEIVING buckets. My original algorithm is not really good and specific in that part.
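
A sketch of that applier-side check, assuming routes are maps from destination replicaset id to a bucket count; all names here are hypothetical:

```lua
-- Returns true when the newly received routes differ from the routes
-- the applier is currently working on.
local function routes_conflict(current, new)
    for dst, count in pairs(new) do
        if current[dst] ~= count then
            return true
        end
    end
    for dst in pairs(current) do
        if new[dst] == nil then
            return true
        end
    end
    return false
end

-- Placeholders for the applier fiber hooks.
local function start_applying(applier) print('start applying') end
local function restart_applying(applier) print('restart with new routes') end

local function on_new_routes(applier, new_routes)
    if applier.current_routes == nil then
        applier.current_routes = new_routes
        start_applying(applier)
    elseif routes_conflict(applier.current_routes, new_routes) then
        -- The cancellation lives on the node, not in the rebalancer:
        -- drop the old routes and restart with the new ones.
        applier.current_routes = new_routes
        restart_applying(applier)
    end
    -- Matching routes mean the current rebalancing is already
    -- converging to the same state; nothing to do.
end
```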

Gerold103 commented 4 years ago

Another option - let's make the rebalancer send small route sets. In that case each pack of routes would be applied earlier, and the rebalancer would be able to calmly check the cluster state more often. However, this may slow down the main path of rebalancing, when everything is fine, because we will have gaps between the steps.
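
A tiny illustration of the splitting, with a made-up pack size:

```lua
-- ROUTE_PACK_SIZE is a hypothetical constant for illustration.
local ROUTE_PACK_SIZE = 100

-- Split one big route of `count` buckets into small packs, so the
-- rebalancer can re-check the cluster state between packs.
local function split_into_packs(count)
    local packs = {}
    while count > 0 do
        local pack = math.min(count, ROUTE_PACK_SIZE)
        table.insert(packs, pack)
        count = count - pack
    end
    return packs
end

print(table.concat(split_into_packs(250), ', ')) -- 100, 100, 50
```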

Gerold103 commented 4 years ago

Another option. Let each node keep a value called is_in_balance, initially false everywhere. When the rebalancer wakes up, it checks the balance in the cluster. If the balance is ok, it sends a command to all nodes to set is_in_balance = true. New nodes have is_in_balance = false, and the flag is reset to false when the weight is changed. When the rebalancer wakes up again and sees is_in_balance = false anywhere, it sends a route cancellation request to all nodes, builds routes, and sends them. When a node receives routes to apply, it sets is_in_balance = true immediately and starts applying the routes. The key difference is that the node won't stop applying the routes in case of an error. Instead it will retry until the rebalancer sends a cancellation request due to a newer cluster change. This is the most promising solution IMO. We can keep the rebalancer really simple, just like now; we keep it stateless; and we introduce a rebalancing cancellation feature.
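
A sketch of one rebalancer wakeup under this scheme; every helper here is a hypothetical placeholder, not a real vshard call:

```lua
local function cluster_is_balanced(rs_list) return false end        -- placeholder
local function build_routes(rs_list) return {} end                  -- placeholder
local function send_cancel(rs) print('cancel ->', rs.id) end        -- placeholder
local function send_routes(rs, route) print('routes ->', rs.id) end -- placeholder

local function rebalancer_wakeup(rs_list)
    if cluster_is_balanced(rs_list) then
        -- The balance is fine: remember that fact on every node.
        for _, rs in ipairs(rs_list) do
            rs.is_in_balance = true
        end
        return
    end
    -- Not balanced: check whether the flag was reset anywhere
    -- (a new node appeared, or a weight was changed).
    local any_reset = false
    for _, rs in ipairs(rs_list) do
        if not rs.is_in_balance then
            any_reset = true
            break
        end
    end
    if not any_reset then
        return -- An old rebalancing is still converging; let it finish.
    end
    -- Cancel the routes being applied everywhere and start over.
    for _, rs in ipairs(rs_list) do
        send_cancel(rs)
    end
    local routes = build_routes(rs_list)
    for _, rs in ipairs(rs_list) do
        -- The node sets is_in_balance = true on receiving routes and
        -- retries applying them until a newer cancellation arrives.
        send_routes(rs, routes[rs.id])
    end
end
```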

VifleY commented 4 years ago

> Currently I have an idea: we make the rebalancer calculate routes even for busy nodes, and send them just as usual. A busy node, having received routes, checks whether the received routes contradict the routes currently being applied. If they do, the node restarts rebalancing with the new routes.

Basically I was thinking about the same idea, but I believe the current implementation of rebalancer_build_routes does not guarantee that we will get exactly the same route set on different runs (due to pairs() iteration). That seems quite easy to fix (see the sketch below), but it still requires a more complex route-building algorithm that takes into account which buckets are being transferred right now and which are available for transfer. The advantage of this way is that it is quite easy to test, and we keep the bucket transferring process the same, though the logic may be quite complex.
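
One simple way to make the iteration order deterministic (a generic Lua idiom, not vshard code) is to sort the keys first:

```lua
-- Iterate a table in sorted key order instead of relying on pairs(),
-- whose iteration order is unspecified in Lua.
local function sorted_pairs(t)
    local keys = {}
    for k in pairs(t) do
        table.insert(keys, k)
    end
    table.sort(keys)
    local i = 0
    return function()
        i = i + 1
        local k = keys[i]
        if k ~= nil then
            return k, t[k]
        end
    end
end

-- Identical input now always yields the same iteration order, so two
-- rebalancer runs over the same state would produce the same routes.
local counts = {rs2 = 10, rs1 = 30, rs3 = 20}
for rs_id, count in sorted_pairs(counts) do
    print(rs_id, count)
end
```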

Still, I like your proposed idea of rebalance cancellation. However, I am concerned about the infinite retry while applying routes. Shouldn't it be limited by some configuration parameter with a sane default value? Exceeding this limit would mean serious problems in the cluster, and we should notify about it. After that the node would be marked with is_in_balance = false and would retry on the next rebalance round.
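
A sketch of such a limit; the option name rebalancer_max_apply_retries is made up for illustration and is not a real vshard parameter:

```lua
local cfg = {rebalancer_max_apply_retries = 10} -- hypothetical option

-- Placeholder for the real route application, which may fail.
local function apply_routes(routes) error('bucket conflict', 0) end

local function apply_with_limit(node, routes)
    for attempt = 1, cfg.rebalancer_max_apply_retries do
        local ok, err = pcall(apply_routes, routes)
        if ok then
            return true
        end
        print(('apply attempt %d failed: %s'):format(attempt, err))
    end
    -- Exceeding the limit likely means serious problems in the
    -- cluster: report it, drop the flag, and let the next rebalance
    -- round retry.
    node.is_in_balance = false
    return false
end
```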

I will try to work on the cancellation approach and report back if I am able to make any progress.

VifleY commented 4 years ago

And I think is_in_balance has to be updated on a rebalancer_disbalance_threshold change, but that could lead to some inefficient cancellations. For example: we add a new replicaset, the rebalancer initiates a bucket transfer, and then the threshold is changed, aborting the started process. Not sure if that is an important case.