tarantool / vshard

The new generation of sharding based on virtual buckets

Doubled buckets #412

Open Serpentian opened 1 year ago

Serpentian commented 1 year ago

The situation when buckets with the same bucket_id and active status appear on different shards is pretty common. It stops the rebalancing process, as the total number of buckets becomes greater than it should be according to the configuration.


The process of restoring the cluster is pretty tedious and may involve data loss:

  1. Use this snippet on the router connected to the damaged cluster to find all doubled buckets (a hedged sketch of such a helper is given right after this list). The output will look something like this:

    unix/:./data/router_1.control> find_doubled()
    ---
    - - count: 2
        info:
        - {'status': 'active', 'id': 1}
        - {'status': 'active', 'id': 1}
        uuids:
        - ac522f65-aa94-4134-9f64-51ee384f1a54
        - cbf06940-0790-498b-948d-042b62cf3d29
    ...

    In this example only one bucket (with bucket_id = 1) is doubled; most likely you'll have many more of them. The output shows the UUIDs of the replicasets on which the buckets are duplicated.

  2. Go to the masters of the above-mentioned replicasets and decide which of the buckets is newer. On the master where the data is older, delete the bucket together with its data by calling full_delete_bucket(<bucket_id>) from this snippet (a hedged sketch follows this list). Note that it deletes all data with this bucket_id in all sharded spaces!
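
For reference, below is a minimal sketch of what a find_doubled() helper for step 1 could look like. The original snippet is not attached to this issue, so the helper name, the use of vshard.storage.buckets_info() on each master, and the exact result layout are assumptions rather than the original code.

    -- A hedged sketch of a find_doubled() helper (assumed name). Run it on the
    -- router. It collects bucket info from the master of every replicaset and
    -- reports bucket_ids that are 'active' on more than one of them.
    function find_doubled()
        local by_id = {}
        for uuid, rs in pairs(vshard.router.routeall()) do
            -- Assumes vshard.storage.buckets_info() returns the contents of the
            -- _bucket space as a map {bucket_id -> {id = ..., status = ...}}.
            local info, err = rs:callrw('vshard.storage.buckets_info', {})
            if info == nil then
                error(err)
            end
            for _, bucket in pairs(info) do
                if bucket.status == 'active' then
                    local entry = by_id[bucket.id]
                    if entry == nil then
                        entry = {count = 0, info = {}, uuids = {}}
                        by_id[bucket.id] = entry
                    end
                    entry.count = entry.count + 1
                    table.insert(entry.info, {id = bucket.id, status = bucket.status})
                    table.insert(entry.uuids, uuid)
                end
            end
        end
        -- Keep only the buckets which are active on more than one replicaset.
        local result = {}
        for _, entry in pairs(by_id) do
            if entry.count > 1 then
                table.insert(result, entry)
            end
        end
        return result
    end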
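
A hedged sketch of the full_delete_bucket() helper for step 2 is below. It assumes that vshard.storage.sharded_spaces() lists every sharded space and that each of them has an index named bucket_id (the vshard convention). Run it only on the master holding the older copy of the bucket, because it irreversibly deletes all data with this bucket_id.

    local key_def = require('key_def')

    -- A hedged sketch of full_delete_bucket() (assumed name). Deletes the
    -- _bucket row and every tuple with this bucket_id from all sharded spaces
    -- on the current master.
    function full_delete_bucket(bucket_id)
        for _, space in pairs(vshard.storage.sharded_spaces()) do
            -- Use the primary key definition to extract keys from tuples.
            local pk = key_def.new(space.index[0].parts)
            -- Collect the keys first to avoid deleting under the iterator.
            local keys = {}
            for _, tuple in space.index.bucket_id:pairs({bucket_id}) do
                table.insert(keys, pk:extract_key(tuple))
            end
            for _, key in ipairs(keys) do
                space:delete(key)
            end
        end
        -- On recent vshard versions deleting a non-garbage bucket is forbidden
        -- by the _bucket triggers; see the later comments in this issue.
        box.space._bucket:delete({bucket_id})
    end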


The last time this problem happened was on version 0.1.21 with the following setup: 30000 buckets, 8 shards with 2 replicas each, one sharded asynchronous space on vinyl. The weight of one replicaset was dropped to 0, after which the rebalancing process stopped at some point with the error Total active bucket count is not equal to total. Possibly a boostrap is not finished yet. Expected 30000, but found 30005. Unfortunately, no logs were left. It's definitely a bug in the rebalancer or the GC fiber; we should try to reproduce and fix it.


Associated tickets:

  1. https://github.com/tarantool/vshard/issues/214
  2. https://github.com/tarantool/vshard/issues/414
R-omk commented 1 year ago

@Serpentian , was synchronous replication enabled (is_sync = true) for any spaces involved in the rebalancing?

Serpentian commented 1 year ago

@R-omk, I don't have such information. I forgot to mention that only one space was sharded, and it was on vinyl, if that helps in any way. Why? Is there any problem with synchronous spaces during rebalancing?

R-omk commented 1 year ago

Is there any problem with synchronous spaces during rebalancing?

The rebalancing of synchronous spaces is very likely to fail.

https://github.com/tarantool/tarantool/issues/8505

Serpentian commented 1 year ago

It indeed can be part of the problem, but not in the recent case: the space was asynchronous, so we still need to find the problem. Thanks a lot anyway.

Gerold103 commented 1 year ago

I can think of 3 ways to get a duplicate bucket.

1 - https://github.com/tarantool/vshard/issues/214, it is already known. But I don't know how likely it is to happen in reality.

2 - If manual vshard.storage.bucket_send() is used. Imagine this situation: storage S1 has bucket B. It is sent to S2, but then the connection breaks. The bucket is in the state S1 {B: sending-to-s2}, S2 {B: active}. Now if the user does vshard.storage.bucket_send(S2 -> S3), we will get this: S1 {B: sending-to-s2}, S2: {}, S3: {B: active}. When the recovery fiber wakes up on S1, it will see that B is sending-to-s2 but S2 doesn't have the bucket. Recovery will then assume that S2 has already deleted B and will recover it on S1. Now we have S1 {B: active} and S3 {B: active}. This is quite easy to achieve if one uses manual vshard.storage.bucket_send() (see the call-sequence sketch after scenario 3). When automatic rebalancing is used, it shouldn't happen normally, because the rebalancer won't try to apply new routes while there are sending or receiving buckets anywhere.

3 - The assumption in the last sentence of [2] might break, probably due to master changes. Imagine that a replicaset had replicas R1 (the master) and R2. They lost the connection between each other. R1 has a bucket which is sending/receiving but has already been recovered to the active state in another replicaset. Now R1 is demoted and R2 becomes master. It doesn't have this bucket in any state, and the rebalancer will happily apply new routes. Then we might get the same situation as in [2]. Or something like that.
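
To make scenario [2] concrete, here is a hedged sketch of the manual call sequence that produces the duplicate. B, S2_UUID and S3_UUID are placeholders for the bucket id and the destination replicaset UUIDs; the timeouts are arbitrary.

    -- On S1: send bucket B to S2. The connection breaks after S2 has already
    -- activated the bucket, so S1 is left with B in the SENDING state.
    vshard.storage.bucket_send(B, S2_UUID, {timeout = 10})

    -- On S2: before recovery on S1 has run, manually send the same bucket to S3.
    -- Now S2 no longer has B, and S3 has it ACTIVE.
    vshard.storage.bucket_send(B, S3_UUID, {timeout = 10})

    -- Later the recovery fiber on S1 sees that S2 doesn't have bucket B,
    -- assumes the transfer never completed, and recovers B to ACTIVE on S1:
    -- B is now ACTIVE on both S1 and S3.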

Anyway, it might be helpful to finally introduce a bucket generation persisted in _bucket to see which duplicate is newer.

And it might be time to rethink the recovery and/or the bucket sending process. Maybe the fix would be as simple as making vshard.storage.bucket_send() turn the local bucket to SENT first and only then the remote bucket to ACTIVE. Currently it is vice versa. Then [2] becomes impossible. If the connection broke, the local SENT bucket would eventually be deleted, and the remote RECEIVING bucket would be turned to ACTIVE by recovery once the connection is back.

Serpentian commented 1 year ago

If you don't mind, I'll create a ticket for manual bucket_send from what you've written and fix it.

Let's also link the tickets associated with doubled buckets to this one, as we should not close the current one: we don't really know when it's fully fixed.

Serpentian commented 3 weeks ago

Since vshard now prohibits deleting non-garbage buckets, you can use box.space._bucket:run_triggers(false). After deleting the bucket you should update the cache:

vshard.storage.internal.bucket_count_cache = box.space._bucket:count()
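
Put together, the manual removal on a recent vshard would look roughly like this (re-enabling the triggers afterwards is my assumption, not something stated above):

    -- bucket_id is the id of the stale bucket copy; run this on its master.
    box.space._bucket:run_triggers(false)  -- allow deleting a non-garbage bucket
    box.space._bucket:delete({bucket_id})
    box.space._bucket:run_triggers(true)   -- turn vshard's triggers back on
    -- Refresh the cached bucket count after the manual delete.
    vshard.storage.internal.bucket_count_cache = box.space._bucket:count()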

Gerold103 commented 3 weeks ago

Since vshard now prohibits deleting non-garbage buckets, you can use box.space._bucket:run_triggers(false). After deleting the bucket you should update the cache:

vshard.storage.internal.bucket_count_cache = box.space._bucket:count()

Hm? Is it related to this ticket?

Serpentian commented 3 weeks ago

Hm? Is it related to this ticket?

Since this ticket includes the basic instructions on recovering a cluster after doubled buckets, I added this here) Today we encountered this situation one more time. It looked very similar to https://github.com/tarantool/vshard/issues/214

sergepetrenko commented 3 weeks ago

That time it happened on vshard 0.1.26, where the problem of manual bucket_send (#414) is presumably fixed.

Here's what happened roughly:

  1. A new replica set was added to the cluster.
  2. The rebalancer started sending data to the new replica set, which presumably caused performance problems on the whole cluster (the actual rebalancer settings are not known, but most likely are the default rebalancer_max_sending = 15, rebalancer_max_receiving = 100).
  3. When the performance degradation was noticed, the new replica set's weight was changed to zero to revert the change. This happened 15-30 minutes after the new replica set was introduced.
  4. By that time 13 buckets had been transferred to the new replica set. It's interesting that there were also 13 "old" replica sets which transferred the buckets, so it seems like 1 bucket was transferred from each replica set.
  5. The buckets failed to transfer back due to bucket GC issues: they were still in the sent state on the original replica sets (#490).
  6. After some unclear actions (the admins say they tried restarting the new replica set a couple of times), 10 buckets were transferred back to old replica sets successfully, but 3 buckets remained and failed to transfer.
  7. The remaining 3 buckets were still in the sent state on old replica sets, so it was decided to drop the new replica set and replace the bucket state from sent to active manually on the old replica sets (see the sketch after this list).
  8. Later on it was discovered that 10 doubled buckets had appeared: the ones that "magically" returned to the old replica sets due to the sequence of new replica set restarts.
  9. The doubled buckets were fixed as described in this ticket.
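
For reference, the manual status change from step 7 is typically a direct update of the _bucket tuple on the old master; status is the second field of _bucket {id, status, destination}. This is a sketch of the kind of intervention described above, not a recommendation, and on recent vshard the _bucket triggers may have to be disabled first (see the run_triggers() comment above).

    -- On the old master: flip the stale 'sent' bucket back to 'active'.
    -- bucket_id is a placeholder for the affected bucket.
    box.space._bucket:update({bucket_id}, {{'=', 2, 'active'}})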

P.S. an important addition to the original recipe:

Go to the masters of the above-mentioned replicasets and decide which of the buckets is newer. On the master where the data is older, delete the bucket together with its data by calling full_delete_bucket() (it will delete all data with this bucket_id in all sharded spaces!)

Most likely both bucket versions will have some updates. In yesterday's scenario we saw that for the same bucket some routers pointed to one replica set and others to the other. So the updates made during this time were also split between the two versions of the bucket; there was no "newer" bucket.

Gerold103 commented 3 weeks ago

AFAIU, #414 remains fixed; it was about manual bucket_send, not automatic. In the scenario above: 1) deleting active buckets and resurrecting sent buckets eliminates any kind of automatic guarantees. A bucket could actually be active somewhere but wasn't seen or was missed, and activating an older sent bucket would of course lead to 2 active buckets; 2) the multiple instance restarts make me think that reason 3 from https://github.com/tarantool/vshard/issues/412#issuecomment-1516911263 had something to do with it. In that case any solution would need to involve synchronization inside the replicaset on each bucket status change. To put it short: we need to make _bucket synchronous.