tarantool / vshard

The new generation of sharding based on virtual buckets

Bucket might become inconsistent between replicas #377

Closed Gerold103 closed 1 year ago

Gerold103 commented 1 year ago

After #173, buckets are protected from GC while replicas hold refs on them, and they can't change to a bad status while there are running requests. For example, a bucket can't be deleted if it has any refs.
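The ref protection can be sketched as a small model. This is not the actual vshard Lua code, just a hypothetical illustration: `Bucket`, `refs`, and `try_gc` are invented names, and the status set is simplified.

```python
# Model of ref-gated garbage collection: a bucket pinned by running
# requests (refs > 0) is never deleted, and GC only touches buckets
# that have already left the ACTIVE state.

ACTIVE, SENDING, SENT, GARBAGE = "ACTIVE", "SENDING", "SENT", "GARBAGE"

class Bucket:
    def __init__(self, bucket_id, status=ACTIVE):
        self.id = bucket_id
        self.status = status
        self.refs = 0  # running requests pin the bucket via refs

def try_gc(bucket):
    """Return True only if the bucket may be deleted by GC."""
    if bucket.refs > 0:
        return False  # protected: requests are still running
    if bucket.status not in (SENT, GARBAGE):
        return False  # an ACTIVE bucket is never deleted by GC
    return True
```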

However, it seems that a couple of replicas might get the same tuple in _bucket in different states. The commit 96588fa9acff7085f43de95d215bbda717b24a72 describes such a situation. However, it needs to be altered, because now it can't affect requests running on the bucket:

- Node1 is a master, node2 is a replica, bucket1 is ACTIVE on both;
- Node1 sends bucket1 to another replicaset;
- The bucket becomes SENT on both nodes;
- Node1 and node2 lose replication, but netbox still works;
- Node1 and node2 drop each other from config and from replication;
- Each of them becomes master and deletes the SENT bucket;
- Node2 receives the bucket from another replicaset again. Now it is ACTIVE here and not present on node1 at all. It wasn't accessed on node2, so no ref exists;
- On the other replicaset the bucket becomes SENT and then it is deleted;
- Node1 and node2 again get their old config and restore the replication;
- Node2 receives removal of the bucket from node1;
- Node1 receives data for the deleted bucket from node2, which was re-received from another replicaset;

The result, apparently, is that either the replication breaks (which would be good), or, what is worse, node2 ends up without the bucket while node1 has it. Also node1 will probably have data with a bucket_id whose bucket is not stored on node1.

This is of course a result of multiple sequential bad reconfiguration steps. But this particular case could be detected.

As a follow-up to #173, it might be feasible to validate bucket state transitions in a _bucket:on_replace() trigger. For example, forbid deleting an ACTIVE bucket right away: it should always go through the SENDING, SENT, and GARBAGE states. Then, in the scenario above, node2 wouldn't accept the bucket removal. Node1 would sync with node2, but node2 wouldn't receive anything from node1. Here node1 has to be rejoined, unfortunately.
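The proposed validation could look roughly like the model below. It is a sketch, not the real trigger: the transition table is a simplified assumption (the real status set also includes e.g. RECEIVING), and `validate_transition` is an invented name. The key point it shows is that deleting an ACTIVE bucket directly is rejected.

```python
# Model of state-transition validation for a _bucket:on_replace()-style
# trigger. DELETED (None) stands for the tuple being removed from
# _bucket; only a GARBAGE bucket may be dropped.

ACTIVE, SENDING, SENT, GARBAGE = "ACTIVE", "SENDING", "SENT", "GARBAGE"
DELETED = None

# Hypothetical simplified subset of the allowed transitions.
ALLOWED = {
    ACTIVE: {SENDING},
    SENDING: {SENT, ACTIVE},   # send completed or rolled back
    SENT: {GARBAGE},
    GARBAGE: {DELETED},        # deletion only after GARBAGE
}

def validate_transition(old_status, new_status):
    """Raise if a replace/delete on _bucket makes an illegal transition."""
    if old_status is None:
        return  # insert of a new bucket tuple
    if new_status not in ALLOWED.get(old_status, set()):
        raise ValueError(
            f"illegal bucket transition: {old_status} -> {new_status}")
```

With such a check, node2 in the scenario above would refuse the replicated deletion of its ACTIVE bucket instead of silently diverging.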

The validation itself is easy to do, but it requires updating almost all the tests.

#347 could also help: updates with a greater generation could be rejected.
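The generation idea could be sketched as a one-line acceptance check. Everything here is an assumption about how #347 might work, not its actual design: the node compares the generation attached to an incoming update with its own and rejects updates tagged with a greater generation.

```python
# Hypothetical generation gate: an update carrying a generation greater
# than the local one is rejected; the node must be reconfigured or
# rejoined before it can apply such updates.

def should_apply(local_gen, update_gen):
    """Return True if an incoming _bucket update may be applied."""
    return update_gen <= local_gen
```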