tikv / pd

Placement driver for TiKV
Apache License 2.0

State should not be sync when replica count is not enough #2487

Open BusyJay opened 4 years ago

BusyJay commented 4 years ago


  1. Start 4 TiKV nodes in asynchronous replication mode: two configured with zone: z1 and the other two with zone: z2.
  2. Wait until all regions are balanced.
  3. Configure label, primary, dr, primary-replicas, dr-replicas.
  4. Enable dr-auto-sync.
  5. Wait until PD reports the state as sync.
  6. Kill all z2 nodes.

After a while, some regions still can't serve requests. Each of them has two replicas placed on the two z2 nodes.
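The failure can be reproduced on paper with a little quorum arithmetic. The following is a minimal sketch, assuming three replicas per region (the 2:1 layout discussed in this thread) and a Raft-style majority quorum. `has_quorum` is a hypothetical helper for illustration, not a PD or TiKV API.

```python
def has_quorum(replica_zones, dead_zones):
    """Return True if a majority of a region's replicas are still alive.

    replica_zones: the zone label of each replica, e.g. ["z1", "z1", "z2"].
    dead_zones: the set of zones whose nodes have all been killed.
    """
    alive = sum(1 for zone in replica_zones if zone not in dead_zones)
    return alive > len(replica_zones) // 2


# Assumed layouts: three replicas per region.
well_placed = ["z1", "z1", "z2"]  # two replicas in z1, one in z2
misplaced = ["z1", "z2", "z2"]    # two replicas in z2, as in this report

# Step 6 of the reproduction: kill all z2 nodes.
print(has_quorum(well_placed, {"z2"}))
print(has_quorum(misplaced, {"z2"}))
```

With z2 down, the well-placed region keeps two of its three replicas and can still elect a leader, while the misplaced region is left with a single replica and cannot serve requests.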

disksing commented 4 years ago

You need other measures to ensure that replicas are 2:1, such as using placement-rules, or adding a rack label under the zone and ensuring that the number of racks is 2:1.

BusyJay commented 4 years ago

I think how to make sure it's 2:1 and when to mark it as sync are two different questions. If there is no other means to make it actually 2:1, PD should keep marking the state as async or sync_recover.
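The proposed rule can be sketched as a check over every region's replica placement. This is only an illustration of the idea under stated assumptions, not how PD decides its state: `may_report_sync` and its parameters are hypothetical names, and the region layouts are made up for the example.

```python
from collections import Counter


def may_report_sync(regions, primary, dr, primary_replicas, dr_replicas):
    """Return True only if every region has the configured replica split.

    regions: a list of regions, each given as a list of zone labels,
             one label per replica (hypothetical representation).
    """
    for zones in regions:
        counts = Counter(zones)
        if counts[primary] != primary_replicas or counts[dr] != dr_replicas:
            # At least one region is not 2:1, so under the proposal the
            # state would stay async or sync_recover instead of sync.
            return False
    return True


regions = [
    ["z1", "z1", "z2"],  # matches the configured 2:1 split
    ["z1", "z2", "z2"],  # two replicas in the dr zone
]
print(may_report_sync(regions, "z1", "z2", 2, 1))
```

Under this rule, the second region alone is enough to keep the cluster from being reported as sync.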

disksing commented 4 years ago

I think it is OK to say the state is sync. sync means both DCs have the full data, which is true. The state can be sync even when some regions cannot serve requests.

BusyJay commented 4 years ago

Then there should be another state that shows whether the cluster is guaranteed to be able to fall back to asynchronous replication. The design of synchronous replication is not just about sync, but also about the ability to fall back to async within a given time. For users like me who have no idea how PD actually works, this is a common misunderstanding.