pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0

Data may be stored on only a single server shortly after startup #10657

Open aphyr opened 5 years ago

aphyr commented 5 years ago

TiDB 3.0.0-rc.2, by design, starts up with a region with only a single replica, regardless of the configured target number of replicas. PD then gradually adds additional replicas until the target replica count is reached. In addition, any regions which are split from this initial region also start with the same number of replicas as the region they were split from, until PD can expand them.

This is not a problem in itself, but it does lead to an awkward possibility: in the early stages of a TiDB cluster, data may be acknowledged, but stored only on a single node, when the user expected that data to be replicated to multiple nodes. A single-node failure during that period could destroy acknowledged writes, or render the cluster partly, or totally, unusable. In our tests with Jepsen (which use small regions and are extra-sensitive to this phenomenon), a single-node network partition as late as 500 seconds into the test can result in a total outage, because some regions are only replicated to 1 or 2, rather than 3, nodes.

The configuration parameter for replica count is called max-replicas, which is sort of an odd name, because regions could have exactly max-replicas (the happy case), fewer than max-replicas (e.g. PD hasn't gotten around to resizing that region yet), or more than max-replicas (e.g. during handoff when a node is declared dead). It might be best to call this target-replicas?
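
For reference, a minimal sketch of where that knob lives today, assuming PD's stock config.toml layout (the [replication] section name and the default of 3 come from PD's documentation, not from anything above):

# pd.toml (sketch)
[replication]
# The number of replicas each region should eventually have; despite the
# "max" in the name, PD treats this as the target replica count.
max-replicas = 3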

I'd also like to suggest that when a cluster has a configured replica count, TiDB should disallow transactions on regions which don't have at least that many replicas in their Raft group. That'd prevent the possibility of a single-node failure destroying committed data, which is something I'm pretty sure users don't expect to be possible!

aphyr commented 5 years ago

I've been experimenting with different settings to see how long it takes for all regions to be fully replicated. With the settings we were using (region-max-keys = 5 & region-split-keys = 2), it could take over an hour for the cluster to converge. While increasing the region size reduces the number of regions, and generally speeds up convergence, it still looks like TiDB can take 80+ minutes to fully replicate regions with region-max-keys = 500 and region-split-keys = 400.
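
For context, a sketch of the TiKV settings being varied here, assuming they sit in TiKV's [coprocessor] section as in the stock tikv.toml (the section name is an assumption; the values are the ones quoted above):

# tikv.toml (sketch)
[coprocessor]
# Once a region holds more than region-max-keys keys, TiKV splits it into
# regions of roughly region-split-keys keys each.
region-max-keys   = 500   # 5 in the earlier, extra-sensitive runs
region-split-keys = 400   # 2 in the earlier, extra-sensitive runs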

In this case, it looks like most regions got to 3 or 2 replicas early in the test, but for some reason new regions keep getting created and old ones destroyed, while no progress is being made on the original regions. I've set up Jepsen 0268340 to log region IDs and replica counts: in an hour and 20 minutes, we go from

2019-05-31 12:19:50,727{GMT}    INFO    [jepsen node n1] tidb.db: :region-replicas {4 1,
 16 3,
 18 3,
 20 3,
 22 3,
 24 3,
 28 3,
 30 3,
 32 3,
 34 3,
 36 3,
 38 2,
 40 3,
 42 2,
 44 2,
 46 1,
 48 1,
 208 3}

to

INFO [2019-05-31 13:40:03,747] jepsen node n2 - tidb.db :region-replicas {4 2,
 16 3,
 18 3,
 20 3,
 22 3,
 24 3,
 28 3,
 30 3,
 32 3,
 34 3,
 36 3,
 38 2,
 40 3,
 42 2,
 44 2,
 46 2,
 48 2,
 4055 3}

That final region, id=4055, gets replaced by a new, higher region every few seconds. I'm not sure why this is the case--we're not actually making any writes, or even connecting clients to this cluster at this point. It looks pretty well stuck. Full logs are here: 20190531T121838.000-0400.zip

With Jepsen 0268340, try something like this to reproduce. It may take a few runs--it doesn't seem to get stuck every time.

lein run test -w append --time-limit 30 --concurrency 2n --version 3.0.0-rc.2 --nemesis partition-pd-leader --nemesis-interval 10 --nemesis-schedule fixed --test-count 30

siddontang commented 5 years ago

PTAL @nolouch

nolouch commented 5 years ago

@aphyr How do you check the bootstrap step? You say it takes 80+ minutes, but I found log lines showing it cost only about one minute, like region 46:

[2019/05/31 09:19:49.294 -07:00] [INFO] [operator_controller.go:215] ["add operator"] [region-id=46] [operator="\"make-up-replica (kind:region,replica, region:46(17,1), createAt:2019-05-31 09:19:49.29461602 -0700 PDT m=+62.228962485, cu
[2019/05/31 09:19:49.294 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=create]
[2019/05/31 09:19:49.308 -07:00] [INFO] [cluster_info.go:546] ["region ConfVer changed"] [region-id=46] [detail="Add peer:{id:214 store_id:9 is_learner:true }"] [old-confver=1] [new-confver=2]
[2019/05/31 09:19:49.308 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=heartbeat]
[2019/05/31 09:19:53.328 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="add learner peer 214 on store 9"] [source=heartbeat]
[2019/05/31 09:19:53.396 -07:00] [INFO] [operator_controller.go:317] ["send schedule command"] [region-id=46] [step="promote learner peer 214 on store 9 to voter"] [source=heartbeat]
[2019/05/31 09:19:53.413 -07:00] [INFO] [cluster_info.go:546] ["region ConfVer changed"] [region-id=46] [detail="Remove peer:{id:214 store_id:9 is_learner:true },Add peer:{id:214 store_id:9 }"] [old-confver=2] [new-confver=3]
[2019/05/31 09:19:53.413 -07:00] [INFO] [operator_controller.go:93] ["operator finish"] [region-id=46] [operator="\"make-up-replica (kind:region,replica, region:46(17,1), createAt:2019-05-31 09:19:49.29461602 -0700 PDT m=+62.228962485,

You can grep for make-up-replica and find that the last such log entry is at 2019/05/31 09:20:01, while PD's start time is:

[2019/05/31 09:18:47.077 -07:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]

A better way to bootstrap the cluster is to wait for the first region to make up its replicas to the target count, and only then start TiDB, because TiDB will actively split regions by creating some system tables even without any writes.

aphyr commented 5 years ago

How do you check the bootstrap step

We check for bootstrapping by performing an HTTP GET of PD's /regions API after starting PD, TiKV, and TiDB, and checking to ensure that all regions have at least 3 members which are not flagged with is_learner. See db.clj for details.
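
In case it's useful, here's a rough Go sketch of that check. This is a minimal version, not the actual db.clj code; the /pd/api/v1/regions path and the JSON field names are assumptions based on PD's HTTP API:

// replication_check.go: flag any region whose number of non-learner
// (voter) peers is below the target replica count.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type peer struct {
	ID        uint64 `json:"id"`
	StoreID   uint64 `json:"store_id"`
	IsLearner bool   `json:"is_learner"` // omitted by PD when false
}

type region struct {
	ID    uint64 `json:"id"`
	Peers []peer `json:"peers"`
}

type regionsResponse struct {
	Count   int      `json:"count"`
	Regions []region `json:"regions"`
}

func main() {
	const target = 3 // max-replicas

	// Assumed endpoint; 2379 is PD's default client port.
	resp, err := http.Get("http://127.0.0.1:2379/pd/api/v1/regions")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rr regionsResponse
	if err := json.NewDecoder(resp.Body).Decode(&rr); err != nil {
		panic(err)
	}

	for _, r := range rr.Regions {
		voters := 0
		for _, p := range r.Peers {
			if !p.IsLearner {
				voters++
			}
		}
		if voters < target {
			fmt.Printf("region %d has only %d voters\n", r.ID, voters)
		}
	}
}

We treat the cluster as bootstrapped only once a check like this reports nothing for any region.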

I found log lines showing it cost only about one minute, like region 46

Region 46 starting quickly is great, but I'm concerned that the highest region never seems to stabilize--do you know what might be going on there?

A better way to bootstrap the cluster is to wait for the first region to make up its replicas to the target count, and only then start TiDB, because TiDB will actively split regions by creating some system tables even without any writes.

OK! I'll rewrite the setup code to block until regions are fully replicated before starting TiDB. Do you think a running TiDB could be preventing the final region from converging?

nolouch commented 5 years ago

The final region always covers the last range, [last_key, ""). So if a new table is created, or a record is inserted with a bigger primary key than the last one in the most recently created table, its encoded key must be bigger than last_key, and the range will split into two ranges, [last_key, new_encode_key) and [new_encode_key, ""), and the id of the final region will be renewed.
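
To make that concrete, here is a toy Go sketch (not TiKV code; the region ids and keys are made up) of why the highest region id keeps changing: the last region always extends to the end of the keyspace, so each new encoded key above last_key splits it, and the piece that still reaches the end comes back with a fresh, higher id.

// last_region_split.go: toy illustration of the behaviour described above.
package main

import "fmt"

type region struct {
	id       uint64
	startKey string
	endKey   string // "" means "to the end of the keyspace", as in the PD API
}

func main() {
	nextID := uint64(4055) // arbitrary starting id for illustration
	last := region{id: nextID, startKey: "t_100", endKey: ""}

	// Simulate TiDB creating system tables (or inserting records) whose
	// encoded keys sort above the current last_key.
	for _, newKey := range []string{"t_101", "t_102", "t_103"} {
		left := region{id: last.id, startKey: last.startKey, endKey: newKey}
		nextID += 2 // ids only grow; the increment itself is arbitrary
		last = region{id: nextID, startKey: newKey, endKey: ""}
		fmt.Printf("split: [%q, %q) id=%d  |  [%q, \"\") id=%d\n",
			left.startKey, left.endKey, left.id, last.startKey, last.id)
	}
}

Under this explanation, the "stuck" final region isn't one region failing to replicate, but a sequence of short-lived regions, each replaced as soon as TiDB encodes another key past last_key.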