pingcap / tidb

support deterministic failover schedule for placement rules #37251

Open morgo opened 2 years ago

morgo commented 2 years ago

Enhancement

My deployment scenario involves two "primary" regions in AWS: us-east-1 and us-west-2.

I have been experimenting with placement rules with a third region: us-east-2. This region should only be used for quorum, as there are no application servers hosted in it. So I define a placement policy as follows:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2";

Because the pd-server supports a weight concept, when us-east-1 fails I can make the PD leader election deterministic, so that us-west-2 becomes the PD leader. However, there is no deterministic behavior for where the leaders of regions under defaultpolicy will go. They will likely balance across us-west-2 and us-east-2, which is not the desired behavior.
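
(For the PD leader part, the knob I mean is the member priority in pd-ctl; a rough sketch, with placeholder member names:)

pd-ctl member leader_priority pd-us-east-1 5 // prefer us-east-1 while it is healthy
pd-ctl member leader_priority pd-us-west-2 4 // next in line on failover
pd-ctl member leader_priority pd-us-east-2 3 // quorum-only region, lowest priority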

Ideally I want the leader priority to follow the order of the region list. This means that us-west-2 would become the new leader for all regions. Perhaps this could be conveyed with syntax like:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2" SCHEDULE="DETERMINISTIC";

In fact, if this worked deterministically for both leader scheduling and follower scheduling, an extension of this is that I could create the following:

CREATE PLACEMENT POLICY `defaultpolicy` PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2,us-west-1" SCHEDULE="DETERMINISTIC";

Since the default number of followers is 2, this would mean that us-west-1 won't get regions scheduled to it unless one of the other regions fails, which suits me perfectly. It also means that commit latency is only bad initially, when failover to us-west-2 first occurs. Over time, as regions are migrated to us-west-1, performance should be ~restored because quorum can be achieved on the west coast.

This is a really common deployment pattern in the continental USA, so I'm hoping it can be implemented :-)

morgo commented 1 year ago

An alternative to this proposal is to use the leader-weight property that pd can set on stores. But it currently doesn't work as expected:

  1. Assume I have a placement group of PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-west-2,us-east-2"
  2. I set the leader weight to zero on all stores in us-east-2 (see the pd-ctl sketch after this list).
  3. When us-east-1 fails, leaders are randomly scattered across us-west-2 and us-east-2.
  4. The balance-leader scheduler does not apply until the cluster is healthy again, preventing failover.
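
(A rough sketch of step 2, with placeholder store IDs; pd-ctl store weight takes <store_id> <leader_weight> <region_weight>:)

pd-ctl store weight 4 0 1 // a us-east-2 store: leader weight 0, keep region weight at 1
pd-ctl store weight 5 0 1 // repeat for each store in us-east-2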

The reason is that in (3) the new leader is chosen by an election within the TiKV raft group, which has no knowledge of (or concern for) leader-weight. But what I would like to suggest is that if a heartbeat is received from a leader on a store with zero leader-weight, a forced leader transfer occurs.

I took a look at a quick hack to do this, but it didn't work :-) I'm hoping someone who knows pd better can help here.

nolouch commented 1 year ago

Hi @morgo, if you set the leader weight to zero, the score calculation becomes count/weight (with the weight floored at 1e-6). The balance-leader scheduler will transfer leaders from us-east-2 to us-west-2 in step 3, but it may not be able to transfer out all the leaders, because the balance-leader scheduler's goal is only to balance the scores.
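
(To illustrate with made-up numbers, not actual PD output: a zero-weight store's score dwarfs a weight-1 store's, so the scheduler moves leaders away, but only until the scores are roughly balanced, not until the store is empty.)

echo "50 / 0.000001" | bc -l   # 50 leaders, weight floored to 1e-6 -> score 50000000
echo "50 / 1" | bc -l          # 50 leaders, weight 1 -> score 50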

An alternative method is to use label-property. Rather than relying on the leader score, it will always transfer leaders out of the reject-leader stores to other stores. The operators:

pd-ctl scheduler add label-scheduler // activate the label-scheduler
pd-ctl config set label-property reject-leader region us-east-2 // all TiKV leaders will transfer out of us-east-2 to the other regions (us-west-2) on failover

you can check the implementation in: https://github.com/tikv/pd/blob/master/server/schedulers/label.go#L117-L124
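
(If the property needs to be removed later, pd-ctl has a matching delete form; same key/value as whatever was set above:)

pd-ctl config delete label-property reject-leader region us-east-2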

morgo commented 1 year ago

This is great! Thank you @nolouch

nolouch commented 1 year ago

I tested this scenario with the rule and this scheduler, and found that the label-scheduler does not work as well as we expected. The log:

[2022/09/02 11:45:59.171 +08:00] [DEBUG] [label.go:139] ["fail to create transfer label reject leader operator"] [error="cannot create operator: target leader is not allowed"]

It shows an attempt to create an operator that failed. The cause is that the placement rule explicitly specifies that this store should hold followers, so the error is reasonable.

After I changed the policy, it works.

For failover, the placement policy should change from:

CREATE PLACEMENT POLICY primary_east PRIMARY_REGION="us-east-1" REGIONS="us-east-1,us-east-2,us-west-2";

to

CREATE PLACEMENT POLICY primary_east_2 LEADER_CONSTRAINTS="[+region=us-east-1]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}";
The difference between them can be checked from the raw rules in PD. The raw rule in PD will change from:

```
{
  "group_id": "TiDB_DDL_71",
  "id": "table_rule_71_0",
  "index": 40,
  "start_key": "7480000000000000ff4700000000000000f8",
  "end_key": "7480000000000000ff4800000000000000f8",
  "role": "voter",
  "count": 1,
  "label_constraints": [
    { "key": "region", "op": "in", "values": [ "us-east-1" ] },
    { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
  ],
  "create_timestamp": 1662089499
},
{
  "group_id": "TiDB_DDL_71",
  "id": "table_rule_71_1",
  "index": 40,
  "start_key": "7480000000000000ff4700000000000000f8",
  "end_key": "7480000000000000ff4800000000000000f8",
  "role": "follower",
  "count": 2,
  "label_constraints": [
    { "key": "region", "op": "in", "values": [ "us-east-2", "us-west-1" ] },
    { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
  ]
},
```

to

```
{
  "group_id": "TiDB_DDL_71",
  "id": "table_rule_71_0",
  "index": 40,
  "start_key": "7480000000000000ff4700000000000000f8",
  "end_key": "7480000000000000ff4800000000000000f8",
  "role": "leader",
  "count": 1,
  "label_constraints": [
    { "key": "region", "op": "in", "values": [ "us-east-1" ] },
    { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
  ],
  "version": 1,
  "create_timestamp": 1662089499
},
{
  "group_id": "TiDB_DDL_71",
  "id": "table_rule_71_1",
  "index": 40,
  "start_key": "7480000000000000ff4700000000000000f8",
  "end_key": "7480000000000000ff4800000000000000f8",
  "role": "voter",
  "count": 1,
  "label_constraints": [
    { "key": "region", "op": "in", "values": [ "us-east-2" ] },
    { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
  ],
  "version": 1,
  "create_timestamp": 1662089499
},
{
  "group_id": "TiDB_DDL_71",
  "id": "table_rule_71_2",
  "index": 40,
  "start_key": "7480000000000000ff4700000000000000f8",
  "end_key": "7480000000000000ff4800000000000000f8",
  "role": "voter",
  "count": 1,
  "label_constraints": [
    { "key": "region", "op": "in", "values": [ "us-west-1" ] },
    { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
  ],
  "create_timestamp": 1662092753
},
```

nolouch commented 1 year ago

Hi @morgo, an easier way is to just use one policy (works across 3 regions), like:

CREATE PLACEMENT POLICY  primary_east_backup_west LEADER_CONSTRAINTS="[+region=us-east-1,-region=us-east-2]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"
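
(A hedged usage sketch: after creating it, the policy can be attached to a table and the stored definition read back; the table name here is just an example.)

mysql -e "ALTER TABLE test.t1 PLACEMENT POLICY=primary_east_backup_west;"
mysql -e "SHOW CREATE PLACEMENT POLICY primary_east_backup_west\G"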

If you want to apply it at the cluster level, I think we can use a raw placement rule, for example:

```
[
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_0_primary",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "leader",
    "count": 1,
    "label_constraints": [
      { "key": "region", "op": "in", "values": [ "us-east-1" ] },
      { "key": "region", "op": "notIn", "values": [ "us-east-2" ] },
      { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
    ]
  },
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_1_us_west_2",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 1,
    "label_constraints": [
      { "key": "region", "op": "in", "values": [ "us-west-2" ] },
      { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
    ]
  },
  {
    "group_id": "cluster_rule",
    "id": "cluster_rule_2_us_east_2",
    "index": 500,
    "start_key": "",
    "end_key": "",
    "role": "follower",
    "count": 1,
    "label_constraints": [
      { "key": "region", "op": "in", "values": [ "us-east-2" ] },
      { "key": "engine", "op": "notIn", "values": [ "tiflash" ] }
    ]
  }
]
```

morgo commented 1 year ago

If you want to apply it at the cluster level, I think we can use a raw placement rule, for example:

I'd prefer to keep it in SQL rules, so it's easier for other users on my team to change them if needed. It's okay though; the only other schema I need to change is mysql. This is actually important because SHOW VARIABLES reads from the mysql.tidb table for the GC variables. Since various client libraries run SHOW VARIABLES LIKE 'x' on a new connection, if this table isn't in the primary region, it is going to cause performance problems.

It can be done with:

mysql -e "ALTER DATABASE mysql PLACEMENT POLICY=defaultpolicy;"
for TABLE in `mysql mysql -BNe "SHOW TABLES"`; do
  mysql mysql -e "ALTER TABLE $TABLE PLACEMENT POLICY=defaultpolicy;"
done;
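
(A hedged way to double-check the result afterwards: SHOW PLACEMENT lists each object's policy and its scheduling state.)

mysql -e "SHOW PLACEMENT;"
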
nolouch commented 1 year ago

Well, do you think a system level is needed in placement rules in SQL? Would it be friendlier for your scenario? Actually, if we support the system level, there is only one rule in PD, but in the current way it will create many rules in PD, about 3 raw rules per table. I worry about the burden of too many rules.

morgo commented 1 year ago

This is essentially this feature request: https://github.com/pingcap/tidb/issues/29677

There are some strange behaviors that need to be worked out, but yes: I think the system-level feature has merit.

nolouch commented 1 year ago

Hi @morgo, an easier way is to just use one policy (works across 3 regions), like:

CREATE PLACEMENT POLICY  primary_east_backup_west LEADER_CONSTRAINTS="[+region=us-east-1,-region=us-east-2]" FOLLOWER_CONSTRAINTS="{+region=us-east-2: 1,+region=us-west-2: 1}"

If you want to apply it at the cluster level, I think we can use a raw placement rule, for example:

@morgo I confirmed that this placement policy cannot achieve automatic switching. The problem is that the placement policy needs to distinguish voter from follower: setting a follower raw rule for us-east-2 is what prevents it from becoming leader. So if you want to use SQL, you still need to use the label scheduler as described in https://github.com/pingcap/tidb/issues/37251#issuecomment-1234980892.

nolouch commented 1 year ago

BTW, do you need to set the placement policy for the metadata? I think metadata access, like reading schema information, may cause performance problems.

nolouch commented 1 year ago

BTW, do you need to set the placement policy for the metadata? I think metadata access, like reading schema information, may cause performance problems.

I really suggest using a cluster-level setting with raw placement rules for this scenario for now; in my tests it hit fewer problems. Use pd-ctl to set rules for the key range from "" to "", like:

// set up the rule group: https://docs.pingcap.com/tidb/dev/configure-placement-rules#use-pd-ctl-to-configure-rule-groups
>> pd-ctl config placement-rules rule-group set cluster_rule 2 true
>> cat cluster_rule_group.json
{
    "group_id": "cluster_rule",
    "group_index": 2,
    "group_override": true,
    "rules": [
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_0_primary_leader",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "leader",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-1"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_1_primary_voter",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "voter",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-1"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_3_us_east_2",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "voter",
            "count": 2,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-east-2"
                    ]
                }
            ]
        },
        {
            "group_id": "cluster_rule",
            "id": "cluster_rule_2_us_west_2",
            "index": 1,
            "start_key": "",
            "end_key": "",
            "role": "follower",
            "count": 1,
            "label_constraints": [
                {
                    "key": "region",
                    "op": "in",
                    "values": [
                        "us-west-2"
                    ]
                }
            ]
        }
    ]
}

// apply the rules for the group: https://docs.pingcap.com/tidb/dev/configure-placement-rules#use-pd-ctl-to-batch-update-groups-and-rules-in-groups
>>  pd-ctl config placement-rules rule-bundle set cluster_rule --in="cluster_rule_group.json"
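
// read the bundle back to verify what was applied (a hedged sketch; rule-bundle get is the matching read command for the set above)
>> pd-ctl config placement-rules rule-bundle get cluster_rule
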
tonyxuqqi commented 1 year ago

I confirmed that this placement policy cannot achieve automatic switching. The problem is that the placement policy needs to distinguish voter from follower: setting a follower raw rule for us-east-2 is what prevents it from becoming leader.

@nolouch Is this follower rule enforced on the PD side or in the TiKV raft protocol?

morgo commented 1 year ago

@nolouch Happy to try with one policy. I'm getting an error with what you pasted above though :(

$ ./bin/pd-ctl config placement-rules rule-bundle set cluster_rule --in="cluster_rule_group.json"
json: cannot unmarshal array into Go value of type struct { GroupID string "json:\"group_id\"" }

Using pd-ctl from v6.2.0.

nolouch commented 1 year ago

@morgo Sorry, I updated the comment in https://github.com/pingcap/tidb/issues/37251#issuecomment-1246155328. You can try again.

kolbe commented 1 year ago

To be used in a Kubernetes environment (until https://github.com/pingcap/tidb-operator/issues/4678 is implemented), "region" should be changed to "topology.kubernetes.io/region".
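
(In other words, each label constraint in the raw rules above would become something like the following; this is just the same constraint with the key swapped:)

{ "key": "topology.kubernetes.io/region", "op": "in", "values": [ "us-east-1" ] }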

SunRunAway commented 1 year ago

@morgo Since you never read from us-east-2, would it be better to set us-east-2 as a witness, if possible?

kolbe commented 1 year ago

@SunRunAway what is "witness"? This is not mentioned anywhere in our documentation.

SunRunAway commented 1 year ago

@kolbe I'm discussing a developing feature, see https://github.com/tikv/tikv/issues/12876