pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0

DDL hangs, DML reports PD server timeout and many regions miss peers #22117

Closed: cosven closed this issue 3 years ago

cosven commented 3 years ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. Use tiup to deploy a nightly-version TiDB cluster with 1 TiDB / 1 PD / 1 TiKV, using the configs below.
  2. Kill PD/TiKV/TiDB multiple times.

pd.toml

[schedule]
leader-schedule-limit = 4
region-schedule-limit = 2048
replica-schedule-limit = 64

tidb.toml

[binlog]
enable = false
ignore-error = false

[log]
slow-threshold = 300

tikv.toml

[readpool]
[readpool.coprocessor]
use-unified-pool = true
[readpool.storage]
use-unified-pool = false
  3. Recover all chaos.
  4. Use netcat to check the ports (PD: 2379, TiKV: 20160, TiDB: 4000); they all respond. (A Go version of this check is sketched below.)
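For reference, a minimal Go sketch of the same port-liveness check, assuming all components listen on 127.0.0.1 with the ports above:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Ports from the reproduce step: PD, TiKV, TiDB.
	ports := map[string]string{"pd": "2379", "tikv": "20160", "tidb": "4000"}
	for name, port := range ports {
		addr := net.JoinHostPort("127.0.0.1", port)
		conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
		if err != nil {
			fmt.Printf("%s %s: unreachable (%v)\n", name, addr, err)
			continue
		}
		conn.Close()
		fmt.Printf("%s %s: open\n", name, addr)
	}
}
```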

2. What did you expect to see? (Required)

The cluster works well after the chaos recovers.

3. What did you see instead (Required)

The cluster does not work anymore.

  1. DDL hangs

    [2020/12/31 00:33:07.299 +00:00] [WARN] [backoff.go:342] ["pdRPC backoffer.maxSleep 40000ms is exceeded, errors:\nregion not found for key \"t\\x80\\x00\\x00\\x00\\x00\\x00\\n\\x8e_r\" at 2020-12-31T00:33:02.638778103Z\nregion not found for key \"t\\x80\\x00\\x00\\x00\\x00\\x00\\n\\x8e_r\" at 2020-12-31T00:33:04.754918433Z\nregion not found for key \"t\\x80\\x00\\x00\\x00\\x00\\x00\\n\\x8e_r\" at 2020-12-31T00:33:07.299888834Z"]
    [2020/12/31 00:33:07.300 +00:00] [ERROR] [ddl_worker.go:720] ["[ddl] run DDL job error"] [worker="worker 2, tp add index"] [error="[tikv:9001]PD server timeout"]
    [2020/12/31 00:33:07.306 +00:00] [INFO] [ddl_worker.go:499] ["[ddl] run DDL job failed, sleeps a while then retries it."] [worker="worker 2, tp add index"] [waitTime=1s] [error="[tikv:9001]PD server timeout"]
    [2020/12/31 00:33:07.744 +00:00] [INFO] [ddl_worker.go:569] ["[ddl] DDL job need to wait dependent job, sleeps a while, then retries it."] [worker="worker 1, tp general"] [jobID=2875] [dependentJobID=2801] [waitTime=200ms]
    [2020/12/31 00:33:08.306 +00:00] [INFO] [ddl_worker.go:774] ["[ddl] schema version doesn't change"] [worker="worker 2, tp add index"]
    [2020/12/31 00:33:08.308 +00:00] [INFO] [ddl_worker.go:599] ["[ddl] run DDL job"] [worker="worker 2, tp add index"] [job="ID:2801, Type:add index, State:running, SchemaState:write reorganization, SchemaID:1, TableID:2702, RowCount:0, ArgLen:0, start time: 2020-12-30 21:52:02.806 +0000 UTC, Err:[tikv:9001]PD server timeout, ErrCount:229, SnapshotVersion:0"]
    [2020/12/31 00:33:10.783 +00:00] [INFO] [ddl_worker.go:569] ["[ddl] DDL job need to wait dependent job, sleeps a while, then retries it."] [worker="worker 1, tp general"] [jobID=2875] [dependentJobID=2801] [waitTime=200ms]
    [2020/12/31 00:33:13.820 +00:00] [INFO] [ddl_worker.go:569] ["[ddl] DDL job need to wait dependent job, sleeps a while, then retries it."] [worker="worker 1, tp general"] [jobID=2875] [dependentJobID=2801] [waitTime=200ms]
    ^C
  2. DML reports PD server timeout

    
    mysql> begin;
    Query OK, 0 rows affected (0.00 sec)

    mysql> select * from v limit 1;
    ERROR 9001 (HY000): PD server timeout


  3. pd-ctl region check reports regions with missing peers

root@ce61d5126d28:/disk1/deploy# ~/.tiup/components/ctl/v4.0.9/pd-ctl region check miss-peer | jq .count
512
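The same count can also be fetched from PD's HTTP API directly instead of shelling out to pd-ctl. A small Go sketch, assuming the endpoint path that pd-ctl uses (`/pd/api/v1/regions/check/miss-peer`; the exact path may vary by PD version) and that the response carries a `count` field:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Endpoint assumed from pd-ctl's behavior; may differ across versions.
	resp, err := http.Get("http://127.0.0.1:2379/pd/api/v1/regions/check/miss-peer")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Count int `json:"count"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("miss-peer region count:", out.Count)
}
```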


4. What is your TiDB version? (Required)


pd

root@ce61d5126d28:/disk1/deploy# /disk1/deploy/pd-2379/bin/pd-server -V
Release Version: v4.0.0-rc.2-411-gb6d036bf
Edition: Community
Git Commit Hash: b6d036bfda6ff000d8072163284bc198175477cc
Git Branch: master
UTC Build Time: 2020-12-28 05:55:51


tidb

root@ce61d5126d28:/disk1/deploy# /disk1/deploy/tidb-4000/bin/tidb-server -V
Release Version: v4.0.0-beta.2-1921-g5e67a597c
Edition: Community
Git Commit Hash: 5e67a597ccdd8220f40d69bc601f1b664949f885
Git Branch: master
UTC Build Time: 2020-12-28 07:22:38
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false


tikv

root@ce61d5126d28:/disk1/deploy# /disk1/deploy/tikv-20160/bin/tikv-server -V
TiKV
Release Version: 4.1.0-alpha
Edition: Community
Git Commit Hash: bd5a30bb0356edde71972e0f33316be4ac1973de
Git Commit Branch: master
UTC Build Time: 2020-12-28 13:21:51
Rust Version: rustc 1.49.0-nightly (b1496c6e6 2020-10-18)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec test-engines-rocksdb
Profile: dist_release

cosven commented 3 years ago

pd.log goroutine-pd.txt goroutine-tidb.txt

TiDB and TiKV log files are too big to upload.

AndreMouche commented 3 years ago

These versions are quite different from each other; I am not sure they can work together. However, I found a related entry in the PD log:

[WARN] [util.go:79] ["PD version less than cluster version, please upgrade PD"] [PD-version=4.0.0-rc.2-411-gb6d036bf] [cluster-version=4.1.0-alpha]
AndreMouche commented 3 years ago

I think PD's failure to work may be caused by this: PD couldn't campaign for leadership because the new leader's information is identical to that of the old, deleted leader, and in this situation, with only 1 PD, we shouldn't campaign again. Possibly related code: https://github.com/tikv/pd/blob/master/server/member/member.go#L171

[2020/12/29 12:42:51.242 +00:00] [ERROR] [member.go:176] ["deleting pd leader key meets error"] [error="[PD:etcd:ErrEtcdTxn]etcd Txn failed"]
[2020/12/29 12:42:51.443 +00:00] [WARN] [member.go:174] ["the pd leader has not changed, delete and campaign again"] [old-pd-leader="name:\"pd-10.0.2.36-2379\" member_id:14280389838364704182 peer_urls:\"http://10.0.2.36:2380\" client_urls:\"http://10.0.2.36:2379\" "]
[2020/12/29 12:42:51.444 +00:00] [ERROR] [member.go:176] ["deleting pd leader key meets error"] [error="[PD:etcd:ErrEtcdTxn]etcd Txn failed"]
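
To make the suspected loop concrete, here is a simplified, self-contained Go simulation of the delete-and-campaign behavior described above. It is not PD's actual code; the member name is taken from the log, and the always-failing deleteLeaderKey stands in for the etcd Txn:

```go
package main

import (
	"errors"
	"fmt"
)

type member struct{ name string }

var errEtcdTxn = errors.New("[PD:etcd:ErrEtcdTxn]etcd Txn failed")

// deleteLeaderKey stands in for the etcd Txn that removes the leader key;
// here it always fails, matching the log above.
func deleteLeaderKey() error { return errEtcdTxn }

func main() {
	self := member{name: "pd-10.0.2.36-2379"}
	oldLeader := &member{name: "pd-10.0.2.36-2379"} // persisted leader == self

	for i := 0; i < 3; i++ { // bounded here; the real loop retries indefinitely
		if oldLeader != nil && oldLeader.name == self.name {
			fmt.Println("the pd leader has not changed, delete and campaign again")
			if err := deleteLeaderKey(); err != nil {
				fmt.Println("deleting pd leader key meets error:", err)
				continue // the failing Txn keeps the single PD stuck here
			}
		}
		fmt.Println("campaigning for leadership")
		break
	}
}
```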
AndreMouche commented 3 years ago

> I think PD's failure to work may be caused by this: PD couldn't campaign for leadership because the new leader's information is identical to that of the old, deleted leader, and in this situation, with only 1 PD, we shouldn't campaign again. Possibly related code: tikv/pd@master/server/member/member.go#L171
>
> [2020/12/29 12:42:51.242 +00:00] [ERROR] [member.go:176] ["deleting pd leader key meets error"] [error="[PD:etcd:ErrEtcdTxn]etcd Txn failed"]
> [2020/12/29 12:42:51.443 +00:00] [WARN] [member.go:174] ["the pd leader has not changed, delete and campaign again"] [old-pd-leader="name:\"pd-10.0.2.36-2379\" member_id:14280389838364704182 peer_urls:\"http://10.0.2.36:2380\" client_urls:\"http://10.0.2.36:2379\" "]
> [2020/12/29 12:42:51.444 +00:00] [ERROR] [member.go:176] ["deleting pd leader key meets error"] [error="[PD:etcd:ErrEtcdTxn]etcd Txn failed"]

It is not the critical reason; please ignore it. Sorry.

ti-srebot commented 3 years ago

Please edit this comment or add a new comment to complete the following information:

Not a bug

  1. Remove the 'type/bug' label
  2. Add notes to indicate why it is not a bug

Duplicate bug

  1. Add the 'type/duplicate' label
  2. Add the link to the original bug

Bug

Note: Make sure the 'component' and 'severity' labels are added. Example of how to fill out the template: https://github.com/pingcap/tidb/issues/20100

1. Root Cause Analysis (RCA) (optional)

After https://github.com/tikv/pd/pull/3305, the ID allocator may allocate duplicate IDs after a restart, which leaves a hole in the regions' key space; a key that falls into that hole cannot find its corresponding region.
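
A conceptual sketch of this failure mode (not PD's actual allocator; the batching and persistence here are purely illustrative): an allocator that reserves IDs in batches and persists only the batch's end can re-issue old IDs if the persisted high-water mark is lost or rewound across a restart.

```go
package main

import "fmt"

// A toy batch allocator: IDs are handed out from memory, and only the end
// of the current batch (the high-water mark) is persisted.
const step = 100

var persisted uint64 // stands in for the value stored in etcd

type allocator struct {
	cur, end uint64
}

func (a *allocator) alloc() uint64 {
	if a.cur >= a.end {
		// Reserve a new batch starting from the persisted mark.
		a.cur = persisted
		a.end = persisted + step
		persisted = a.end
	}
	a.cur++
	return a.cur
}

func main() {
	a := &allocator{}
	fmt.Println(a.alloc(), a.alloc()) // 1 2

	// Simulate a restart in which the persisted mark is lost or rewound:
	// the fresh allocator re-issues IDs 1 and 2, i.e. duplicate region IDs.
	persisted = 0
	b := &allocator{}
	fmt.Println(b.alloc(), b.alloc()) // 1 2 again
}
```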

2. Symptom (optional)

3. All Trigger Conditions (optional)

Regions have duplicate IDs.

4. Workaround (optional)

5. Affected versions

master after https://github.com/tikv/pd/pull/3305

6. Fixed versions

master after https://github.com/tikv/pd/pull/3322

ti-srebot commented 3 years ago

(FixedVersions, AffectedVersions) fields are empty.