pingcap / tidb

TiDB - the open-source, cloud-native, distributed SQL database designed for modern applications.
https://pingcap.com
Apache License 2.0

add index failed with error “Error 1105 (HY000): get member failed” when injecting a network partition between the PD leader and all PD followers #52807

Open Lily2025 opened 7 months ago

Lily2025 commented 7 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1. Set tidb_enable_dist_task='off'.
2. Run sysbench.
3. Add an index to one of the tables.
4. Inject a network partition between the PD leader and all PD followers.

tidb logs: tidb-0.tar.gz tidb-1.tar.gz
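For reference, steps 1 and 3 can be driven from a small client program; the sketch below uses Go's database/sql with the MySQL driver. The connection string, schema, index name, and the GLOBAL scope of tidb_enable_dist_task are assumptions for illustration; the sysbench load (step 2) and the PD leader/follower network partition (step 4) still have to be injected externally.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	// Assumed endpoint and schema of the TiDB instance under test.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Step 1: disable the distributed reorg framework (scope assumed GLOBAL here).
	if _, err := db.Exec("SET GLOBAL tidb_enable_dist_task = 'off'"); err != nil {
		log.Fatal(err)
	}

	// Step 3: add an index on the sysbench table while steps 2 and 4
	// (sysbench traffic and the PD network partition) run externally.
	if _, err := db.Exec("ALTER TABLE sbtest1 ADD INDEX index_test_1(c)"); err != nil {
		// With the partition in place, this is where
		// "Error 1105 (HY000): get member failed" surfaces.
		log.Fatal(err)
	}
}
```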

2. What did you expect to see? (Required)

The add index operation should succeed.

3. What did you see instead (Required)

The add index operation failed with error “Error 1105 (HY000): get member failed”.

add index failed at 2024-04-19 23:47:03: Error 1105 (HY000): get member failed

operatorLogs:
[2024-04-19 23:46:49] ###### start adding index ALTER TABLE sbtest1 ADD INDEX index_test_1713541609102(c)
[2024-04-19 23:46:49] ###### wait for ddl job finish

tidb log:
[2024/04/19 23:47:02.012 +08:00] [ERROR] [terror.go:324] ["encountered error"] [error="close tcp 10.233.126.115:4000->172.16.6.4:47418: use of closed network connection"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/workspace/source/tidb/parser/terror/terror.go:324\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/workspace/source/tidb/server/server.go:688"]
[2024/04/19 23:47:02.166 +08:00] [INFO] [pd_service_discovery.go:439] ["[pd] cannot update member from this address"] [address=http://tc-pd:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2024/04/19 23:47:02.878 +08:00] [INFO] [coprocessor.go:1288] ["[TIME_COP_PROCESS] resp_time:893.815869ms txnStartTS:18446744073709551615 region_id:46244 store_addr:tc-tikv-0.tc-tikv-peer.endless-ha-test-add-index-tps-7569302-1-197.svc:20160 kv_process_ms:878 kv_wait_ms:0 kv_read_ms:0 processed_versions:410870 total_versions:410871 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:410870 rocksdb_cache_hit_count:5 rocksdb_read_count:1500 rocksdb_read_byte:40060712"]
[2024/04/19 23:47:03.166 +08:00] [ERROR] [pd.go:307] ["fail to create pd client"] [error="[PD:client:ErrClientGetMember]get member failed"]
[2024/04/19 23:47:03.166 +08:00] [ERROR] [backend_mgr.go:96] ["[ddl-ingest] build ingest backend failed"] ["job ID"=473] [error="[Lightning:PD:ErrCreatePDClient]create pd client error: [PD:client:ErrClientGetMember]get member failed"]
[2024/04/19 23:47:03.166 +08:00] [WARN] [terror.go:242] ["Unknown error class"] [class=PD]
[2024/04/19 23:47:03.166 +08:00] [INFO] [ddl_worker.go:956] ["[ddl] DDL job is cancelled normally"] [worker="worker 2, tp add index"] [error="[Lightning:PD:ErrCreatePDClient]create pd client error: [PD:client:ErrClientGetMember]get member failed"]
[2024/04/19 23:47:03.168 +08:00] [INFO] [ddl_worker.go:603] ["[ddl] finish DDL job"] [worker="worker 2, tp add index"] [job="ID:473, Type:add index, State:cancelled, SchemaState:none, SchemaID:88, TableID:245, RowCount:0, ArgLen:6, start time: 2024-04-19 23:46:49.096 +0800 CST, Err:[PD:client:ErrClientGetMember]get member failed, ErrCount:1, SnapshotVersion:0, UniqueWarnings:0"]
[2024/04/19 23:47:03.185 +08:00] [INFO] [ddl.go:1181] ["[ddl] DDL job is failed"] [jobID=473]

4. What is your TiDB version? (Required)

./tidb-server -V
Release Version: v7.1.5
Edition: Community
Git Commit Hash: f4faa508e58826d8c44382ac429717290271e7e2
Git Branch: HEAD
UTC Build Time: 2024-04-19 10:49:04
GoVersion: go1.20.10
Race Enabled: false
TiKV Min Version: 6.2.0-alpha
Check Table Before Drop: false
Store: unistore
2024-04-20T07:08:24.815+0800

Lily2025 commented 7 months ago

/assign JmPotato
/severity major

JmPotato commented 1 week ago

To solve this problem, we either need to add retry logic to the initialization of the PD client within Lightning, or increase the retry effort inside the PD client's own GetMember initialization. The latter, however, raises awkward questions: should we keep retrying in every case, and what should the maximum retry time be? It does not look like a fundamental solution. If initializing the PD client is critical to the caller, I believe the retry policy should be decided by the upper-level caller rather than by the PD client itself.
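For illustration only, a caller-side approach could wrap PD client creation in a retry loop whose total budget is controlled by the caller's context rather than by the PD client. This is a minimal sketch of that idea; createPDClientWithRetry, pdClient, and newClient are hypothetical names standing in for whatever constructor the caller actually uses, not part of the TiDB, Lightning, or PD client codebases.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pdClient is a placeholder for the real PD client interface.
type pdClient interface {
	Close()
}

// createPDClientWithRetry retries a failing constructor with capped exponential
// backoff; the caller bounds the total retry time via the context it passes in.
func createPDClientWithRetry(
	ctx context.Context,
	newClient func(context.Context) (pdClient, error),
) (pdClient, error) {
	backoff := 500 * time.Millisecond
	const maxBackoff = 5 * time.Second

	var lastErr error
	for {
		cli, err := newClient(ctx)
		if err == nil {
			return cli, nil
		}
		lastErr = err

		select {
		case <-ctx.Done():
			// The caller decides the overall budget, e.g. context.WithTimeout.
			return nil, fmt.Errorf("create pd client: %w (last error: %v)", ctx.Err(), lastErr)
		case <-time.After(backoff):
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}
```

This keeps the PD client itself fail-fast while letting a caller such as the ingest backend choose how long it is willing to wait through a transient PD leader outage.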