pingcap / tidb

TiDB is an open-source, cloud-native, distributed, MySQL-Compatible database for elastic scale and real-time analytics. Try AI-powered Chat2Query free at : https://www.pingcap.com/tidb-serverless/
https://pingcap.com
Apache License 2.0
36.94k stars 5.81k forks source link

lightning import failed when inject pd leader network partition with error "tidb lightning encountered error: rpc error: code = Unavailable desc = not leader" #50774

Open Lily2025 opened 7 months ago

Lily2025 commented 7 months ago

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

1、run lightning 2、inject pd leader network partition

2. What did you expect to see? (Required)

lightning can success

3. What did you see instead (Required)

Run Lightning failed. cmd start at 2024-01-27 03:55:51 cmd failed at 2024-01-27 04:05:20 stdout: Verbose debug logs will be written to /tmp/lightning.log.2024-01-26T19.55.51Z

+----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | # | CHECK ITEM | TYPE | PASSED | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 1 | Source csv files size is proper | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 2 | the checkpoints are valid | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 3 | table schemas are valid | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 4 | all importing tables on the target are empty | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 5 | Cluster version check passed | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 6 | Lightning has the correct storage permission | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 7 | local disk resources are rich, estimate sorted data size 78.16GiB, local available is 3.399TiB | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 8 | Cluster available is rich, available is 5.315TiB, we need 234.5GiB | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 9 | TiKV stores (1, 4, 5, 6) contains more than 1000 empty regions respectively, which will greatly affect the import speed and succes | performance | false | | | s rate | | | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 10 | Cluster region distribution is balanced | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 11 | no CDC or PiTR task found | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+

{"level":"warn","ts":"2024-01-26T19:56:57.392Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T19:57:19.737Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cada40/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-6330128-1-657.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T19:57:23.419Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T19:57:29.423Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = Unknown desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T19:57:45.742Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cada40/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-6330128-1-657.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:00:57.340Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:01:21.793Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cada40/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-6330128-1-657.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:01:23.345Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:01:29.346Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:01:37.796Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cada40/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-6330128-1-657.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:05:03.832Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cada40/tc-pd-2.tc-pd-peer.ha-test-lightning-tps-6330128-1-657.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"warn","ts":"2024-01-26T20:05:10.420Z","logger":"etcd-client","caller":"v3@v3.5.2/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cbfc00/tc-pd:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} tidb lightning encountered error: rpc error: code = Unavailable desc = not leader

error is from here: rpc error: code = Unavailable desc = not leader github.com/tikv/pd/client.(*client).respForErr /go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230904040343-947701a32c05/client.go:1942 github.com/tikv/pd/client.(*client).UpdateServiceGCSafePoint /go/pkg/mod/github.com/tikv/pd/client@v0.0.0-20230904040343-947701a32c05/client.go:1664 github.com/pingcap/tidb/br/pkg/lightning/restore.(*gcTTLManager).doUpdateGCTTL /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/checksum.go:455 github.com/pingcap/tidb/br/pkg/lightning/restore.(*gcTTLManager).addOneJob /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/checksum.go:411 github.com/pingcap/tidb/br/pkg/lightning/restore.(*tikvChecksumManager).Checksum /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/checksum.go:341 github.com/pingcap/tidb/br/pkg/lightning/restore.DoChecksum /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/checksum.go:161 github.com/pingcap/tidb/br/pkg/lightning/restore.(*TableRestore).postProcess /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/table_restore.go:874 github.com/pingcap/tidb/br/pkg/lightning/restore.(*Controller).restoreTables.func7.1 /home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/restore/restore.go:1695 runtime.goexit /usr/local/go/src/runtime/asm_amd64.s:1594

4. What is your TiDB version? (Required)

./tidb-server -V Release Version: v6.5.8 Edition: Community Git Commit Hash: 44f39e98a29e19bc00f2d6c1a7ab1609a7a9e822 Git Branch: heads/refs/tags/v6.5.8 UTC Build Time: 2024-01-26 04:29:55 GoVersion: go1.19.13 Race Enabled: false TiKV Min Version: 6.2.0-alpha Check Table Before Drop: false Store: unistore 2024-01-26T19:59:32.387+0800

Lily2025 commented 7 months ago

/assign lance6716

Lily2025 commented 7 months ago

/type bug /severity major

Lily2025 commented 7 months ago

need lightning retry or pd client retry.

Lily2025 commented 6 months ago

lightning import failed when inject pd leader network partition on 2024.3.6

cmd start at 2024-03-06 03:13:31 cmd failed at 2024-03-06 03:25:29 stdout: Verbose debug logs will be written to /tmp/lightning.log.2024-03-05T19.13.31Z

+----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | # | CHECK ITEM | TYPE | PASSED | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 1 | Source data files size is proper | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 2 | the checkpoints are valid | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 3 | table schemas are valid | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 4 | Cluster version check passed | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 5 | Lightning has the correct storage permission | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 6 | local disk resources are rich, estimate sorted data size 58.08GiB, local available is 3.263TiB | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 7 | The storage space is rich, which TiKV/Tiflash is 5.396TiB/0B. The estimated storage space is 174.2GiB/0B. | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 8 | TiKV stores (12, 14, 13, 1) contains more than 1000 empty regions respectively, which will greatly affect the import speed and suc | performance | false | | | cess rate | | | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 9 | Cluster region distribution is balanced | performance | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+ | 10 | no CDC or PiTR task found | critical | true | +----+------------------------------------------------------------------------------------------------------------------------------------+-------------+--------+

{"level":"warn","ts":"2024-03-05T19:14:41.645821Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001988000/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"} {"level":"warn","ts":"2024-03-05T19:14:43.460519Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001b48540/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"info","ts":"2024-03-05T19:14:43.460598Z","logger":"etcd-client","caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"} {"level":"warn","ts":"2024-03-05T19:14:59.036818Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000dc2fc0/tc-pd:2379","attempt":0,"error":"rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"} {"level":"warn","ts":"2024-03-05T19:18:35.956798Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000dc2fc0/tc-pd:2379","attempt":0,"error":"rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"} {"level":"warn","ts":"2024-03-05T19:18:41.858456Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001988000/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"} {"level":"warn","ts":"2024-03-05T19:18:44.242512Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ec3880/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"info","ts":"2024-03-05T19:18:44.242595Z","logger":"etcd-client","caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"} {"level":"warn","ts":"2024-03-05T19:18:48.805663Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001b48540/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"info","ts":"2024-03-05T19:18:48.805745Z","logger":"etcd-client","caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"} {"level":"warn","ts":"2024-03-05T19:22:45.056322Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001988000/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"} {"level":"warn","ts":"2024-03-05T19:22:53.473885Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000dc2fc0/tc-pd:2379","attempt":0,"error":"rpc error: code = Unavailable desc = keepalive ping failed to receive ACK within timeout"} {"level":"warn","ts":"2024-03-05T19:22:53.920184Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001b48540/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"info","ts":"2024-03-05T19:22:53.920217Z","logger":"etcd-client","caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"} {"level":"warn","ts":"2024-03-05T19:23:19.368691Z","logger":"etcd-client","caller":"v3@v3.5.12/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000ec3880/tc-pd-0.tc-pd-peer.ha-test-lightning-tps-7170220-1-418.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"} {"level":"info","ts":"2024-03-05T19:23:19.368739Z","logger":"etcd-client","caller":"v3@v3.5.12/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"} tidb lightning encountered error: [Lightning:Restore:ErrRestoreTable]restore table location.GooglePoiSource failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster